TRANSCRIPT

CSE 446: Machine Learning
Linear classifiers: Overfitting and regularization
Emily Fox, University of Washington
January 25, 2017
©2017 Emily Fox

Logistic regression recap

Thus far, we focused on decision boundaries

[Figure: decision boundary in the (#awesome, #awful) plane; Score(x) > 0 on one side of the line, Score(x) < 0 on the other]
Score(xi) = w0 h0(xi) + w1 h1(xi) + … + wD hD(xi) = wTh(xi)

Relate Score(xi) to P(y=+1|x,ŵ)?

Compare and contrast regression models

• Linear regression with Gaussian errors
  yi = wTh(xi) + εi,   εi ~ N(0, σ^2)
  p(y|x,w) = N(y; wTh(x), σ^2)
• Logistic regression
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

[Figure: Gaussian density with mean µ = wTh(x) and spread σ for linear regression; a two-bar distribution over y ∈ {-1, +1} for logistic regression]

Understanding the logistic regression model

[Figure: sigmoid curve mapping Score(xi) to P(y=+1|xi,w)]

P(y=+1|xi,w) = sigmoid(Score(xi)) = 1 / (1 + e^(-wTh(xi)))

Predict most likely class

Input: x (sentence from review)
P̂(y|x) = estimate of class probabilities
If P̂(y=+1|x) > 0.5: ŷ = +1; else: ŷ = -1
Estimating P̂(y|x) improves interpretability: predict ŷ = +1 and tell me how sure you are.
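
To make the two steps concrete (compute P(y=+1|x,w) as the sigmoid of the score, then threshold at 0.5), here is a minimal NumPy sketch. The feature matrix H, the weights, and the function names are illustrative assumptions, not part of the lecture.

```python
import numpy as np

def predict_probability(H, w):
    """Estimate P(y=+1 | x, w) = sigmoid(Score(x)), with Score(x) = w^T h(x) for each row of H."""
    score = H @ w
    return 1.0 / (1.0 + np.exp(-score))

def predict_class(H, w):
    """Predict y_hat = +1 when P(y=+1 | x, w) > 0.5, else y_hat = -1."""
    return np.where(predict_probability(H, w) > 0.5, 1, -1)

# Toy example with h(x) = [1, #awesome, #awful]
H = np.array([[1.0, 2.0, 1.0],    # 2 "awesome" words, 1 "awful" word
              [1.0, 0.0, 3.0]])   # 0 "awesome" words, 3 "awful" words
w = np.array([0.0, 1.0, -1.5])
print(predict_probability(H, w))  # approx [0.62, 0.01]
print(predict_class(H, w))        # [ 1 -1]
```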

Gradient descent algorithm, cont’d

Step 5: Gradient over all data points

∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))

• Sum over data points
• hj(xi): feature value
• 1[yi = +1] - P(y=+1|xi,w): difference between truth and prediction
• The contribution is small when the prediction is already right: yi = +1 with P(y=+1|xi,w) ≈ 1, or yi = -1 with P(y=+1|xi,w) ≈ 0

Summary of gradient ascent for logistic regression
init w(1) = 0 (or randomly, or smartly), t = 1
while ||∇ℓ(w(t))|| > ε:
  for j = 0, …, D:
    partial[j] = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    wj(t+1) ← wj(t) + η partial[j]
  t ← t + 1
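
The loop above translates almost line for line into NumPy. This is a sketch under illustrative assumptions (features precomputed into a matrix H with a constant column for w0, labels in {-1, +1}, a fixed step size η), not the course's reference implementation.

```python
import numpy as np

def logistic_gradient_ascent(H, y, step_size=0.1, epsilon=1e-4, max_iter=10000):
    """Maximize the logistic log-likelihood by gradient ascent.

    H: (N, D+1) matrix whose rows are h(x_i); y: (N,) labels in {-1, +1}.
    """
    w = np.zeros(H.shape[1])                       # init w(1) = 0
    indicator = (y == 1).astype(float)             # 1[y_i = +1]
    for _ in range(max_iter):
        prob = 1.0 / (1.0 + np.exp(-(H @ w)))      # P(y=+1 | x_i, w(t))
        partial = H.T @ (indicator - prob)         # partial[j] = sum_i h_j(x_i)(1[y_i=+1] - P(y=+1|x_i,w))
        if np.linalg.norm(partial) < epsilon:      # stop when ||grad l(w(t))|| <= epsilon
            return w
        w = w + step_size * partial                # w_j(t+1) = w_j(t) + eta * partial[j]
    return w
```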

Overfitting in classification

Learned decision boundary

Feature | Value | Coefficient learned
h0(x) | 1 | 0.23
h1(x) | x[1] | 1.12
h2(x) | x[2] | -1.07

Quadratic features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 1.68
h1(x) | x[1] | 1.39
h2(x) | x[2] | -0.59
h3(x) | (x[1])^2 | -0.17
h4(x) | (x[2])^2 | -0.96

Degree 6 features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 21.6
h1(x) | x[1] | 5.3
h2(x) | x[2] | -42.7
h3(x) | (x[1])^2 | -15.9
h4(x) | (x[2])^2 | -48.6
h5(x) | (x[1])^3 | -11.0
h6(x) | (x[2])^3 | 67.0
h7(x) | (x[1])^4 | 1.5
h8(x) | (x[2])^4 | 48.0
h9(x) | (x[1])^5 | 4.4
h10(x) | (x[2])^5 | -14.2
h11(x) | (x[1])^6 | 0.8
h12(x) | (x[2])^6 | -8.6

Coefficient values getting large

Degree 20 features (in 2d)

Note: we are not including cross terms for simplicity

Feature | Value | Coefficient learned
h0(x) | 1 | 8.7
h1(x) | x[1] | 5.1
h2(x) | x[2] | 78.7
… | … | …
h11(x) | (x[1])^6 | -7.5
h12(x) | (x[2])^6 | 3803
h13(x) | (x[1])^7 | 21.1
h14(x) | (x[2])^7 | -2406
… | … | …
h37(x) | (x[1])^19 | -2*10^-6
h38(x) | (x[2])^19 | -0.15
h39(x) | (x[1])^20 | -2*10^-8
h40(x) | (x[2])^20 | 0.03

Often, overfitting is associated with very large estimated coefficients ŵ

Overfitting in classification

[Figure: error (classification error) plotted against model complexity]

Overfitting if there exists w* such that:
• training_error(w*) > training_error(ŵ)
• true_error(w*) < true_error(ŵ)

Overfitting in classifiers ⇒ overconfident predictions

Logistic regression model

P(y=+1|xi,w) = sigmoid(wTh(xi))

[Figure: sigmoid curve mapping wTh(xi), ranging from -∞ to +∞, onto probabilities between 0.0 and 1.0, passing through 0.5]

The subtle (negative) consequence of overfitting in logistic regression

Overfitting ⇒ large coefficient values
⇒ ŵTh(xi) is very positive (or very negative)
⇒ sigmoid(ŵTh(xi)) goes to 1 (or to 0)
⇒ the model becomes extremely overconfident of predictions

Effect of coefficients on logistic regression model

[Figure: P(y=+1|x,w) plotted against #awesome - #awful for three coefficient settings:
 (w0 = 0, w#awesome = +1, w#awful = -1),
 (w0 = 0, w#awesome = +2, w#awful = -2),
 (w0 = 0, w#awesome = +6, w#awful = -6);
 larger coefficients make the sigmoid steeper, so the same input gets a more extreme probability]

Input x: #awesome = 2, #awful = 1

Learned probabilities

Feature | Value | Coefficient learned
h0(x) | 1 | 0.23
h1(x) | x[1] | 1.12
h2(x) | x[2] | -1.07

Quadratic features: Learned probabilities

Feature | Value | Coefficient learned
h0(x) | 1 | 1.68
h1(x) | x[1] | 1.39
h2(x) | x[2] | -0.58
h3(x) | (x[1])^2 | -0.17
h4(x) | (x[2])^2 | -0.96

Overfitting ⇒ overconfident predictions

[Figures: learned probabilities for degree 6 and degree 20 features; tiny uncertainty regions]

Overfitting & overconfident about it!!!
We are sure we are right, when we are surely wrong!

Overfitting in logistic regression: Another perspective

Linearly-separable data

Data are linearly separable if there exist coefficients ŵ such that:
• ŵTh(xi) > 0 for all positive training data
• ŵTh(xi) < 0 for all negative training data

Note 1: If you are using D features, linear separability happens in a D-dimensional space.
Note 2: If you have enough features, data are (almost) always linearly separable, and then training_error(ŵ) = 0.

Effect of linear separability on coefficients

[Figure: decision boundary in the plane x[1] = #awesome vs. x[2] = #awful; Score(x) > 0 on one side, Score(x) < 0 on the other]

• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5*10^9

Effect of linear separability on weights

[Figure: same plot, highlighting the point x[1] = 2, x[2] = 1]

• Data are linearly separable with ŵ1 = 1.0 and ŵ2 = -1.5
• Data are also linearly separable with ŵ1 = 10 and ŵ2 = -15
• Data are also linearly separable with ŵ1 = 10^9 and ŵ2 = -1.5*10^9

As the coefficients are scaled up, P(y=+1|xi,ŵ) for this point is pushed toward 1. Maximum likelihood estimation (MLE) prefers the most certain model, so coefficients go to infinity for linearly-separable data!!!
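
A quick numerical check of this effect (the toy data below is made up for illustration; the coefficient settings match the slide): rescaling a separating coefficient vector pushes P(y=+1|x,ŵ) for the example point toward 1 and keeps increasing the log-likelihood, which is why MLE drives coefficients to infinity on separable data.

```python
import numpy as np

def log_likelihood(H, y, w):
    """Sum of ln P(y_i | x_i, w) for logistic regression with labels in {-1, +1}."""
    margins = y * (H @ w)                       # y_i * w^T h(x_i)
    return -np.sum(np.log1p(np.exp(-margins)))

# Linearly separable toy data with h(x) = [#awesome, #awful]
H = np.array([[3.0, 1.0], [2.0, 0.0], [1.0, 2.0], [0.0, 3.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
x = np.array([2.0, 1.0])                        # the example point: #awesome=2, #awful=1
w = np.array([1.0, -1.5])                       # one separating choice of (w1, w2)

for c in [1.0, 10.0, 1e9]:                      # w, 10*w, 10^9 * w all separate the data
    p = 1.0 / (1.0 + np.exp(-x @ (c * w)))      # P(y=+1 | x, c*w) approaches 1 as c grows
    print(c, p, log_likelihood(H, y, c * w))    # log-likelihood keeps increasing toward 0
```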

Overfitting in logistic regression is “twice as bad”

Learning tries to find a decision boundary that separates the data:
• Overly complex boundary
• If data are linearly separable, coefficients go to infinity! (e.g., ŵ1 = 10^9, ŵ2 = -1.5*10^9)

Penalizing large coefficients to mitigate overfitting

Desired total cost format

Want to balance:
i. How well the function fits the data
ii. Magnitude of the coefficients

Total quality = measure of fit - measure of magnitude of coefficients
• Measure of fit (data likelihood): large # = good fit to training data
• Measure of magnitude of coefficients: large # = overfit
• We want to balance the two

Consider resulting objective

Select ŵ to maximize:   ℓ(w) - λ ||w||_2^2
(L2 regularized logistic regression; λ is a tuning parameter balancing fit and magnitude; a small code sketch of this objective follows below)

Pick λ using:
• Validation set (for large datasets)
• Cross-validation (for smaller datasets)
(as in ridge/lasso regression)
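
A small sketch of the resulting objective, computing "measure of fit minus λ times measure of magnitude" for a candidate w; the function name and data handling are illustrative assumptions.

```python
import numpy as np

def l2_regularized_quality(H, y, w, lam):
    """Total quality = log-likelihood (measure of fit) - lambda * ||w||_2^2 (measure of magnitude)."""
    margins = y * (H @ w)
    log_likelihood = -np.sum(np.log1p(np.exp(-margins)))   # large = good fit to training data
    magnitude = np.sum(w ** 2)                              # large = likely overfit
    return log_likelihood - lam * magnitude
```

A larger λ trades fit for smaller coefficients; as the slide notes, λ itself is chosen with a validation set or by cross-validation.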

Regularization: degree 20 features, effect of regularization penalty λ

λ | 0 | 0.00001 | 0.001 | 1 | 10
Range of coefficients | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57 | -0.05 to 0.22
Decision boundary | [figure for each λ]

Coefficient path

[Figure: coefficients ŵj plotted against λ]

Regularization: degree 20 features, regularization reduces “overconfidence”

λ | 0 | 0.00001 | 0.001 | 1
Range of coefficients | -3170 to 3803 | -8.04 to 12.14 | -0.70 to 1.25 | -0.13 to 0.57
Learned probabilities | [figure for each λ]

Finding best L2 regularized linear classifier with gradient ascent

Understanding the contribution of L2 regularization

The term the L2 penalty adds to the gradient is -2λwj. Impact on wj:
• If wj > 0, then -2λwj < 0, pushing wj down toward 0
• If wj < 0, then -2λwj > 0, pushing wj up toward 0

Summary of gradient ascent for logistic regression + L2 regularization
init w(1) = 0 (or randomly, or smartly), t = 1
while not converged:
  for j = 0, …, D:
    partial[j] = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    wj(t+1) ← wj(t) + η (partial[j] - 2λ wj(t))
  t ← t + 1
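
Relative to plain gradient ascent, the only change is subtracting 2λ·wj(t) inside each update. A sketch under the same illustrative assumptions as before, not the course's reference code:

```python
import numpy as np

def l2_logistic_gradient_ascent(H, y, lam, step_size=0.1, n_iter=1000):
    """Gradient ascent on the L2-regularized log-likelihood  l(w) - lambda * ||w||_2^2."""
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_iter):                              # "while not converged", approximated by a fixed budget
        prob = 1.0 / (1.0 + np.exp(-(H @ w)))
        partial = H.T @ (indicator - prob)               # gradient of the log-likelihood
        w = w + step_size * (partial - 2.0 * lam * w)    # w_j(t+1) = w_j(t) + eta * (partial[j] - 2*lambda*w_j(t))
    return w
```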

Summary of overfitting in logistic regression

What you can do now…
• Identify when overfitting is happening
• Relate large learned coefficients to overfitting
• Describe the impact of overfitting on decision boundaries and predicted probabilities of linear classifiers
• Motivate the form of the L2 regularized logistic regression quality metric
• Describe what happens to estimated coefficients as the tuning parameter λ is varied
• Interpret a coefficient path plot
• Estimate L2 regularized logistic regression coefficients using gradient ascent
• Describe the use of L1 regularization to obtain sparse logistic regression solutions

Linear classifiers: Scaling up learning via SGD
CSE 446: Machine Learning, Emily Fox, University of Washington, January 25, 2017

Stochastic gradient descent: Learning, one data point at a time

Why gradient ascent is slow…

[Figure: starting from w(t), computing the gradient ∇ℓ(w(t)) requires a pass over all the data before the coefficients can be updated to w(t+1); another full pass is needed for ∇ℓ(w(t+1)) and the update to w(t+2)]

Every update requires a full pass over the data.

More formally: How expensive is gradient ascent?

∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))
• The sum runs over all data points
• Each term is the contribution of data point (xi, yi) to the gradient

Every step requires touching every data point!!!

The gradient sums over all data points, so one step of gradient ascent costs N times the per-point cost:

Time to compute contribution of xi,yi | # of data points (N) | Total time to compute 1 step of gradient ascent
1 millisecond | 1,000 | 1 second
1 second | 1,000 | 16.7 minutes
1 millisecond | 10 million | 2.8 hours
1 millisecond | 10 billion | 115.7 days

Stochastic gradient ascent

[Figure: from w(t), compute a gradient using only a small subset of the data, update to w(t+1), then repeat with different subsets for w(t+2), w(t+3), w(t+4), …]

Many updates for each pass over the data.

Example: Instead of all data points for the gradient, use 1 data point only???

Gradient ascent: the derivative sums over all data points,
  ∂ℓ(w)/∂wj = Σ_i hj(xi) (1[yi = +1] - P(y=+1|xi,w))
Stochastic gradient ascent: approximate the sum with a single data point i,
  ∂ℓ(w)/∂wj ≈ hj(xi) (1[yi = +1] - P(y=+1|xi,w)),   each time picking a different data point i

Stochastic gradient ascent for logistic regression

init w(1) = 0, t = 1
until converged:
  for i = 1, …, N:   (each time, pick a different data point i, instead of summing over all data points)
    for j = 0, …, D:
      partial[j] = hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
      wj(t+1) ← wj(t) + η partial[j]
    t ← t + 1
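
A minimal sketch of this loop: one coefficient update per data point, with a different point each time. Shuffling the order every pass and the fixed step size are illustrative choices, not prescribed by the slide.

```python
import numpy as np

def stochastic_gradient_ascent(H, y, step_size=0.01, n_passes=10, seed=0):
    """Logistic regression by stochastic gradient ascent: one data point per update."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):                         # "until converged", approximated by a fixed number of passes
        for i in rng.permutation(N):                  # each time, pick a different data point i
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            partial = H[i] * (indicator[i] - prob_i)  # contribution of (x_i, y_i) only
            w = w + step_size * partial
    return w
```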

Stochastic gradient for the L2-regularized objective

What about the regularization term?

Total derivative with respect to wj = partial[j] - 2λwj, where partial[j] sums over all data points.

Stochastic gradient ascent (each time, pick a different data point i):
Total derivative ≈ partial[j] - 2λwj/N, with partial[j] computed from data point i only.

Each data point contributes 1/N of the regularization.
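
The per-point update then subtracts (2λ/N)·wj instead of the full 2λ·wj, so the regularization is spread across the N data points. A sketch, using the same illustrative setup as the previous block:

```python
import numpy as np

def l2_stochastic_gradient_ascent(H, y, lam, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent for the L2-regularized logistic regression objective."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            partial = H[i] * (indicator[i] - prob_i)
            w = w + step_size * (partial - (2.0 * lam / N) * w)   # each point carries 1/N of the -2*lambda*w_j term
    return w
```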

Comparing computational time per step

Time to compute contribution of xi,yi | # of data points (N) | Total time for 1 step of gradient | Total time for 1 step of stochastic gradient
1 millisecond | 1,000 | 1 second | 1 millisecond
1 second | 1,000 | 16.7 minutes | 1 second
1 millisecond | 10 million | 2.8 hours | 1 millisecond
1 millisecond | 10 billion | 115.7 days | 1 millisecond

Gradient ascent (sum over all data points) ≈ stochastic gradient ascent (one data point per step).

Which one is better??? Depends…

Algorithm | Time per iteration | Total time to convergence for large data (in theory) | Total time to convergence for large data (in practice) | Sensitivity to parameters
Gradient | Slow for large data | Slower | Often slower | Moderate
Stochastic gradient | Always fast | Faster | Often faster | Very high

Summary of stochastic gradient
• Tiny change to gradient ascent
• Much better scalability
• Huge impact in the real world
• Very tricky to get right in practice

Why would stochastic gradient ever work???

Gradient is direction of steepest ascent

The gradient is the “best” direction, but any direction that goes “up” would be useful.

In ML, steepest direction is sum of “little directions” from each data point

The gradient is a sum over data points; for most data points, the contribution points “up”.

Stochastic gradient: Pick a data point and move in its direction

Most of the time, the total likelihood will increase.

Stochastic gradient ascent: most iterations increase the likelihood, but some decrease it; on average, we make progress.

until converged:
  for i = 1, …, N:
    for j = 0, …, D:
      wj(t+1) ← wj(t) + η hj(xi) (1[yi = +1] - P(y=+1|xi,w(t)))
    t ← t + 1

Convergence path

Convergence paths

[Figure: convergence paths of gradient and stochastic gradient]

Stochastic gradient convergence is “noisy”

[Figure: average log likelihood (higher is better) vs. total time, proportional to # passes over the data]

• Gradient usually increases the likelihood smoothly
• Stochastic gradient makes “noisy” progress
• Stochastic gradient achieves higher likelihood sooner, but it’s noisier

Eventually, gradient catches up

[Figure: average log likelihood (higher is better) over time for gradient and stochastic gradient]

Note: you should only trust the “average” quality of stochastic gradient.

The last coefficients may be really good or really bad!!

Stochastic gradient will eventually oscillate around a solution; for example, w(1000) was bad while w(1005) was good. How do we minimize the risk of picking bad coefficients? Minimize the noise: don’t return the last learned coefficients; output the average instead:

ŵ = (1/T) Σ_t w(t)
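
A sketch of that fix: run the same stochastic updates, but keep a running sum of the iterates and return their average ŵ = (1/T) Σ_t w(t) instead of the last w(t). Variable names are illustrative.

```python
import numpy as np

def stochastic_gradient_ascent_averaged(H, y, step_size=0.01, n_passes=10, seed=0):
    """Stochastic gradient ascent that outputs the average of all iterates w(t)."""
    rng = np.random.default_rng(seed)
    N = H.shape[0]
    w = np.zeros(H.shape[1])
    w_sum = np.zeros(H.shape[1])
    T = 0
    indicator = (y == 1).astype(float)
    for _ in range(n_passes):
        for i in rng.permutation(N):
            prob_i = 1.0 / (1.0 + np.exp(-(H[i] @ w)))
            w = w + step_size * H[i] * (indicator[i] - prob_i)
            w_sum += w                      # accumulate the iterates
            T += 1
    return w_sum / T                        # w_hat = (1/T) * sum_t w(t)
```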

Summary of why stochastic gradient works
• The gradient finds the direction of steepest ascent
• The gradient is a sum of contributions from each data point
• Stochastic gradient uses the direction from 1 data point
• On average it increases the likelihood, though it sometimes decreases it
• Stochastic gradient has “noisy” convergence

Online learning: Fitting models from streaming data

Batch vs online learning

Batch learning
• All data is available at the start of training time

Online learning
• Data arrives (streams in) over time
• Must train the model as the data arrives!

[Figure: batch learning feeds all data into the ML algorithm once, producing ŵ; online learning feeds data arriving at t = 1, 2, 3, 4, … into the ML algorithm, producing ŵ(1), ŵ(2), ŵ(3), ŵ(4), …]

Online learning example: Ad targeting

[Figure: the input xt (user info, page text) goes to the ML algorithm with current coefficients ŵ(t), which outputs ŷ = suggested ads (Ad1, Ad2, Ad3); the user clicks on Ad2, so yt = Ad2, and the coefficients are updated to ŵ(t+1)]

Online learning problem
• Data arrives over each time step t:
  - Observe input xt (info about the user, text of the webpage)
  - Make a prediction ŷt (which ad to show)
  - Observe the true output yt (which ad the user clicked on)

Need an ML algorithm to update the coefficients each time step!

Stochastic gradient ascent can be used for online learning!!!

• init w(1) = 0, t = 1
• Each time step t:
  - Observe input xt
  - Make a prediction ŷt
  - Observe the true output yt
  - Update coefficients:
    for j = 0, …, D:
      wj(t+1) ← wj(t) + η hj(xt) (1[yt = +1] - P(y=+1|xt,w(t)))
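
A sketch of the online version: each arriving (xt, yt) triggers exactly one stochastic-gradient update of the coefficients. The simulated stream, the feature vectors, and labels in {-1, +1} are illustrative assumptions (in the ad example, yt would instead encode which ad was clicked).

```python
import numpy as np

def online_update(w, h_t, y_t, step_size=0.01):
    """One online step: after observing h(x_t) and the true label y_t in {-1, +1}, update w."""
    prob = 1.0 / (1.0 + np.exp(-(h_t @ w)))            # P(y=+1 | x_t, w(t))
    return w + step_size * h_t * (float(y_t == 1) - prob)

# Simulated stream of (h(x_t), y_t) pairs
stream = [(np.array([1.0, 2.0, 1.0]), 1),
          (np.array([1.0, 0.0, 3.0]), -1)]

w = np.zeros(3)                                        # init w(1) = 0
for h_t, y_t in stream:
    y_hat = 1 if h_t @ w > 0 else -1                   # make a prediction y_hat_t
    w = online_update(w, h_t, y_t)                     # observe y_t, then update coefficients
```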

Summary of online learning
• Data arrives over time
• Must make a prediction every time a new data point arrives
• Observe the true class after the prediction is made
• Want to update the parameters immediately

Summary of stochastic gradient descent
What you can do now…
• Significantly speed up a learning algorithm using stochastic gradient
• Describe the intuition behind why stochastic gradient works
• Apply stochastic gradient in practice
• Describe online learning problems
• Relate stochastic gradient to online learning

Generalized linear models

Models for data

• Linear regression with Gaussian errors
  yi = wTh(xi) + εi,   εi ~ N(0, σ^2)
  p(y|x,w) = N(y; wTh(x), σ^2)
• Logistic regression
  P(y=+1|x,w) = 1 / (1 + e^(-wTh(x)))
  P(y=-1|x,w) = e^(-wTh(x)) / (1 + e^(-wTh(x)))

Examples of “generalized linear models”

A generalized linear model has E[y | x, w] = g(wTh(x)), where g is the link function.

• Linear regression: mean = wTh(x); link function g = identity function
• Logistic regression (assuming y in {0,1} instead of {-1,1} or some other set of values): mean = P(y=+1|x,w) (the mean is different if y is not in {0,1}); link function g = sigmoid function (needs slight modification if y is not in {0,1})

Stochastic gradient descent more formally

Learning problems as expectations

• Minimizing loss in training data:
  - Given a dataset, sampled i.i.d. from some distribution p(x) on features
  - A loss function, e.g., hinge loss, logistic loss, …
  - We often minimize the loss in the training data: the average of the per-point losses, (1/N) Σ_i L(w; xi, yi)
• However, we should really minimize the expected loss on all data, an integral over the data distribution
• So, we are approximating the integral by the average over the training data

Gradient ascent in terms of expectations

• “True” objective function: the expected objective over the data distribution, E[ℓ(w; x, y)]
• Taking the gradient: ∇ E[ℓ(w; x, y)] = E[∇ℓ(w; x, y)]
• “True” gradient ascent rule: w(t+1) ← w(t) + η E[∇ℓ(w(t); x, y)]
• How do we estimate the expected gradient?

SGD: Stochastic gradient ascent (or descent)

• “True” gradient: the expected gradient E[∇ℓ(w; x, y)]
• Sample-based approximation: the average of the per-point gradients ∇ℓ(w; xi, yi) over the training samples
• What if we estimate the gradient with just one sample???
  - Unbiased estimate of the gradient
  - Very noisy!
  - Called stochastic gradient ascent (or descent), among many other names
  - VERY useful in practice!!!
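
A small simulation of the “unbiased but very noisy” claim (all data below is synthetic and purely illustrative): the average of many one-sample gradient estimates matches the full-data average gradient, while any single estimate is far more variable.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D = 1000, 3
H = rng.normal(size=(N, D))
true_w = np.array([1.0, -2.0, 0.5])
y = np.where(H @ true_w + rng.normal(size=N) > 0, 1.0, -1.0)

w = np.array([0.5, -0.5, 0.0])                      # an arbitrary point at which to evaluate gradients
indicator = (y == 1).astype(float)
prob = 1.0 / (1.0 + np.exp(-(H @ w)))

full_gradient = H.T @ (indicator - prob) / N        # average gradient over the whole training set

# One-sample estimates: unbiased on average, very noisy individually
idx = rng.integers(0, N, size=20000)
one_sample = H[idx] * (indicator[idx] - prob[idx])[:, None]
print(full_gradient)
print(one_sample.mean(axis=0))                      # close to full_gradient
print(one_sample.std(axis=0))                       # much larger spread for a single estimate
```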