
Page 1

Applied Machine Learning
Lecture 5-1: Optimization in machine learning

Selpi (selpi@chalmers.se)

The slides are a further development of Richard Johansson's slides.

February 7, 2020

Page 2

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 3

Objective functions for machine learning

▶ in this lecture, we'll discuss objective functions for our models

▶ training a model = minimizing (or maximizing) this objective

▶ often, the objective consists of a combination of
  ▶ a loss function that measures how well the model fits the training data
  ▶ a regularizer that measures how simple/complex the model is

Page 4

minimizing squared errors

▶ in the least squares approach to linear regression, we want to minimize the mean (or sum) of squared errors

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2


▶ From calculus: if a "nice" function has a maximum or minimum, then the derivative will be zero there
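To make the loss concrete, here is a minimal NumPy sketch of this mean squared error (the function name mse_loss and the toy data are my own, not from the slides):

    import numpy as np

    def mse_loss(w, X, y):
        """Mean squared error of the linear model w on data (X, y)."""
        errors = X @ w - y            # predictions minus targets
        return np.mean(errors ** 2)

    # toy data: 5 instances with 2 features each
    X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 1.0], [1.0, 1.0]])
    y = np.array([3.0, 3.0, 1.0, 4.0, 2.0])
    print(mse_loss(np.zeros(2), X, y))    # loss of the all-zero weight vector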

Page 5

the gradient

▶ the multidimensional equivalent of the derivative is called the gradient

▶ if f is a function of n variables, then the gradient is an n-dimensional vector, often written ∇f(x)

▶ intuition: the gradient points in the uphill direction

▶ again: the gradient is zero if we have an optimum

▶ this intuition leads to a simple idea for finding the minimum:
  ▶ take a small step in the direction opposite to the gradient
  ▶ repeat until the gradient is close enough to zero

Page 6

Gradient descent, pseudocode

▶ the same thing again, in pseudocode:

  1. set x to some initial value, and select a suitable step size η
  2. compute the gradient ∇f(x)
  3. if ∇f(x) is small enough, we are done
  4. otherwise, subtract η · ∇f(x) from x and go back to step 2

▶ conversely, to find the maximum we can do gradient ascent: then we instead add η · ∇f(x) to x
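A minimal Python sketch of this pseudocode, applied to a toy quadratic function of my own choosing (not from the slides):

    import numpy as np

    def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=1000):
        """Gradient descent following the four pseudocode steps above."""
        x = np.array(x0, dtype=float)        # step 1: initial value
        for _ in range(max_iter):
            g = grad_f(x)                    # step 2: compute the gradient
            if np.linalg.norm(g) < tol:      # step 3: gradient small enough -> done
                break
            x = x - eta * g                  # step 4: take a step downhill
        return x

    # toy example: f(x) = (x1 - 3)^2 + (x2 + 1)^2, with gradient 2 * (x - (3, -1))
    grad = lambda x: 2 * (x - np.array([3.0, -1.0]))
    print(gradient_descent(grad, x0=[0.0, 0.0]))   # approaches [3, -1]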

Page 7

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 8

stochastic gradient descent (SGD)

▶ in machine learning, our objective functions are often defined as sums over the whole training set, such as the least squares loss function

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2

▶ if N is large, it would take time to use gradient descent to minimize the loss function
  ▶ because gradient descent takes into account all the instances in the training set

▶ stochastic gradient descent: simplify the computation by computing the gradient using just a small part
  ▶ in the extreme case, a single training instance
  ▶ or minibatch: a few instances
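A small sketch of what "using just a small part" can look like for the least squares loss, assuming NumPy arrays X (N×d) and y (length N); the function name and batch size are my own:

    import numpy as np

    def minibatch_gradient(w, X, y, batch_size=8, rng=None):
        """Least squares gradient estimated on a random minibatch."""
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(y), size=batch_size, replace=False)   # pick a few instances
        Xb, yb = X[idx], y[idx]
        errors = Xb @ w - yb
        # gradient of (1/B) * sum over the minibatch of (w · x - y)^2
        return (2.0 / batch_size) * (Xb.T @ errors)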

Page 9

SGD: pseudocode

1. set w to some initial value, and select a suitable step size η

2. select a single training instance x

3. compute the gradient ∇f(w) using x only

4. if we are "done", stop

5. otherwise, subtract η · ∇f(w) from w and go back to step 2

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      compute the gradient ∇f_i(w) for the current instance (x_i, y_i)
      w = w − η · ∇f_i(w)
  return w


Page 11

when to terminate SGD?

▶ simple solution: fixed number of iterations

▶ or stop when we've seen no improvement for some time

▶ or evaluate on a held-out set: early stopping
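A rough sketch of the early-stopping idea, assuming we can run one training pass and compute a held-out loss; step and val_loss are placeholder callables supplied by the caller, not anything defined in the slides:

    def train_with_early_stopping(w, step, val_loss, max_epochs=100, patience=5):
        """Stop when the held-out loss has not improved for `patience` epochs."""
        best_w, best_loss = w, float("inf")
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            w = step(w)                    # one SGD pass over the training data
            loss = val_loss(w)             # loss on the held-out set
            if loss < best_loss:
                best_w, best_loss = w, loss
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                  # early stopping
        return best_w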

Page 12

applying SGD to the least squares loss

▶ the least squares loss function:

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2

▶ let's consider just a single instance:

  f_i(w) = (w \cdot x_i - y_i)^2

▶ for this instance, the gradient of the least squares loss with respect to w is

  \nabla f_i(w) = 2 \cdot (w \cdot x_i - y_i) \cdot x_i
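A quick numerical sanity check of this gradient formula against finite differences, on an arbitrary instance of my own (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=3)
    x_i = rng.normal(size=3)
    y_i = 1.5

    f_i = lambda w: (w @ x_i - y_i) ** 2        # single-instance squared error
    analytic = 2 * (w @ x_i - y_i) * x_i        # the gradient formula above

    # central finite differences, one coordinate at a time
    eps = 1e-6
    numeric = np.array([(f_i(w + eps * e) - f_i(w - eps * e)) / (2 * eps)
                        for e in np.eye(3)])
    print(np.allclose(analytic, numeric, atol=1e-5))   # should print True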

Page 13

plugging it back into SGD ⇒ Widrow-Hoff again!

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      error = w · x_i − y_i
      ∇f_i(w) = 2 · error · x_i
      w = w − η · ∇f_i(w)
  return w
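A runnable NumPy version of this loop on synthetic data (the data, learning rate, and number of epochs are my own choices, not from the slides):

    import numpy as np

    def sgd_least_squares(X, y, eta=0.01, n_epochs=20):
        """SGD for least squares linear regression (the Widrow-Hoff rule)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                error = w @ x_i - y_i
                w = w - eta * 2 * error * x_i    # w = w − η · ∇f_i(w)
        return w

    # synthetic data generated from known weights, plus a little noise
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(200, 3))
    y = X @ true_w + 0.1 * rng.normal(size=200)
    print(sgd_least_squares(X, y))    # should come out close to [2.0, -3.0, 0.5]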

Page 14

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 15

the Python machine learning ecosystem (selection)

Page 16

finding and installing PyTorch

▶ easy to install using conda

▶ https://pytorch.org/

Page 17

what is the purpose of PyTorch?

▶ at a high level: a library for prototyping ML models

▶ at a low level: NumPy-like, with automatic gradients

▶ easy to move computations to a GPU (if available)

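To illustrate "NumPy-like with automatic gradients", here is a minimal PyTorch sketch that recomputes the single-instance least squares gradient from page 12 with autograd (the numbers are arbitrary):

    import torch

    # a weight vector with gradient tracking enabled
    w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
    x_i = torch.tensor([2.0, 1.0, 3.0])
    y_i = torch.tensor(4.0)

    loss = (w @ x_i - y_i) ** 2    # single-instance squared error
    loss.backward()                # autograd computes d(loss)/dw

    print(w.grad)                               # gradient from autograd
    print(2 * (w @ x_i - y_i).item() * x_i)     # the same, from the handwritten formula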

Page 18

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 19

recall: the fundamental tradeoff in machine learning

▶ goodness of fit: the learned model should describe the examples in the training data

▶ regularization: the model should be simple

▶ the least squares loss just takes care of the first part!

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2


Page 20

what does it mean to "keep the model simple"?

▶ concretely, how can we add regularization to the linear regression model?

▶ the most common approach is to add a term that keeps the weights small

▶ for instance, by penalizing the squared length (norm) of the weight vector:

  \|w\|^2 = w_1 \cdot w_1 + \ldots + w_n \cdot w_n = w \cdot w

▶ this is called an L2 regularizer (or an L2 penalty)

Page 21

combining the pieces

▶ we combine the loss and the regularizer:

  \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}(w, x_i, y_i) + \alpha \cdot \mathrm{Regularizer}(w)
    = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ in this formula, α is a "tweaking" parameter that controls the tradeoff between loss and regularization
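As a concrete version of this formula, a small NumPy sketch (the function name regularized_objective and the toy values are my own):

    import numpy as np

    def regularized_objective(w, X, y, alpha):
        """Mean squared error plus an L2 penalty weighted by alpha."""
        mse = np.mean((X @ w - y) ** 2)    # the loss term
        penalty = alpha * (w @ w)          # the regularizer ‖w‖²
        return mse + penalty

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])
    w = np.array([1.0, 2.0])
    print(regularized_objective(w, X, y, alpha=0.1))   # 0.0 loss + 0.1 · 5 = 0.5

With alpha = 0 this reduces to the plain least squares objective.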

Page 22

ridge regression

▶ the combination of the least squares loss with L2 regularization is called ridge regression

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ in scikit-learn: sklearn.linear_model.Ridge

▶ (unregularized least squares: sklearn.linear_model.LinearRegression)
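A short scikit-learn sketch comparing the two estimators; the synthetic data and the value of alpha are arbitrary:

    import numpy as np
    from sklearn.linear_model import Ridge, LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)     # L2-regularized least squares
    ols = LinearRegression().fit(X, y)     # unregularized least squares

    print(ridge.coef_)   # slightly shrunk towards zero
    print(ols.coef_)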

Page 23

discussion

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ what happens if we set α to 0?

▶ what happens if we set α to a huge number?

Page 25

SGD for the ridge regression model

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ the gradient with respect to a single training instance is

  \nabla f_i(w) = 2 \cdot \mathrm{error} \cdot x_i + 2 \cdot \alpha \cdot w

  where error = w · x_i − y_i; in the pseudocode below, the constant factor 2 is folded into the step size η

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      g = w · x_i
      error = g − y_i
      w = w − η · (error · x_i + α · w)
  return w
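A runnable NumPy version of this loop; the synthetic data, alpha, learning rate, and epoch count are my own choices:

    import numpy as np

    def sgd_ridge(X, y, alpha=0.1, eta=0.01, n_epochs=20):
        """SGD for ridge regression, with the factor 2 folded into the step size."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                error = w @ x_i - y_i
                w = w - eta * (error * x_i + alpha * w)   # loss gradient + L2 penalty
        return w

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(200, 3))
    y = X @ true_w + 0.1 * rng.normal(size=200)
    print(sgd_ridge(X, y))    # close to true_w, but slightly shrunk towards zero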

Page 26

the Lasso and ElasticNet models

▶ another common regularizer is the L1 norm

  \|w\|_1 = |w_1| + \ldots + |w_n|

▶ the L1 norm gives the Lasso model

▶ a combination of L1 and L2 gives the ElasticNet model

▶ in scikit-learn:
  ▶ sklearn.linear_model.Lasso
  ▶ sklearn.linear_model.ElasticNet
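A short scikit-learn sketch, on made-up data, showing that the Lasso tends to set some coefficients exactly to zero (which the next slide discusses):

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    # only the first two features actually matter
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

    print(lasso.coef_)   # most coefficients are exactly 0
    print(enet.coef_)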

Page 27

why does Lasso give zero coefficients?

▶ figure from Hastie et al. book