
Page 1

Applied Machine Learning
Lecture 5-1: Optimization in machine learning

Selpi (selpi@chalmers.se)

The slides are a further development of Richard Johansson's slides.

February 7, 2020

Page 2

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 3

Objective functions for machine learning

▶ in this lecture, we'll discuss objective functions for our models

▶ training a model = minimizing (or maximizing) this objective

▶ often, the objective consists of a combination of
  ▶ a loss function that measures how well the model fits the training data
  ▶ a regularizer that measures how simple/complex the model is

Page 4

minimizing squared errors

▶ in the least squares approach to linear regression, we want to minimize the mean (or sum) of squared errors

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2


▶ From calculus: if a "nice" function has a maximum or minimum, then the derivative will be zero there
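To make the loss concrete, here is a minimal NumPy sketch of this mean squared error (the function name mse_loss and the toy data are my own, not from the slides):

    import numpy as np

    def mse_loss(w, X, y):
        """Mean squared error of the linear model w on data (X, y)."""
        errors = X @ w - y            # predictions minus targets
        return np.mean(errors ** 2)

    # toy data: 5 instances with 2 features each
    X = np.array([[1.0, 2.0], [2.0, 1.0], [0.5, 0.5], [3.0, 1.0], [1.0, 1.0]])
    y = np.array([3.0, 3.0, 1.0, 4.0, 2.0])
    print(mse_loss(np.zeros(2), X, y))    # loss of the all-zero weight vector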

Page 5

the gradient

▶ the multidimensional equivalent of the derivative is called the gradient

▶ if f is a function of n variables, then the gradient is an n-dimensional vector, often written ∇f(x)

▶ intuition: the gradient points in the uphill direction

▶ again: the gradient is zero if we have an optimum

▶ this intuition leads to a simple idea for finding the minimum:
  ▶ take a small step in the direction opposite to the gradient
  ▶ repeat until the gradient is close enough to zero

Page 6

Gradient descent, pseudocode

▶ the same thing again, in pseudocode:

  1. set x to some initial value, and select a suitable step size η
  2. compute the gradient ∇f(x)
  3. if ∇f(x) is small enough, we are done
  4. otherwise, subtract η · ∇f(x) from x and go back to step 2

▶ conversely, to find the maximum we can do gradient ascent: then we instead add η · ∇f(x) to x
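A minimal Python sketch of this pseudocode, applied to a toy quadratic function of my own choosing (not from the slides):

    import numpy as np

    def gradient_descent(grad_f, x0, eta=0.1, tol=1e-6, max_iter=1000):
        """Gradient descent following the four pseudocode steps above."""
        x = np.array(x0, dtype=float)        # step 1: initial value
        for _ in range(max_iter):
            g = grad_f(x)                    # step 2: compute the gradient
            if np.linalg.norm(g) < tol:      # step 3: gradient small enough -> done
                break
            x = x - eta * g                  # step 4: take a step downhill
        return x

    # toy example: f(x) = (x1 - 3)^2 + (x2 + 1)^2, with gradient 2 * (x - (3, -1))
    grad = lambda x: 2 * (x - np.array([3.0, -1.0]))
    print(gradient_descent(grad, x0=[0.0, 0.0]))   # approaches [3, -1]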

Page 7

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 8

stochastic gradient descent (SGD)

▶ in machine learning, our objective functions are often defined as sums over the whole training set, such as the least squares loss function

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2

▶ if N is large, it would take time to use gradient descent to minimize the loss function
  ▶ because gradient descent takes into account all the instances in the training set

▶ stochastic gradient descent: simplify the computation by computing the gradient using just a small part
  ▶ in the extreme case, a single training instance
  ▶ or minibatch: a few instances
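A small sketch of what "using just a small part" can look like for the least squares loss, assuming NumPy arrays X (N×d) and y (length N); the function name and batch size are my own:

    import numpy as np

    def minibatch_gradient(w, X, y, batch_size=8, rng=None):
        """Least squares gradient estimated on a random minibatch."""
        rng = rng or np.random.default_rng()
        idx = rng.choice(len(y), size=batch_size, replace=False)   # pick a few instances
        Xb, yb = X[idx], y[idx]
        errors = Xb @ w - yb
        # gradient of (1/B) * sum over the minibatch of (w · x - y)^2
        return (2.0 / batch_size) * (Xb.T @ errors)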

Page 9

SGD: pseudocode

1. set w to some initial value, and select a suitable step size η

2. select a single training instance x

3. compute the gradient ∇f(w) using x only

4. if we are "done", stop

5. otherwise, subtract η · ∇f(w) from w and go back to step 2

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      compute the gradient ∇f_i(w) for the current instance (x_i, y_i)
      w = w − η · ∇f_i(w)
  return w


Page 11

when to terminate SGD?

▶ simple solution: fixed number of iterations

▶ or stop when we've seen no improvement for some time

▶ or evaluate on a held-out set: early stopping
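A rough sketch of the early-stopping idea, assuming we can run one training pass and compute a held-out loss; step and val_loss are placeholder callables supplied by the caller, not anything defined in the slides:

    def train_with_early_stopping(w, step, val_loss, max_epochs=100, patience=5):
        """Stop when the held-out loss has not improved for `patience` epochs."""
        best_w, best_loss = w, float("inf")
        epochs_without_improvement = 0
        for epoch in range(max_epochs):
            w = step(w)                    # one SGD pass over the training data
            loss = val_loss(w)             # loss on the held-out set
            if loss < best_loss:
                best_w, best_loss = w, loss
                epochs_without_improvement = 0
            else:
                epochs_without_improvement += 1
                if epochs_without_improvement >= patience:
                    break                  # early stopping
        return best_w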

Page 12

applying SGD to the least squares loss

▶ the least squares loss function:

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2

▶ let's consider just a single instance:

  f_i(w) = (w \cdot x_i - y_i)^2

▶ for this instance, the gradient of the least squares loss with respect to w is

  \nabla f_i(w) = 2 \cdot (w \cdot x_i - y_i) \cdot x_i
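A quick numerical sanity check of this gradient formula against finite differences, on an arbitrary instance of my own (purely illustrative):

    import numpy as np

    rng = np.random.default_rng(0)
    w = rng.normal(size=3)
    x_i = rng.normal(size=3)
    y_i = 1.5

    f_i = lambda w: (w @ x_i - y_i) ** 2        # single-instance squared error
    analytic = 2 * (w @ x_i - y_i) * x_i        # the gradient formula above

    # central finite differences, one coordinate at a time
    eps = 1e-6
    numeric = np.array([(f_i(w + eps * e) - f_i(w - eps * e)) / (2 * eps)
                        for e in np.eye(3)])
    print(np.allclose(analytic, numeric, atol=1e-5))   # should print True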

Page 13

plugging it back into SGD ⇒ Widrow-Hoff again!

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      error = w · x_i − y_i
      ∇f_i(w) = 2 · error · x_i
      w = w − η · ∇f_i(w)
  return w
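A runnable NumPy version of this loop on synthetic data (the data, learning rate, and number of epochs are my own choices, not from the slides):

    import numpy as np

    def sgd_least_squares(X, y, eta=0.01, n_epochs=20):
        """SGD for least squares linear regression (the Widrow-Hoff rule)."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                error = w @ x_i - y_i
                w = w - eta * 2 * error * x_i    # w = w − η · ∇f_i(w)
        return w

    # synthetic data generated from known weights, plus a little noise
    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(200, 3))
    y = X @ true_w + 0.1 * rng.normal(size=200)
    print(sgd_least_squares(X, y))    # should come out close to [2.0, -3.0, 0.5]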

Page 14

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 15

the Python machine learning ecosystem (selection)

Page 16

finding and installing PyTorch

▶ easy to install using conda

▶ https://pytorch.org/

Page 17

what is the purpose of PyTorch?

▶ at a high level: a library for prototyping ML models

▶ at a low level: NumPy-like, with automatic gradients

▶ easy to move computations to a GPU (if available)

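To illustrate "NumPy-like with automatic gradients", here is a minimal PyTorch sketch that recomputes the single-instance least squares gradient from page 12 with autograd (the numbers are arbitrary):

    import torch

    # a weight vector with gradient tracking enabled
    w = torch.tensor([1.0, -2.0, 0.5], requires_grad=True)
    x_i = torch.tensor([2.0, 1.0, 3.0])
    y_i = torch.tensor(4.0)

    loss = (w @ x_i - y_i) ** 2    # single-instance squared error
    loss.backward()                # autograd computes d(loss)/dw

    print(w.grad)                               # gradient from autograd
    print(2 * (w @ x_i - y_i).item() * x_i)     # the same, from the handwritten formula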

Page 18

Overview

Optimization in machine learning

SGD for training a linear regression model

introduction to PyTorch

regularized linear regression models

Page 19

recall: the fundamental tradeoff in machine learning

▶ goodness of fit: the learned model should describe the examples in the training data

▶ regularization: the model should be simple

▶ the least squares loss just takes care of the first part!

  f(w) = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2


Page 20

what does it mean to "keep the model simple"?

▶ concretely, how can we add regularization to the linear regression model?

▶ the most common approach is to add a term that keeps the weights small

▶ for instance, by penalizing the squared length (norm) of the weight vector:

  \|w\|^2 = w_1 \cdot w_1 + \ldots + w_n \cdot w_n = w \cdot w

▶ this is called an L2 regularizer (or an L2 penalty)

Page 21

combining the pieces

▶ we combine the loss and the regularizer:

  \frac{1}{N} \sum_{i=1}^{N} \mathrm{Loss}(w, x_i, y_i) + \alpha \cdot \mathrm{Regularizer}(w)
    = \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ in this formula, α is a "tweaking" parameter that controls the tradeoff between loss and regularization
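As a concrete version of this formula, a small NumPy sketch (the function name regularized_objective and the toy values are my own):

    import numpy as np

    def regularized_objective(w, X, y, alpha):
        """Mean squared error plus an L2 penalty weighted by alpha."""
        mse = np.mean((X @ w - y) ** 2)    # the loss term
        penalty = alpha * (w @ w)          # the regularizer ‖w‖²
        return mse + penalty

    X = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    y = np.array([1.0, 2.0, 3.0])
    w = np.array([1.0, 2.0])
    print(regularized_objective(w, X, y, alpha=0.1))   # 0.0 loss + 0.1 · 5 = 0.5

With alpha = 0 this reduces to the plain least squares objective.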

Page 22

ridge regression

▶ the combination of the least squares loss with L2 regularization is called ridge regression

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ in scikit-learn: sklearn.linear_model.Ridge

▶ (unregularized least squares: sklearn.linear_model.LinearRegression)
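A short scikit-learn sketch comparing the two estimators; the synthetic data and the value of alpha are arbitrary:

    import numpy as np
    from sklearn.linear_model import Ridge, LinearRegression

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 5))
    y = X @ np.array([1.0, 0.0, -2.0, 0.0, 0.5]) + 0.1 * rng.normal(size=100)

    ridge = Ridge(alpha=1.0).fit(X, y)     # L2-regularized least squares
    ols = LinearRegression().fit(X, y)     # unregularized least squares

    print(ridge.coef_)   # slightly shrunk towards zero
    print(ols.coef_)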

Page 23

discussion

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ what happens if we set α to 0?

▶ what happens if we set α to a huge number?

Page 25

SGD for the ridge regression model

  \frac{1}{N} \sum_{i=1}^{N} (w \cdot x_i - y_i)^2 + \alpha \cdot \|w\|^2

▶ the gradient with respect to a single training instance is

  \nabla f_i(w) = 2 \cdot \mathrm{error} \cdot x_i + 2 \cdot \alpha \cdot w

  where error = w · x_i − y_i; in the pseudocode below, the constant factor 2 is folded into the step size η

  w = (0, ..., 0)
  for (x_i, y_i) in the training set:
      g = w · x_i
      error = g − y_i
      w = w − η · (error · x_i + α · w)
  return w
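A runnable NumPy version of this loop; the synthetic data, alpha, learning rate, and epoch count are my own choices:

    import numpy as np

    def sgd_ridge(X, y, alpha=0.1, eta=0.01, n_epochs=20):
        """SGD for ridge regression, with the factor 2 folded into the step size."""
        w = np.zeros(X.shape[1])
        for _ in range(n_epochs):
            for x_i, y_i in zip(X, y):
                error = w @ x_i - y_i
                w = w - eta * (error * x_i + alpha * w)   # loss gradient + L2 penalty
        return w

    rng = np.random.default_rng(0)
    true_w = np.array([2.0, -3.0, 0.5])
    X = rng.normal(size=(200, 3))
    y = X @ true_w + 0.1 * rng.normal(size=200)
    print(sgd_ridge(X, y))    # close to true_w, but slightly shrunk towards zero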

Page 26

the Lasso and ElasticNet models

▶ another common regularizer is the L1 norm

  \|w\|_1 = |w_1| + \ldots + |w_n|

▶ the L1 norm gives the Lasso model

▶ a combination of L1 and L2 gives the ElasticNet model

▶ in scikit-learn:
  ▶ sklearn.linear_model.Lasso
  ▶ sklearn.linear_model.ElasticNet
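A short scikit-learn sketch, on made-up data, showing that the Lasso tends to set some coefficients exactly to zero (which the next slide discusses):

    import numpy as np
    from sklearn.linear_model import Lasso, ElasticNet

    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    # only the first two features actually matter
    y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + 0.1 * rng.normal(size=100)

    lasso = Lasso(alpha=0.1).fit(X, y)
    enet = ElasticNet(alpha=0.1, l1_ratio=0.5).fit(X, y)

    print(lasso.coef_)   # most coefficients are exactly 0
    print(enet.coef_)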

Page 27

why does Lasso give zero coefficients?

▶ figure from Hastie et al. book