[4C16] Deep Learning and its Applications
02 - Logistic Regression
François Pitié
Ussher Assistant Professor in Media Signal Processing, Department of Electronic & Electrical Engineering, Trinity College Dublin
Motivation
With Linear Regression, we looked at linear models, where the output of the problem was a continuous variable (e.g. height, car price, temperature, …). Very often you need to design a classifier that can answer questions such as: what car type is it? is the person smiling? is a solar flare going to happen? In such problems the model depends on categorical variables. Logistic Regression (David Cox, 1958) considers the case of a binary variable, where the outcome is 0/1 or true/false.
There is a whole zoo of classifiers out there. Why are we covering logistic regression in particular? Because logistic regression is the building block of Neural Nets.
Introductory Example
We'll start with an example from Wikipedia: a group of students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?
Introductory Example
The collected data looks like so:

Studying hours:               0.50  0.75  1.00  1.25  …
Result (1 = pass, 0 = fail):  0     0     0     0     …
Figure: exam outcome y (0 or 1) against hours studying x (0 to 6).
Linear Approximation
Although the output is binary, we could still attempt to fit a linearmodel via least squares:
ℎw(x) = xᵀw = 𝑤0 + 𝑤1𝑥1 + ⋯
Linear Approximation
This is what the least squares estimate ℎw(x) looks like:
Figure: least squares estimate ℎw(x) plotted over the exam data (hours studying on the x-axis, exam outcome on the y-axis).
The model prediction ℎw(x) = xᵀw is continuous, but we could apply a threshold to obtain a binary classifier as follows:

ŷ = [xᵀw > 0.5] = { 0 if xᵀw ≤ 0.5
                    1 if xᵀw > 0.5

and the output would be 0 or 1. Obviously this is not optimal, as we have chosen w so that xᵀw matches y, and not so that [xᵀw > 0.5] matches y.
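To make this concrete, here is a minimal NumPy sketch of the approach (the data values are illustrative, not the lecture's dataset): fit the 0/1 labels by least squares, then threshold the continuous prediction at 0.5.

```python
import numpy as np

# Toy data in the spirit of the slides (illustrative values, not the
# lecture's dataset): hours studied and pass/fail outcome.
x = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0, 1.0])

# Design matrix with a bias column, so that h_w(x) = w0 + w1 * x.
X = np.column_stack([np.ones_like(x), x])

# Ordinary least squares fit of the binary labels.
w, *_ = np.linalg.lstsq(X, y, rcond=None)

# Threshold the continuous prediction at 0.5 to get a binary classifier.
y_hat = (X @ w > 0.5).astype(int)
```

Note how the thresholded classifier can misclassify points near the boundary: w was fitted so that xᵀw matches y in the least squares sense, not so that the thresholded decision is correct.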
Let’s see how this can be done.
General Linear Model
The general problem of general linear models can be presented as follows. We are trying to find a linear combination of the data, xᵀw, such that the sign of xᵀw + 𝜖 tells us about the outcome:

y = [xᵀw + 𝜖 > 0]

The quantity xᵀw is sometimes called the risk score. It is a scalar value. The larger the value of xᵀw, the more certain we are that y = 1. The error term is represented by the random variable 𝜖. Multiple choices are possible for the distribution of 𝜖.
In logistic regression, the error 𝜖 is assumed to follow a logistic distribution, and the risk score xᵀw is also called the logit.
Figure: pdf of the logistic distribution
In probit regression, the error 𝜖 is assumed to follow a normal distribution, and the risk score xᵀw is also called the probit.
Figure: pdf of the normal distribution
For our purposes, there is not much difference between logistic and probit regression. The main difference is that logistic regression is numerically easier to solve. From now on, we'll only look at the logistic model, but note that similar derivations could be made for any other model.
Logistic Model
Consider 𝑝(y = 1|x), the likelihood that the output is a success:

𝑝(y = 1|x) = 𝑝(xᵀw + 𝜖 > 0) = 𝑝(𝜖 > −xᵀw)

Since 𝜖 is symmetrically distributed around 0, it follows that

𝑝(y = 1|x) = 𝑝(𝜖 < xᵀw)

Because we have made some assumptions about the distribution of 𝜖, we are able to derive a closed-form expression for this likelihood.
The Logistic Function
The function 𝑓 ∶ 𝑡 ↦ 𝑓(𝑡) = 𝑝(𝜖 < 𝑡) is the c.d.f. of the logistic distribution and is also called the logistic function or sigmoid:
𝑓(𝑡) = 1 / (1 + 𝑒^(−𝑡))

Figure: plot of the sigmoid 𝑓(𝑡), rising from 0 to 1.
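As a quick sketch, the logistic function is a one-liner in NumPy:

```python
import numpy as np

def sigmoid(t):
    # Logistic function f(t) = 1 / (1 + exp(-t)),
    # the c.d.f. of the logistic distribution.
    return 1.0 / (1.0 + np.exp(-t))
```

It is centred at f(0) = 0.5 and satisfies the symmetry f(−t) = 1 − f(t), which is used in the exercise below.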
Thus we have a simple model for the likelihood of success ℎw(x) = 𝑝(y = 1|x):

ℎw(x) = 𝑝(y = 1|x) = 𝑝(𝜖 < xᵀw) = 𝑓(xᵀw) = 1 / (1 + 𝑒^(−xᵀw))

The likelihood of failure is simply given by:

𝑝(y = 0|x) = 1 − ℎw(x)
Exercise:
show that 𝑝(y = 0|x) = ℎw(−x)
In linear regression, the model ℎw(x) was a direct prediction of the outcome:

ℎw(x) = ŷ

In logistic regression, the model ℎw(x) makes an estimation of the likelihood of the outcome:

ℎw(x) = 𝑝(y = 1|x)

Thus, whereas in linear regression we try to answer the question: what is the value of y given x? In logistic regression (and any other general linear model), we try instead to answer the question: what is the probability that y = 1 given x?
Logistic Regression
Below is the plot of an estimated ℎw(x) ≈ 𝑝(y = 1|x) for our problem:

Figure: fitted logistic curve ℎw(x) over the exam data (hours studying on the x-axis, exam outcome on the y-axis).

The results are easy to interpret: the curve directly gives the estimated chance of passing the exam for any number of hours studied.
Maximum Likelihood
To estimate the weights w, we will again use the concept of Maximum Likelihood.
Maximum Likelihood
As we've just seen, for a particular observation xᵢ, the likelihood is given by:

𝑝(y = yᵢ|xᵢ) = ℎw(xᵢ)       if yᵢ = 1
𝑝(y = yᵢ|xᵢ) = 1 − ℎw(xᵢ)   if yᵢ = 0

As yᵢ ∈ {0, 1}, this can be rewritten in a slightly more compact form as:

𝑝(y = yᵢ|xᵢ) = ℎw(xᵢ)^yᵢ (1 − ℎw(xᵢ))^(1−yᵢ)

This works because 𝑎¹ = 𝑎 and 𝑎⁰ = 1. The likelihood over all observations is then:

𝑝(y|X) = ∏ᵢ ℎw(xᵢ)^yᵢ (1 − ℎw(xᵢ))^(1−yᵢ)
Maximum Likelihood
We want to find w that maximises the likelihood 𝑝(y|X). As always, it is equivalent but more convenient to minimise the negative log-likelihood:

𝐸(w) = −ln(𝑝(y|X)) = −∑ᵢ [ yᵢ ln(ℎw(xᵢ)) + (1 − yᵢ) ln(1 − ℎw(xᵢ)) ]

The error function we need to minimise is called the cross-entropy. In the Machine Learning community, the error function is also frequently called a loss function. Thus here we would say: the loss function is the cross-entropy.
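The cross-entropy translates directly into code. A minimal NumPy sketch (the helper names are mine, not from the slides):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    # E(w) = -sum_i [ y_i ln h_w(x_i) + (1 - y_i) ln(1 - h_w(x_i)) ]
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# Sanity check: with w = 0 the model outputs h = 0.5 everywhere, so each
# of the n observations contributes ln 2 to the loss.
X = np.column_stack([np.ones(4), np.arange(4.0)])
y = np.array([0.0, 1.0, 0.0, 1.0])
E0 = cross_entropy(np.zeros(2), X, y)
```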
We could have considered optimising the parameters w using other loss functions. For instance, we could have tried to minimise the least squares error, as we did in linear regression:

𝐸(w) = ∑ᵢ (ℎw(xᵢ) − yᵢ)²

The solution would not maximise the likelihood, as the cross-entropy loss does, but maybe that would still be a reasonable thing to do? The problem is that this error is non-convex in w, which makes the minimisation of 𝐸(w) much harder than when using the cross-entropy. This is in fact a mistake that the Neural Net community made for a number of years before switching to the cross-entropy loss function.
Optimisation: gradient descent
To minimise the error function, we need to resort to gradient descent, which is a general method for nonlinear optimisation and which will be at the core of neural network optimisation. We start at w(0) and take steps along the steepest direction v using a fixed step size as follows:

w(𝑡+1) = w(𝑡) + 𝜂 v(𝑡)

𝜂 is called the learning rate and controls the speed of the descent. What is the steepest slope v?
Optimisation: gradient descent
Without loss of generality we set v to be a unit vector (i.e. ‖v‖ = 1). Then, moving w to w + 𝜂v yields a new error as follows:

𝐸(w + 𝜂v) = 𝐸(w) + 𝜂 (∂𝐸/∂w)ᵀ v + 𝑂(𝜂²)

which reaches a minimum when

v = − (∂𝐸/∂w) / ‖∂𝐸/∂w‖
Optimisation: gradient descent
Now, it is hard to find a good value for the learning rate 𝜂, and we usually adopt an adaptive step instead. Thus, instead of using

w(𝑡+1) = w(𝑡) − 𝜂 (∂𝐸/∂w) / ‖∂𝐸/∂w‖

we usually use the following update step:

w(𝑡+1) = w(𝑡) − 𝜂 ∂𝐸/∂w
Optimisation: gradient descent
Recall that the cross-entropy loss function is:

𝐸(w) = −∑ᵢ [ yᵢ ln(ℎw(xᵢ)) + (1 − yᵢ) ln(1 − ℎw(xᵢ)) ]

and that

ℎw(x) = 𝑓(xᵀw) = 1 / (1 + 𝑒^(−xᵀw))
Exercise:
Given that the derivative of the sigmoid 𝑓 is 𝑓′(𝑡) = (1 − 𝑓(𝑡))𝑓(𝑡), show that

∂𝐸/∂w = ∑ᵢ (ℎw(xᵢ) − yᵢ) xᵢ
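The gradient formula stated in the exercise can be checked numerically against central finite differences of the cross-entropy. A sketch (function names and random data are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

def gradient(w, X, y):
    # dE/dw = sum_i (h_w(x_i) - y_i) x_i, written as one matrix product.
    return X.T @ (sigmoid(X @ w) - y)

# Compare against central finite differences on random data.
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = (rng.random(20) > 0.5).astype(float)
w = rng.normal(size=3)

eps = 1e-6
num_grad = np.array([
    (cross_entropy(w + eps * e, X, y) - cross_entropy(w - eps * e, X, y))
    / (2 * eps)
    for e in np.eye(3)
])
```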
Optimisation: gradient descent
The overall gradient descent method looks like so:

1. set an initial weight vector w(0) and a learning rate 𝜂
2. for 𝑡 = 0, 1, 2, ⋯ do until convergence:
3.    compute the gradient: ∂𝐸/∂w = ∑ᵢ ( 1/(1 + 𝑒^(−xᵢᵀw)) − yᵢ ) xᵢ
4.    update the weights: w(𝑡+1) = w(𝑡) − 𝜂 ∂𝐸/∂w
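The four steps above can be sketched in NumPy as follows (toy data and hyperparameters are illustrative; in practice the learning rate and stopping rule need tuning, and here a fixed number of steps stands in for a convergence test):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta, n_steps):
    w = np.zeros(X.shape[1])               # 1. initial weight vector
    for _ in range(n_steps):               # 2. iterate (fixed step count)
        grad = X.T @ (sigmoid(X @ w) - y)  # 3. gradient of the cross-entropy
        w = w - eta * grad                 # 4. update step
    return w

# Toy, linearly separable data: pass the exam iff x > 3 (illustrative).
x = np.array([0.5, 1.0, 2.0, 2.5, 3.5, 4.0, 5.0, 6.0])
y = np.array([0.0, 0.0, 0.0, 0.0, 1.0, 1.0, 1.0, 1.0])
X = np.column_stack([np.ones_like(x), x])

w = fit_logistic(X, y, eta=0.05, n_steps=2000)
p = sigmoid(X @ w)   # estimated pass probabilities
```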
Example
Below is an example with 2 features.
Figure: scatter plot of the two classes of data points in the 2-D feature space.
Example
The estimate for the probability of success is

ℎw(x) = 1 / (1 + 𝑒^(−xᵀw))

Below are drawn the lines that correspond to ℎw(x) = 0.05, ℎw(x) = 0.5 and ℎw(x) = 0.95.

Figure: the two classes with the 5%, 50% and 95% probability level lines.
Multiclass Classification
In many applications you have to deal with more than 2 classes.
one-vs.-all
The simplest way to consider a problem that has more than 2 classes is to adopt the one-vs.-all (or one-against-all) strategy: for each class 𝑘, you train a single binary classifier (y = 0 for all other classes, and y = 1 for class 𝑘). The classifiers return a real-valued likelihood for their decision. The one-vs.-all prediction then picks the label for which the corresponding classifier reports the highest likelihood.
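A minimal sketch of the one-vs.-all strategy, assuming a plain batch-gradient-descent logistic fit and synthetic cluster data (all names, data and hyperparameters here are illustrative):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.005, n_steps=3000):
    # Batch gradient descent on the binary cross-entropy.
    w = np.zeros(X.shape[1])
    for _ in range(n_steps):
        w -= eta * (X.T @ (sigmoid(X @ w) - y))
    return w

def one_vs_all_fit(X, y, n_classes):
    # One binary classifier per class: y == k against all other classes.
    return np.column_stack([
        fit_logistic(X, (y == k).astype(float)) for k in range(n_classes)
    ])

def one_vs_all_predict(X, W):
    # Pick the class whose classifier reports the highest likelihood.
    return np.argmax(sigmoid(X @ W), axis=1)

# Three synthetic, well-separated 2-D clusters (illustrative data).
rng = np.random.default_rng(1)
centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
pts = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centres])
X = np.column_stack([np.ones(len(pts)), pts])   # bias column
y = np.repeat(np.arange(3), 30)

W = one_vs_all_fit(X, y, 3)
pred = one_vs_all_predict(X, W)
```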
The one-vs.-all approach is a very simple one. However, it is a heuristic that has many problems. One problem is that for each binary classifier, the negative samples (from all the classes but 𝑘) are more numerous and more heterogeneous than the positive samples (from class 𝑘). A better approach would be to have a unified model for all classifiers and jointly train them. The extension of Logistic Regression that does just this is called multinomial logistic regression.
Multinomial Logistic Regression
In Multinomial Logistic Regression, each of the 𝐾 binary classifiers is based on the following likelihood model:

𝑝(y = 𝐶k|x) = softmax(xᵀW)k = exp(wkᵀx) / ∑j exp(wjᵀx)

where 𝐶k denotes class 𝑘 and softmax ∶ ℝᴷ → ℝᴷ is the function defined as

softmax(t)k = exp(𝑡k) / ∑j exp(𝑡j)

In other words, softmax takes as an input the vector of logits for all classes and returns the vector of corresponding likelihoods.
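A direct transcription of the softmax definition (the max-subtraction trick is my addition: it cancels in the ratio, so it leaves the result unchanged, but it avoids overflow in exp() for large logits):

```python
import numpy as np

def softmax(t):
    # exp(t_k) / sum_j exp(t_j); subtracting max(t) is for stability only.
    z = np.exp(t - np.max(t))
    return z / np.sum(z)

p = softmax(np.array([2.0, 1.0, 0.1]))   # logits -> likelihoods
```

The output is a proper probability vector: all entries are positive and sum to 1, with larger logits mapping to larger likelihoods.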
Softmax Optimisation
To optimise for the parameters, we can take again the maximum likelihood approach. Combining the likelihood for all possible classes gives us:

𝑝(y|x) = 𝑝(y = 𝐶1|x)^[y=𝐶1] × ⋯ × 𝑝(y = 𝐶K|x)^[y=𝐶K]

where [y = 𝐶k] is 1 if y = 𝐶k and 0 otherwise. The total likelihood is:

𝑝(y|X) = ∏ᵢ 𝑝(y = 𝐶1|xᵢ)^[yᵢ=𝐶1] × ⋯ × 𝑝(y = 𝐶K|xᵢ)^[yᵢ=𝐶K]
Taking the negative log-likelihood yields the cross-entropy error function for the multiclass problem:

𝐸(w1, ⋯ , wK) = −ln(𝑝(y|X)) = −∑ᵢ ∑k [yᵢ = 𝐶k] ln(𝑝(y = 𝐶k|xᵢ))

Similarly to logistic regression, we can use a gradient descent approach to find the 𝐾 weight vectors w1, ⋯ , wK that minimise this cross-entropy expression.
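Putting the pieces together, joint training of the 𝐾 weight vectors by gradient descent on this cross-entropy can be sketched as follows (function names, toy data and hyperparameters are illustrative; the gradient of the multiclass cross-entropy takes the same (prediction − target) form as in the binary case):

```python
import numpy as np

def softmax_rows(T):
    # Row-wise softmax; subtracting the row max keeps exp() stable.
    Z = np.exp(T - T.max(axis=1, keepdims=True))
    return Z / Z.sum(axis=1, keepdims=True)

def fit_multinomial(X, y, n_classes, eta=0.005, n_steps=3000):
    # Jointly train the K weight vectors (columns of W) by gradient
    # descent; Y holds the one-hot targets [y_i = C_k].
    Y = np.eye(n_classes)[y]
    W = np.zeros((X.shape[1], n_classes))
    for _ in range(n_steps):
        P = softmax_rows(X @ W)       # p(y = C_k | x_i) for every i, k
        W -= eta * (X.T @ (P - Y))    # gradient is X^T (P - Y)
    return W

# Three synthetic, well-separated 2-D clusters (illustrative data).
rng = np.random.default_rng(2)
centres = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
pts = np.vstack([c + 0.3 * rng.normal(size=(30, 2)) for c in centres])
X = np.column_stack([np.ones(len(pts)), pts])   # bias column
y = np.repeat(np.arange(3), 30)

W = fit_multinomial(X, y, 3)
pred = np.argmax(X @ W, axis=1)
```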
Take Away
With Logistic Regression, we look at linear models where the output of the problem is a binary categorical response. Instead of directly predicting the actual outcome, as in least squares, the model proposed in logistic regression makes a prediction about the likelihood of belonging to a particular class. Finding the maximum likelihood parameters is equivalent to minimising the cross-entropy loss function. The minimisation can be done using the gradient descent technique. The extension of Logistic Regression to more than 2 classes is called Multinomial Logistic Regression.