02 - Logistic Regression


Page 1: 02 - Logistic Regression - GitHub Pages

[4C16] Deep Learning and its Applications — 2018/2019

02 - Logistic Regression

François Pitié

Ussher Assistant Professor in Media Signal Processing, Department of Electronic & Electrical Engineering, Trinity College Dublin

Page 2: 02 - Logistic Regression - GitHub Pages

Motivation

With Linear Regression, we looked at linear models, where the output of the problem was a continuous variable (e.g. height, car price, temperature, …). Very often you need to design a classifier that can answer questions such as: what car type is it? is the person smiling? is a solar flare going to happen? In such problems the model depends on categorical variables. Logistic Regression (David Cox, 1958) considers the case of a binary variable, where the outcome is 0/1 or true/false.

Page 3: 02 - Logistic Regression - GitHub Pages

There is a whole zoo of classifiers out there. Why are we covering logistic regression in particular? Because logistic regression is the building block of Neural Nets.

Page 4: 02 - Logistic Regression - GitHub Pages

Introductory Example

We'll start with an example from Wikipedia: a group of students spend between 0 and 6 hours studying for an exam. How does the number of hours spent studying affect the probability that the student will pass the exam?

Page 5: 02 - Logistic Regression - GitHub Pages

Introductory Example

The collected data looks like so:

Studying hours (x)               : 0.50  0.75  1.00  1.25  …
Exam result (1 = pass, 0 = fail) : 0     0     0     0     …

Figure: scatter plot of the exam outcome y (0 or 1) against hours studying x (0 to 6).

Page 6: 02 - Logistic Regression - GitHub Pages

Linear Approximation

Although the output is binary, we could still attempt to fit a linear model via least squares:

$$h_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^{\top}\mathbf{w} = w_0 + w_1 x_1 + \cdots$$

Page 7: 02 - Logistic Regression - GitHub Pages

Linear Approximation

This is what the least squares estimate $h_{\mathbf{w}}(\mathbf{x})$ looks like:

Figure: the least squares estimate $h_{\mathbf{w}}(\mathbf{x})$ plotted against hours studying, together with the binary exam outcomes.

Page 8: 02 - Logistic Regression - GitHub Pages

The model prediction $h_{\mathbf{w}}(\mathbf{x}) = \mathbf{x}^{\top}\mathbf{w}$ is continuous, but we could apply a threshold to obtain a binary classifier as follows:

$$\hat{y} = \left[\mathbf{x}^{\top}\mathbf{w} > 0.5\right] = \begin{cases} 0 & \text{if } \mathbf{x}^{\top}\mathbf{w} \leq 0.5 \\ 1 & \text{if } \mathbf{x}^{\top}\mathbf{w} > 0.5 \end{cases}$$

and the output would be 0 or 1. Obviously this is not optimal, as we have chosen $\mathbf{w}$ so that $\mathbf{x}^{\top}\mathbf{w}$ matches $y$, and not so that $[\mathbf{x}^{\top}\mathbf{w} > 0.5]$ matches $y$.

Let’s see how this can be done.
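As a rough illustration of this thresholding idea, here is a minimal NumPy sketch; the data values and variable names are illustrative only, not the dataset from the example:

```python
import numpy as np

# Illustrative data (not the actual dataset): hours studied and exam outcome.
hours = np.array([0.5, 1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 4.5, 5.0])
passed = np.array([0.0, 0.0, 0.0, 1.0, 0.0, 1.0, 0.0, 1.0, 1.0, 1.0])

# Design matrix with a leading column of ones for the intercept w0.
X = np.column_stack([np.ones_like(hours), hours])

# Least squares estimate of w: fits x^T w to the 0/1 labels directly.
w, *_ = np.linalg.lstsq(X, passed, rcond=None)

# Binary classifier obtained by thresholding the continuous prediction at 0.5.
y_hat = (X @ w > 0.5).astype(int)
print(w, y_hat)
```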

Page 9: 02 - Logistic Regression - GitHub Pages

General Linear Model

The general problem of general linear models can be presented as follows. We are trying to find a linear combination of the data, $\mathbf{x}^{\top}\mathbf{w}$, such that the sign of $\mathbf{x}^{\top}\mathbf{w}$ tells us about the outcome $y$:

$$y = \left[\mathbf{x}^{\top}\mathbf{w} + \epsilon > 0\right]$$

The quantity $\mathbf{x}^{\top}\mathbf{w}$ is sometimes called the risk score. It is a scalar value. The larger the value of $\mathbf{x}^{\top}\mathbf{w}$ is, the more certain we are that $y = 1$. The error term is represented by the random variable $\epsilon$. Multiple choices are possible for the distribution of $\epsilon$.

Page 10: 02 - Logistic Regression - GitHub Pages

In logistic regression, the error $\epsilon$ is assumed to follow a logistic distribution and the risk score $\mathbf{x}^{\top}\mathbf{w}$ is also called the logit.

Figure: pdf of the logistic distribution

Page 11: 02 - Logistic Regression - GitHub Pages

In probit regression, the error $\epsilon$ is assumed to follow a normal distribution, and the risk score $\mathbf{x}^{\top}\mathbf{w}$ is also called the probit.

Figure: pdf of the normal distribution

Page 12: 02 - Logistic Regression - GitHub Pages

For our purposes, there is not much difference between logistic and probit regression. The main difference is that logistic regression is numerically easier to solve. From now on, we'll only look at the logistic model, but note that similar derivations could be made for any other model.

Page 13: 02 - Logistic Regression - GitHub Pages

Logistic Model

Consider $p(y=1|\mathbf{x})$, the likelihood that the output is a success:

$$p(y=1|\mathbf{x}) = p(\mathbf{x}^{\top}\mathbf{w} + \epsilon > 0) = p(\epsilon > -\mathbf{x}^{\top}\mathbf{w})$$

Since $\epsilon$ is symmetrically distributed around 0, it follows that

$$p(y=1|\mathbf{x}) = p(\epsilon < \mathbf{x}^{\top}\mathbf{w})$$

Because we have made some assumptions about the distribution of $\epsilon$, we are able to derive a closed-form expression for this likelihood.

Page 14: 02 - Logistic Regression - GitHub Pages

The Logistic Function

The function $f : t \mapsto f(t) = p(\epsilon < t)$ is the c.d.f. of the logistic distribution and is also called the logistic function or sigmoid:

$$f(t) = \frac{1}{1 + e^{-t}}$$

Figure: plot of the logistic function $f(t)$, rising from 0.0 to 1.0 as $t$ goes from $-6$ to $6$.

Page 15: 02 - Logistic Regression - GitHub Pages

Thus we have a simple model for the likelihood of success $h_{\mathbf{w}}(\mathbf{x}) = p(y=1|\mathbf{x})$:

$$h_{\mathbf{w}}(\mathbf{x}) = p(y=1|\mathbf{x}) = p(\epsilon < \mathbf{x}^{\top}\mathbf{w}) = f(\mathbf{x}^{\top}\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{x}^{\top}\mathbf{w}}}$$

The likelihood of failure is simply given by:

$$p(y=0|\mathbf{x}) = 1 - h_{\mathbf{w}}(\mathbf{x})$$

Exercise:

show that $p(y=0|\mathbf{x}) = h_{\mathbf{w}}(-\mathbf{x})$
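As a minimal sketch of this model in NumPy (the weight and feature values below are hypothetical), the likelihood of success is just the sigmoid of the risk score, and the exercise can be checked numerically:

```python
import numpy as np

def sigmoid(t):
    """Logistic function f(t) = 1 / (1 + exp(-t))."""
    return 1.0 / (1.0 + np.exp(-t))

def h(w, X):
    """Likelihood of success h_w(x) = p(y=1|x) = f(x^T w), one value per row of X."""
    return sigmoid(X @ w)

# Hypothetical weights and inputs: rows of X are [1, hours studying].
w = np.array([-1.5, 0.7])
X = np.array([[1.0, 2.0],
              [1.0, 4.0]])

# Numerical check of the exercise: p(y=0|x) = 1 - h_w(x) = h_w(-x).
print(np.allclose(1.0 - h(w, X), h(w, -X)))   # True
```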

Page 16: 02 - Logistic Regression - GitHub Pages

In linear regression, the model $h_{\mathbf{w}}(\mathbf{x})$ was a direct prediction of the outcome:

$$h_{\mathbf{w}}(\mathbf{x}) = \hat{y}$$

In logistic regression, the model $h_{\mathbf{w}}(\mathbf{x})$ makes an estimation of the likelihood of the outcome:

$$h_{\mathbf{w}}(\mathbf{x}) = p(y = 1|\mathbf{x})$$

Thus, whereas in linear regression we try to answer the question: what is the value of $y$ given $\mathbf{x}$?, in logistic regression (and any other general linear model) we instead try to answer the question: what is the probability that $y = 1$ given $\mathbf{x}$?

Page 17: 02 - Logistic Regression - GitHub Pages

Logistic Regression

Below is the plot of an estimated $h_{\mathbf{w}}(\mathbf{x}) \approx p(y=1|\mathbf{x})$ for our problem:

Figure: estimated probability of passing the exam (from 0.0 to 1.0) as a function of hours studying (0 to 6), overlaid on the binary outcomes.

The results are easy to interpret: there is about a …% chance of passing the exam if you study for … hours.

Page 18: 02 - Logistic Regression - GitHub Pages

Maximum Likelihood

To estimate the weights $\mathbf{w}$, we will again use the concept of Maximum Likelihood.

Page 19: 02 - Logistic Regression - GitHub Pages

Maximum Likelihood

As we've just seen, for a particular observation $\mathbf{x}_i$, the likelihood is given by:

$$p(y_i|\mathbf{x}_i) = \begin{cases} p(y_i = 1|\mathbf{x}_i) = h_{\mathbf{w}}(\mathbf{x}_i) & \text{if } y_i = 1 \\ p(y_i = 0|\mathbf{x}_i) = 1 - h_{\mathbf{w}}(\mathbf{x}_i) & \text{if } y_i = 0 \end{cases}$$

As $y_i \in \{0, 1\}$, this can be rewritten in a slightly more compact form as:

$$p(y_i|\mathbf{x}_i) = h_{\mathbf{w}}(\mathbf{x}_i)^{y_i}\left(1 - h_{\mathbf{w}}(\mathbf{x}_i)\right)^{1-y_i}$$

This works because $a^1 = a$ and $a^0 = 1$. The likelihood over all observations is then:

$$p(\mathbf{y}|X) = \prod_{i=1}^{n} h_{\mathbf{w}}(\mathbf{x}_i)^{y_i}\left(1 - h_{\mathbf{w}}(\mathbf{x}_i)\right)^{1-y_i}$$

Page 20: 02 - Logistic Regression - GitHub Pages

Maximum Likelihood

We want to find $\mathbf{w}$ that maximises the likelihood $p(\mathbf{y}|X)$. As always, it is equivalent but more convenient to minimise the negative log likelihood:

$$E(\mathbf{w}) = -\ln(p(\mathbf{y}|X)) = -\sum_{i=1}^{n} y_i \ln\left(h_{\mathbf{w}}(\mathbf{x}_i)\right) - \sum_{i=1}^{n}\left(1 - y_i\right) \ln\left(1 - h_{\mathbf{w}}(\mathbf{x}_i)\right)$$

This error function we need to minimise is called the cross-entropy. In the Machine Learning community, the error function is also frequently called a loss function. Thus here we would say: the loss function is the cross-entropy.
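As a small sketch (assuming NumPy and the sigmoid/design-matrix conventions used above), the cross-entropy loss can be computed directly from this expression:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    """E(w) = -sum_i y_i ln(h_w(x_i)) - sum_i (1 - y_i) ln(1 - h_w(x_i))."""
    h = sigmoid(X @ w)
    eps = 1e-12   # guards against log(0) when predictions saturate
    return -np.sum(y * np.log(h + eps) + (1.0 - y) * np.log(1.0 - h + eps))
```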

Page 21: 02 - Logistic Regression - GitHub Pages

We could have considered optimising the parameters $\mathbf{w}$ using other loss functions. For instance, we could have tried to minimise the least squares error as we did in linear regression:

$$E_2(\mathbf{w}) = \sum_{i=1}^{n}\left(h_{\mathbf{w}}(\mathbf{x}_i) - y_i\right)^2$$

The solution would not maximise the likelihood, as the cross-entropy loss does, but maybe that would still be a reasonable thing to do? The problem is that, because $h_{\mathbf{w}}$ is non-linear, $E_2(\mathbf{w})$ is non-convex, which makes its minimisation much harder than when using the cross-entropy. This is in fact a mistake that the Neural Net community made for a number of years before switching to the cross-entropy loss function.

Page 22: 02 - Logistic Regression - GitHub Pages

Optimisation: gradient descent

To minimise the error function, we need to resort to gradient descent, which is a general method for nonlinear optimisation and which will be at the core of neural network optimisation. We start at $\mathbf{w}^{(0)}$ and take steps along the steepest direction $\mathbf{v}$ using a fixed step size as follows:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} + \eta\,\mathbf{v}^{(t)}$$

$\eta$ is called the learning rate and controls the speed of the descent. What is the steepest slope $\mathbf{v}$?

Page 23: 02 - Logistic Regression - GitHub Pages

Optimisation: gradient descent

Without loss of generality, we set $\mathbf{v}$ to be a unit vector (i.e. $\|\mathbf{v}\| = 1$). Then, moving $\mathbf{w}$ to $\mathbf{w} + \eta\mathbf{v}$ yields a new error as follows:

$$E(\mathbf{w} + \eta\mathbf{v}) = E(\mathbf{w}) + \eta\,\frac{\partial E}{\partial \mathbf{w}}^{\top}\mathbf{v} + O(\eta^2)$$

which reaches a minimum when

$$\mathbf{v} = -\frac{\partial E/\partial \mathbf{w}}{\left\|\partial E/\partial \mathbf{w}\right\|}$$

Page 24: 02 - Logistic Regression - GitHub Pages

Optimisation: gradient descent

Now, it is hard to find a good value for the learning rate $\eta$, and we usually adopt an adaptive step instead. Thus, instead of using

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\frac{\partial E/\partial \mathbf{w}}{\left\|\partial E/\partial \mathbf{w}\right\|}$$

we usually use the following update step:

$$\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\frac{\partial E}{\partial \mathbf{w}}$$

Page 25: 02 - Logistic Regression - GitHub Pages

Optimisation: gradient descent

Recall that the cross-entropy loss function is:

$$E = -\sum_{i=1}^{n} y_i \ln\left(h_{\mathbf{w}}(\mathbf{x}_i)\right) - \sum_{i=1}^{n}\left(1 - y_i\right)\ln\left(1 - h_{\mathbf{w}}(\mathbf{x}_i)\right)$$

and that

$$h_{\mathbf{w}}(\mathbf{x}) = f(\mathbf{x}^{\top}\mathbf{w}) = \frac{1}{1 + e^{-\mathbf{x}^{\top}\mathbf{w}}}$$

Exercise:

Given that the derivative of the sigmoid $f$ is $f'(t) = (1 - f(t))\,f(t)$, show that

$$\frac{\partial E}{\partial \mathbf{w}} = \sum_{i=1}^{n}\left(h_{\mathbf{w}}(\mathbf{x}_i) - y_i\right)\mathbf{x}_i$$
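A quick way to sanity-check this result is to compare the analytical gradient with a finite-difference approximation; below is a small sketch with hypothetical data (assuming NumPy):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def cross_entropy(w, X, y):
    h = sigmoid(X @ w)
    return -np.sum(y * np.log(h) + (1.0 - y) * np.log(1.0 - h))

def gradient(w, X, y):
    """dE/dw = sum_i (h_w(x_i) - y_i) x_i, written as a single matrix product."""
    return X.T @ (sigmoid(X @ w) - y)

# Hypothetical data: 20 samples, intercept column plus one feature.
rng = np.random.default_rng(0)
X = np.column_stack([np.ones(20), rng.normal(size=20)])
y = (rng.random(20) < 0.5).astype(float)
w = rng.normal(size=2)

# Central finite differences, one coordinate at a time.
numerical = np.zeros_like(w)
for j in range(len(w)):
    e = np.zeros_like(w)
    e[j] = 1e-6
    numerical[j] = (cross_entropy(w + e, X, y) - cross_entropy(w - e, X, y)) / 2e-6

print(np.allclose(numerical, gradient(w, X, y), atol=1e-4))   # expected: True
```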

Page 26: 02 - Logistic Regression - GitHub Pages

Optimisation: gradient descent

The overall gradient descent method looks like so:

1. set an initial weight vector $\mathbf{w}^{(0)}$ and a learning rate $\eta$

2. for $t = 0, 1, 2, \cdots$ do until convergence

3. compute the gradient $\dfrac{\partial E}{\partial \mathbf{w}} = \sum_{i=1}^{n}\left(\dfrac{1}{1 + e^{-\mathbf{x}_i^{\top}\mathbf{w}^{(t)}}} - y_i\right)\mathbf{x}_i$

4. update the weights: $\mathbf{w}^{(t+1)} = \mathbf{w}^{(t)} - \eta\,\dfrac{\partial E}{\partial \mathbf{w}}$
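Putting the four steps together, a minimal NumPy sketch of this procedure could look as follows; the synthetic data, the fixed iteration budget, and the function name `fit_logistic` are illustrative choices, not part of the lecture:

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def fit_logistic(X, y, eta=0.01, n_iters=5000):
    """Batch gradient descent on the cross-entropy loss, following steps 1-4 above."""
    w = np.zeros(X.shape[1])                  # 1. initial weight vector w^(0)
    for _ in range(n_iters):                  # 2. iterate (fixed budget instead of a convergence test)
        grad = X.T @ (sigmoid(X @ w) - y)     # 3. dE/dw = sum_i (h_w(x_i) - y_i) x_i
        w = w - eta * grad                    # 4. w^(t+1) = w^(t) - eta * dE/dw
    return w

# Usage on synthetic data: labels drawn with generating weights [-1.0, 2.0].
rng = np.random.default_rng(1)
z = rng.normal(size=200)
X = np.column_stack([np.ones_like(z), z])
y = (rng.random(200) < sigmoid(-1.0 + 2.0 * z)).astype(float)

print(fit_logistic(X, y))   # fitted [intercept, slope], typically close to [-1.0, 2.0]
```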

Page 27: 02 - Logistic Regression - GitHub Pages

Example

Below is an example with 2 features.

Figure: scatter plot of the two classes in the 2D feature space (both features ranging from -4 to 4).

Page 28: 02 - Logistic Regression - GitHub Pages

Example

The estimate for the probability of success is of the form

$$h_{\mathbf{w}}(\mathbf{x}) = \frac{1}{1 + e^{-(w_0 + w_1 x_1 + w_2 x_2)}}$$

Below are drawn the lines that correspond to $h_{\mathbf{w}}(\mathbf{x}) = 0.05$, $h_{\mathbf{w}}(\mathbf{x}) = 0.5$ and $h_{\mathbf{w}}(\mathbf{x}) = 0.95$.

Figure: the two classes (class 0 and class 1) with the 5%, 50% and 95% probability contours.

Page 29: 02 - Logistic Regression - GitHub Pages

Multiclass Classification

Page 30: 02 - Logistic Regression - GitHub Pages

In many applications you have to deal with more than 2 classes.

Page 31: 02 - Logistic Regression - GitHub Pages

one-vs.-all

The simplest way to consider a problem that has more than 2 classes is to adopt the one-vs.-all (or one-against-all) strategy: for each class $k$, you train a single binary classifier ($y = 0$ for all other classes, and $y = 1$ for class $k$). The classifiers return a real-valued likelihood for their decision. The one-vs.-all prediction then picks the label for which the corresponding classifier reports the highest likelihood.
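As a sketch of the prediction step (assuming NumPy, and that each column of a hypothetical weight matrix `W` has already been fitted by a binary logistic regression of class $k$ against the rest):

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def one_vs_all_predict(W, X):
    """Each column of W is the weight vector of one binary classifier.
    Pick, for every row of X, the class whose classifier reports the
    highest likelihood."""
    likelihoods = sigmoid(X @ W)          # shape (n_samples, n_classes)
    return np.argmax(likelihoods, axis=1)

# Hypothetical fitted weights: 2 features + intercept, 3 classes.
W = np.array([[ 0.5, -1.0,  0.2],
              [ 2.0,  0.1, -1.5],
              [-0.3,  1.8,  0.4]])
X = np.array([[1.0,  0.2, -0.5],
              [1.0, -1.0,  1.2]])
print(one_vs_all_predict(W, X))           # predicted class index per sample
```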

Page 32: 02 - Logistic Regression - GitHub Pages

The one-vs.-all approach is a very simple one. However, it is a heuristic that has many problems. One problem is that, for each binary classifier, the negative samples (from all the classes but $k$) are more numerous and more heterogeneous than the positive samples (from class $k$). A better approach would be to have a unified model for all classifiers and train them jointly. The extension of Logistic Regression that does just this is called multinomial logistic regression.

Page 33: 02 - Logistic Regression - GitHub Pages

Multinomial Logistic Regression

In Multinomial Logistic Regression, each of the $K$ binary classifiers is based on the following likelihood model:

$$p(y = C_k|\mathbf{x}) = \mathrm{softmax}_k\!\left(\mathbf{x}^{\top}\mathbf{w}_1, \cdots, \mathbf{x}^{\top}\mathbf{w}_K\right) = \frac{\exp(\mathbf{w}_k^{\top}\mathbf{x})}{\sum_{j=1}^{K} \exp(\mathbf{w}_j^{\top}\mathbf{x})}$$

$C_k$ is the class $k$ and $\mathrm{softmax} : \mathbb{R}^K \to \mathbb{R}^K$ is the function defined as

$$\mathrm{softmax}(\mathbf{t})_k = \frac{\exp(t_k)}{\sum_{j=1}^{K} \exp(t_j)}$$

In other words, softmax takes as input the vector of logits for all classes and returns the vector of corresponding likelihoods.
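A minimal NumPy sketch of the softmax function; subtracting the maximum logit is an added numerical-stability detail, not something from the slide:

```python
import numpy as np

def softmax(t):
    """softmax(t)_k = exp(t_k) / sum_j exp(t_j).
    Subtracting max(t) leaves the result unchanged but avoids overflow."""
    t = t - np.max(t, axis=-1, keepdims=True)
    e = np.exp(t)
    return e / np.sum(e, axis=-1, keepdims=True)

# Logits for 3 classes -> vector of class likelihoods summing to 1.
print(softmax(np.array([2.0, 1.0, -1.0])))
```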

Page 34: 02 - Logistic Regression - GitHub Pages

Softmax Optimisation

To optimise for the parameters, we can again take the maximum likelihood approach. Combining the likelihood for all possible classes gives us:

$$p(y|\mathbf{x}) = p(y = C_1|\mathbf{x})^{[y = C_1]} \times \cdots \times p(y = C_K|\mathbf{x})^{[y = C_K]}$$

where $[y = C_k]$ is $1$ if $y = C_k$ and $0$ otherwise. The total likelihood is:

$$p(\mathbf{y}|X) = \prod_{i=1}^{n} p(y_i = C_1|\mathbf{x}_i)^{[y_i = C_1]} \times \cdots \times p(y_i = C_K|\mathbf{x}_i)^{[y_i = C_K]}$$

Page 35: 02 - Logistic Regression - GitHub Pages

Taking the negative log likelihood yields the cross-entropy error function for the multiclass problem:

$$E(\mathbf{w}_1, \cdots, \mathbf{w}_K) = -\ln(p(\mathbf{y}|X)) = -\sum_{i=1}^{n}\sum_{k=1}^{K} [y_i = C_k] \ln\left(p(y_i = C_k|\mathbf{x}_i)\right)$$

Similarly to logistic regression, we can use a gradient descent approach to find the $K$ weight vectors $\mathbf{w}_1, \cdots, \mathbf{w}_K$ that minimise this cross-entropy expression.
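As a sketch (assuming NumPy, a weight matrix `W` whose columns are $\mathbf{w}_1, \cdots, \mathbf{w}_K$, and labels given as one-hot indicators $[y_i = C_k]$), the multiclass cross-entropy can be written as:

```python
import numpy as np

def softmax_rows(T):
    T = T - np.max(T, axis=1, keepdims=True)   # stability shift per row
    e = np.exp(T)
    return e / np.sum(e, axis=1, keepdims=True)

def multiclass_cross_entropy(W, X, Y):
    """E(w_1,...,w_K) = -sum_i sum_k [y_i = C_k] ln p(y_i = C_k | x_i).
    X has shape (n, d), W has shape (d, K), Y is one-hot with shape (n, K)."""
    P = softmax_rows(X @ W)                    # P[i, k] = p(y_i = C_k | x_i)
    return -np.sum(Y * np.log(P + 1e-12))      # small epsilon guards log(0)
```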

Page 36: 02 - Logistic Regression - GitHub Pages

Take Away

With Logistic Regression, we look at linear models where the output of the problem is a binary categorical response. Instead of directly predicting the actual outcome as in least squares, the model proposed in logistic regression makes a prediction about the likelihood of belonging to a particular class. Finding the maximum likelihood parameters is equivalent to minimising the cross-entropy loss function. The minimisation can be done using the gradient descent technique. The extension of Logistic Regression to more than 2 classes is called Multinomial Logistic Regression.