ECE595 / STAT598: Machine Learning I
Lecture 14: Logistic Regression

Spring 2020

Stanley Chan

School of Electrical and Computer Engineering
Purdue University

Overview

In linear discriminant analysis (LDA), there are generally two types of approaches:

Generative approach: Estimate model, then define the classifier

Discriminative approach: Directly define the classifier

Outline

Discriminative Approaches

Lecture 14 Logistic Regression 1

Lecture 15 Logistic Regression 2

This lecture: Logistic Regression 1

From Linear to Logistic: Motivation, Loss Function, Why not L2 Loss?

Interpreting Logistic: Maximum Likelihood, Log-odds

Convexity: Is the logistic loss convex? Computation

Geometry of Linear Regression

The discriminant function g(x) is linear

The hypothesis function h(x) = sign(g(x)) is a unit step

From Linear to Logistic Regression

Can we replace g(x) by sign(g(x))?

How about a soft-version of sign(g(x))?

This gives a logistic regression.

Sigmoid Function

The function

h(x) = \frac{1}{1 + e^{-g(x)}} = \frac{1}{1 + e^{-(w^T x + w_0)}}

is called a sigmoid function. Its 1D form is

h(x) = \frac{1}{1 + e^{-a(x - x_0)}}, \quad \text{for some } a \text{ and } x_0,

where a controls the transition speed and x_0 controls the cutoff location.
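As a quick illustration of the 1D form, here is a minimal sketch (not from the slides; the parameter names a and x0 simply mirror the formula above):

import numpy as np

def sigmoid(x, a=1.0, x0=0.0):
    # 1D sigmoid h(x) = 1 / (1 + exp(-a * (x - x0)))
    return 1.0 / (1.0 + np.exp(-a * (x - x0)))

x = np.linspace(-5.0, 5.0, 11)
print(sigmoid(x))                 # gentle transition centered at x0 = 0
print(sigmoid(x, a=5.0, x0=2.0))  # steeper transition, cutoff shifted to x0 = 2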

Sigmoid Function

Note that

h(x) → 1 as x → ∞, and h(x) → 0 as x → −∞.

So h(x) can be regarded as a “probability”.

Sigmoid Function

The derivative is

\frac{d}{dx}\left(\frac{1}{1 + e^{-a(x-x_0)}}\right)
= -\left(1 + e^{-a(x-x_0)}\right)^{-2}\left(e^{-a(x-x_0)}\right)(-a)
= a\left(\frac{e^{-a(x-x_0)}}{1 + e^{-a(x-x_0)}}\right)\left(\frac{1}{1 + e^{-a(x-x_0)}}\right)
= a\left(1 - \frac{1}{1 + e^{-a(x-x_0)}}\right)\left(\frac{1}{1 + e^{-a(x-x_0)}}\right)
= a\,[1 - h(x)]\,[h(x)].

Since 0 < h(x) < 1, we have 0 < 1 − h(x) < 1.

Therefore, the derivative is always positive.

So h is an increasing function.

Hence h can be considered as a “CDF”.
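A quick finite-difference check of the identity h'(x) = a[1 − h(x)]h(x) (a small numerical sketch, not part of the slides; a and x0 below are arbitrary):

import numpy as np

a, x0 = 2.0, 0.5
h = lambda x: 1.0 / (1.0 + np.exp(-a * (x - x0)))

x = np.linspace(-3.0, 3.0, 7)
eps = 1e-6
numeric = (h(x + eps) - h(x - eps)) / (2 * eps)  # central difference
analytic = a * (1 - h(x)) * h(x)                 # derived formula
print(np.max(np.abs(numeric - analytic)))        # ~1e-10: the two agree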

Sigmoid Function

[Figure: plot of the sigmoid (logistic) function. Source: http://georgepavlides.info/wp-content/uploads/2018/02/logistic-binary-e1517639495140.jpg]

Loss Function for Linear Regression

All discriminant algorithms have a Training Loss Function

J(\theta) = \frac{1}{N}\sum_{n=1}^{N} L(g(x_n), y_n).

In linear regression,

J(\theta) = \frac{1}{N}\sum_{n=1}^{N} (g(x_n) - y_n)^2
= \frac{1}{N}\sum_{n=1}^{N} (w^T x_n + w_0 - y_n)^2
= \frac{1}{N}\left\|
\begin{bmatrix} x_1^T & 1 \\ \vdots & \vdots \\ x_N^T & 1 \end{bmatrix}
\begin{bmatrix} w \\ w_0 \end{bmatrix}
-
\begin{bmatrix} y_1 \\ \vdots \\ y_N \end{bmatrix}
\right\|^2
= \frac{1}{N}\|A\theta - y\|^2.
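In this matrix form the linear-regression loss is minimized by an ordinary least-squares solve. A minimal sketch with synthetic data (the true w and w0 below are made up for illustration):

import numpy as np

rng = np.random.default_rng(0)
N = 50
X = rng.normal(size=(N, 2))
y = X @ np.array([1.5, -2.0]) + 0.3 + 0.1 * rng.normal(size=N)

A = np.hstack([X, np.ones((N, 1))])            # rows are [x_n^T, 1]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)  # minimizes ||A theta - y||^2
w, w0 = theta[:-1], theta[-1]
print(w, w0)                                   # approximately [1.5, -2.0] and 0.3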

Training Loss for Logistic Regression

J(\theta) = \sum_{n=1}^{N} L(h_\theta(x_n), y_n)
= \sum_{n=1}^{N} -\left\{ y_n \log h_\theta(x_n) + (1 - y_n)\log(1 - h_\theta(x_n)) \right\}

This loss is also called the cross-entropy loss. Why do we want to choose this cost function?

Consider two cases:

y_n \log h_\theta(x_n) =
\begin{cases} 0, & \text{if } y_n = 1 \text{ and } h_\theta(x_n) = 1, \\ -\infty, & \text{if } y_n = 1 \text{ and } h_\theta(x_n) = 0, \end{cases}

(1 - y_n)\log(1 - h_\theta(x_n)) =
\begin{cases} 0, & \text{if } y_n = 0 \text{ and } h_\theta(x_n) = 0, \\ -\infty, & \text{if } y_n = 0 \text{ and } h_\theta(x_n) = 1. \end{cases}

So the loss is zero when the prediction matches the label, and infinite when it completely mismatches.
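A direct implementation of this training loss (a minimal sketch, not the course code; the sigmoid hθ is computed from θᵀx with the bias folded into the last column, and a small clip guards against log(0)):

import numpy as np

def cross_entropy_loss(theta, X, y, eps=1e-12):
    # J(theta) = sum_n -[ y_n log h(x_n) + (1 - y_n) log(1 - h(x_n)) ]
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))  # h_theta(x_n) for every sample
    h = np.clip(h, eps, 1.0 - eps)          # avoid log(0) when h saturates
    return -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))

# toy usage: 3 samples, 2 features plus a constant column for the bias w0
X = np.array([[0.5, 1.0, 1.0], [-1.0, 0.2, 1.0], [2.0, -0.3, 1.0]])
y = np.array([1.0, 0.0, 1.0])
print(cross_entropy_loss(np.zeros(3), X, y))  # equals 3*log(2) at theta = 0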

Why Not L2 Loss?

Why not use L2 loss?

J(\theta) = \sum_{n=1}^{N} (h_\theta(x_n) - y_n)^2

Let's look at the 1D case:

J(\theta) = \left(\frac{1}{1 + e^{-\theta x}} - y\right)^2.

This is NOT convex!

How about the logistic loss?

J(\theta) = y \log\left(\frac{1}{1 + e^{-\theta x}}\right) + (1 - y)\log\left(1 - \frac{1}{1 + e^{-\theta x}}\right)

This is concave, so its negative (the loss we actually minimize) is convex!

Why Not L2 Loss?

Experiment: Set x = 1 and y = 1.

Plot J(θ) as a function of θ.

[Figure: J(θ) plotted over θ ∈ [−5, 5] for x = 1, y = 1. Left panel: the L2 loss (values between 0 and 1), which is not convex. Right panel: the logistic objective (values between −6 and 0), which is concave.]

So the L2 loss is not convex, but the logistic loss is concave (its negative is convex)

If you do gradient descent on L2, you will be trapped at local minima
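The experiment is easy to reproduce numerically. A minimal sketch (no plotting; convexity is checked crudely through the sign of second differences):

import numpy as np

x, y = 1.0, 1.0
theta = np.linspace(-5.0, 5.0, 201)
sig = 1.0 / (1.0 + np.exp(-theta * x))

J_l2 = (sig - y) ** 2                                 # L2 loss
J_log = y * np.log(sig) + (1 - y) * np.log(1 - sig)   # logistic objective

d2_l2 = np.diff(J_l2, 2)    # convex curves have all second differences >= 0
d2_log = np.diff(J_log, 2)  # concave curves have all second differences <= 0
print((d2_l2 < 0).any() and (d2_l2 > 0).any())  # True: L2 loss is neither convex nor concave
print((d2_log <= 1e-12).all())                  # True: logistic objective is concave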

The Maximum-Likelihood Perspective

We can show that

\arg\min_\theta J(\theta)
= \arg\min_\theta \sum_{n=1}^{N} -\left\{ y_n \log h_\theta(x_n) + (1 - y_n)\log(1 - h_\theta(x_n)) \right\}
= \arg\min_\theta -\log\left( \prod_{n=1}^{N} h_\theta(x_n)^{y_n} (1 - h_\theta(x_n))^{1 - y_n} \right)
= \arg\max_\theta \prod_{n=1}^{N} \left\{ h_\theta(x_n)^{y_n} (1 - h_\theta(x_n))^{1 - y_n} \right\}.

This is maximum-likelihood for a Bernoulli random variable yn

The underlying probability is hθ(xn)
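A numerical sanity check that minimizing the cross-entropy is the same as maximizing this Bernoulli likelihood (a small sketch; the probabilities below are made up):

import numpy as np

y = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.9, 0.2, 0.6, 0.7])  # h_theta(x_n), any values in (0, 1)

cross_entropy = -np.sum(y * np.log(h) + (1 - y) * np.log(1 - h))
neg_log_likelihood = -np.log(np.prod(h**y * (1 - h)**(1 - y)))
print(np.isclose(cross_entropy, neg_log_likelihood))  # True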

Interpreting h(xn)

Maximum-likelihood Bernoulli:

\theta^* = \arg\max_\theta \prod_{n=1}^{N} \left\{ h_\theta(x_n)^{y_n} (1 - h_\theta(x_n))^{1 - y_n} \right\}.

We can interpret hθ(xn) as a probability p. So:

hθ(xn) = p, and 1− hθ(xn) = 1− p.

But p is a function of xn. So how about

hθ(xn) = p(xn), and 1− hθ(xn) = 1− p(xn).

And this probability is “after” you see xn. So how about

hθ(xn) = p(1 | xn), and 1− hθ(xn) = 1− p(1 | xn) = p(0 | xn).

So hθ(xn) is the posterior probability of class 1 given xn.

Log-Odds

Let us rewrite J as

J(\theta) = \sum_{n=1}^{N} -\left\{ y_n \log h_\theta(x_n) + (1 - y_n)\log(1 - h_\theta(x_n)) \right\}
= \sum_{n=1}^{N} -\left\{ y_n \log\left(\frac{h_\theta(x_n)}{1 - h_\theta(x_n)}\right) + \log(1 - h_\theta(x_n)) \right\}

In statistics, the term \log\left(\frac{h_\theta(x_n)}{1 - h_\theta(x_n)}\right) is called the log-odds.

If we put h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}}, we can show that

\log\left(\frac{h_\theta(x)}{1 - h_\theta(x)}\right)
= \log\left(\frac{\frac{1}{1 + e^{-\theta^T x}}}{\frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}}\right)
= \log\left(e^{\theta^T x}\right) = \theta^T x.

Logistic regression is linear in the log-odds.
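A two-line numerical check of this identity (a small sketch; theta and x below are arbitrary):

import numpy as np

theta = np.array([0.7, -1.3, 0.2])
x = np.array([1.5, 0.4, -2.0])
h = 1.0 / (1.0 + np.exp(-(theta @ x)))
print(np.isclose(np.log(h / (1 - h)), theta @ x))  # True: the log-odds equals theta^T x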

Convexity of Logistic Training Loss

Recall that

J(\theta) = \sum_{n=1}^{N} -\left\{ y_n \log\left(\frac{h_\theta(x_n)}{1 - h_\theta(x_n)}\right) + \log(1 - h_\theta(x_n)) \right\}

The first term is -y_n \theta^T x_n (by the log-odds identity), which is linear in \theta and hence convex.

The second term: its gradient is

\nabla_\theta[-\log(1 - h_\theta(x))]
= -\nabla_\theta\left[\log\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right)\right]
= -\nabla_\theta\left[\log\frac{e^{-\theta^T x}}{1 + e^{-\theta^T x}}\right]
= -\nabla_\theta\left[\log e^{-\theta^T x} - \log\left(1 + e^{-\theta^T x}\right)\right]
= -\nabla_\theta\left[-\theta^T x - \log\left(1 + e^{-\theta^T x}\right)\right]
= x + \nabla_\theta\left[\log\left(1 + e^{-\theta^T x}\right)\right]
= x + \left(\frac{-e^{-\theta^T x}}{1 + e^{-\theta^T x}}\right)x = h_\theta(x)\,x.

Convexity of Logistic Training Loss

The gradient of the second term is

\nabla_\theta[-\log(1 - h_\theta(x))] = h_\theta(x)\,x.

The Hessian is

\nabla^2_\theta[-\log(1 - h_\theta(x))] = \nabla_\theta\left[h_\theta(x)\,x\right]
= \nabla_\theta\left[\left(\frac{1}{1 + e^{-\theta^T x}}\right)x\right]
= \left(\frac{1}{(1 + e^{-\theta^T x})^2}\right)\left(e^{-\theta^T x}\right) x x^T
= \left(\frac{1}{1 + e^{-\theta^T x}}\right)\left(1 - \frac{1}{1 + e^{-\theta^T x}}\right) x x^T
= h_\theta(x)\,[1 - h_\theta(x)]\, x x^T.

Convexity of Logistic Training Loss

For any v \in \mathbb{R}^d, we have that

v^T \nabla^2_\theta[-\log(1 - h_\theta(x))]\, v
= v^T \left[ h_\theta(x)[1 - h_\theta(x)]\, x x^T \right] v
= h_\theta(x)[1 - h_\theta(x)]\,(v^T x)^2 \ge 0.

Therefore the Hessian is positive semi-definite.

So −log(1 − hθ(x)) is convex in θ.

Conclusion: The training loss function

J(\theta) = \sum_{n=1}^{N} -\left\{ y_n \log\left(\frac{h_\theta(x_n)}{1 - h_\theta(x_n)}\right) + \log(1 - h_\theta(x_n)) \right\}

is convex in θ.

So we can use convex optimization algorithms to find θ.
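For example, plain gradient descent works: combining the pieces above, the full gradient is ∇J(θ) = Σₙ (hθ(xn) − yn) xn. Below is a minimal sketch (fixed step size, synthetic data; not the course's implementation):

import numpy as np

def grad_J(theta, X, y):
    # gradient of the logistic training loss: sum_n (h_theta(x_n) - y_n) x_n
    h = 1.0 / (1.0 + np.exp(-(X @ theta)))
    return X.T @ (h - y)

# synthetic data: bias folded into a constant column, true theta chosen arbitrarily
rng = np.random.default_rng(0)
X = np.hstack([rng.normal(size=(200, 2)), np.ones((200, 1))])
theta_true = np.array([1.0, -2.0, 0.5])
y = (rng.random(200) < 1.0 / (1.0 + np.exp(-(X @ theta_true)))).astype(float)

theta = np.zeros(3)
for _ in range(5000):                   # fixed-step gradient descent
    theta -= 0.01 * grad_J(theta, X, y)
print(theta)                            # roughly recovers theta_true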

Convex Optimization for Logistic Regression

We can use CVX to solve the logistic regression problem

But it requires some re-organization of the equations

J(\theta) = \sum_{n=1}^{N} -\left\{ y_n \theta^T x_n + \log(1 - h_\theta(x_n)) \right\}
= \sum_{n=1}^{N} -\left\{ y_n \theta^T x_n + \log\left(1 - \frac{e^{\theta^T x_n}}{1 + e^{\theta^T x_n}}\right) \right\}
= \sum_{n=1}^{N} -\left\{ y_n \theta^T x_n - \log\left(1 + e^{\theta^T x_n}\right) \right\}
= -\left(\sum_{n=1}^{N} y_n x_n\right)^T \theta + \sum_{n=1}^{N} \log\left(1 + e^{\theta^T x_n}\right).

The last term is a sum of log-sum-exp terms, \log(e^{0} + e^{\theta^T x_n}), which are convex in θ.
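The slides use CVX (MATLAB). As an illustration only, here is a hedged Python sketch of the same reorganized objective solved with scipy; np.logaddexp(0, t) evaluates log(e^0 + e^t) stably:

import numpy as np
from scipy.optimize import minimize

def J(theta, X, y):
    # -(sum_n y_n x_n)^T theta + sum_n log(1 + exp(theta^T x_n))
    t = X @ theta
    return -np.sum(y * t) + np.sum(np.logaddexp(0.0, t))

# toy data with a bias column; since J is convex, any decent solver finds the global minimum
rng = np.random.default_rng(1)
X = np.hstack([rng.normal(size=(100, 2)), np.ones((100, 1))])
p = 1.0 / (1.0 + np.exp(-(X @ np.array([1.0, -1.0, 0.2]))))
y = (rng.random(100) < p).astype(float)

res = minimize(J, np.zeros(3), args=(X, y), method="BFGS")
print(res.x)  # estimated theta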

Convex Optimization for Logistic Regression

[Figure: logistic regression fit on 1D data with x ∈ [0, 10] and outputs in [0, 1]; the plot overlays the data points, the estimated sigmoid, and the true sigmoid.]

Reading List

Logistic Regression (Machine Learning Perspective)

Chris Bishop's Pattern Recognition and Machine Learning, Chapter 4.3

Hastie-Tibshirani-Friedman's Elements of Statistical Learning, Chapter 4.4

Stanford CS 229 Discriminant Algorithms: http://cs229.stanford.edu/notes/cs229-notes1.pdf

CMU Lecture: https://www.stat.cmu.edu/~cshalizi/uADA/12/lectures/ch12.pdf

Stanford Language Processing: https://web.stanford.edu/~jurafsky/slp3/ (Lecture 5)

Logistic Regression (Statistics Perspective)

Duke Lecture: https://www2.stat.duke.edu/courses/Spring13/sta102.001/Lec/Lec20.pdf

Princeton Lecture: https://data.princeton.edu/wws509/notes/c3.pdf
