MLE vs. MAP
Aarti Singh · Machine Learning 10-701/15-781 · Sept 15, 2010
Source: aarti/Class/10701/slides/Lecture3.pdf

TRANSCRIPT

Page 1: MLE vs. MAP

Aarti Singh
Machine Learning 10-701/15-781
Sept 15, 2010

Page 2: MLE vs. MAP

When is MAP the same as MLE?

Maximum Likelihood Estimation (MLE)

Choose the value that maximizes the probability of the observed data.

Maximum A Posteriori (MAP) Estimation

Choose the value that is most probable given the observed data and prior belief.
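In symbols (standard definitions, written out here since the slide's equations did not extract):

$$\hat{\theta}_{MLE} = \arg\max_{\theta} P(D \mid \theta), \qquad \hat{\theta}_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \arg\max_{\theta} P(D \mid \theta)\,P(\theta)$$

The two coincide when the prior P(θ) is uniform over the parameter range, since a constant factor does not change the argmax.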

Page 3: MAP using Conjugate Prior

Coin flip problem:

• The likelihood is Binomial.
• If the prior is a Beta distribution, then the posterior is also a Beta distribution.
• For the Binomial likelihood, the conjugate prior is the Beta distribution.
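Concretely (a standard worked form of the update; the symbols are my labels, not the slide's): if the prior is Beta(β_H, β_T) and the data D contain α_H heads and α_T tails, then

$$P(\theta \mid D) \;\propto\; \theta^{\alpha_H}(1-\theta)^{\alpha_T} \cdot \theta^{\beta_H - 1}(1-\theta)^{\beta_T - 1} \;\Rightarrow\; \theta \mid D \sim \mathrm{Beta}(\alpha_H + \beta_H,\; \alpha_T + \beta_T)$$

The posterior has the same functional form as the prior, which is exactly what "conjugate" means.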

Page 4: MLE vs. MAP

• A Beta prior is equivalent to extra coin flips (regularization).
• As n → ∞, the prior is "forgotten".
• But for small sample sizes, the prior is important!

What if we toss the coin too few times?

• You say: Probability next toss is a head = 0

• Billionaire says: You’re fired! …with prob 1
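A tiny numeric sketch of the fix (my own illustration, not from the slides), assuming a Beta(2, 2) prior and three tosses that all come up tails:

```python
# MLE vs. MAP for a coin's head probability after very few tosses.
# A Beta(beta_H=2, beta_T=2) prior acts like one extra imagined head and tail.
beta_H, beta_T = 2, 2
heads, tails = 0, 3            # observed: three tosses, all tails

theta_mle = heads / (heads + tails)
# MAP = mode of the Beta(heads + beta_H, tails + beta_T) posterior
theta_map = (heads + beta_H - 1) / (heads + tails + beta_H + beta_T - 2)

print(f"MLE: {theta_mle:.2f}, MAP: {theta_map:.2f}")   # MLE: 0.00, MAP: 0.20
```

The MAP estimate never hits the degenerate 0; that is exactly the regularization the extra "virtual flips" provide.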

Page 5: Bayesians vs. Frequentists

[Figure: the two camps trading criticisms. To the frequentist: "You are no good when the sample is small." To the Bayesian: "You give a different answer for different priors."]

Page 6: What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?

• You say: Let me tell you about Gaussians…

$$P(x) = N(\mu, \sigma^2) = \frac{1}{\sigma\sqrt{2\pi}}\, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$

[Figure: two Gaussian density curves, both with mean μ = 0 but different variances σ².]

Page 7: Gaussian distribution

• Data: D = {x₁, …, xₙ} (the observed sleep hours)
• Parameters: μ (mean), σ² (variance)
• Sleep hours are i.i.d.:
  – Independent events
  – Identically distributed according to a Gaussian distribution

[Figure: observed sleep-hour values plotted along an axis from roughly 3 to 9.]

Page 8: Properties of Gaussians

• Affine transformation (multiplying by a scalar and adding a constant):
  – X ~ N(μ, σ²)
  – Y = aX + b  →  Y ~ N(aμ + b, a²σ²)
• Sum of (independent) Gaussians:
  – X ~ N(μ_X, σ²_X) and Y ~ N(μ_Y, σ²_Y)
  – Z = X + Y  →  Z ~ N(μ_X + μ_Y, σ²_X + σ²_Y)
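A quick numerical sanity check of both properties (my own sketch, not part of the slides):

```python
import numpy as np

rng = np.random.default_rng(0)

# Affine transformation: X ~ N(1, 4), so Y = 3X + 5 should be ~ N(8, 36)
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)
y = 3 * x + 5
print(y.mean(), y.var())        # approximately 8.0 and 36.0

# Sum of independent Gaussians: X + N(-2, 1) should be ~ N(-1, 5)
z = x + rng.normal(loc=-2.0, scale=1.0, size=x.size)
print(z.mean(), z.var())        # approximately -1.0 and 5.0
```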

Page 9: MLE for Gaussian mean and variance
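Maximizing the log-likelihood and setting the derivatives to zero gives the standard estimates:

$$\hat{\mu}_{MLE} = \frac{1}{n}\sum_{i=1}^{n} x_i, \qquad \hat{\sigma}^2_{MLE} = \frac{1}{n}\sum_{i=1}^{n}\left(x_i - \hat{\mu}_{MLE}\right)^2$$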

Page 10: MLE for Gaussian mean and variance

Note: the MLE for the variance of a Gaussian is biased.

– The expected result of the estimation is not the true parameter!
– Unbiased variance estimator:
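The standard correction divides by n - 1 instead of n:

$$\hat{\sigma}^2_{unbiased} = \frac{1}{n-1}\sum_{i=1}^{n}\left(x_i - \hat{\mu}\right)^2$$

Dividing by n - 1 compensates for the fact that μ̂ was itself fit to the same data.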


Page 11: MAP for Gaussian mean and variance

• Conjugate priors:
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for the mean: P(μ) = N(η, λ²)

Page 12: MAP for Gaussian Mean

(Assuming known variance σ²)

MAP under a Gauss-Wishart prior: homework.
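The formula the slide most likely showed here is the standard posterior mean/mode for this model, i.e. for the N(η, λ²) prior with known variance σ²:

$$\hat{\mu}_{MAP} = \frac{\sigma^2\,\eta + \lambda^2 \sum_{i=1}^{n} x_i}{\sigma^2 + n\,\lambda^2}$$

As n → ∞ the data term dominates and μ̂_MAP approaches the sample mean μ̂_MLE, matching the earlier point that the prior is "forgotten" with enough data.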

Page 13: What you should know…

• Learning parametric distributions: form known, parameters unknown
  – Bernoulli (θ, the probability of heads)
  – Gaussian (μ, the mean, and σ², the variance)
• MLE
• MAP

Page 14: What loss function are we minimizing?

• Learning distributions/densities is unsupervised learning.
• Task: learn P(X; θ) (the form of P is known, but θ is not)
• Experience: D = {x₁, …, xₙ}
• Performance: negative log-likelihood loss
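Concretely, with i.i.d. samples the loss being minimized is

$$-\log P(D \mid \theta) = -\sum_{i=1}^{n} \log P(x_i \mid \theta),$$

so maximizing the likelihood (MLE) is exactly minimizing the negative log-likelihood loss.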

Page 15: Recitation Tomorrow!

• Linear Algebra and Matlab

• Strongly recommended!!

• Place: NSH 1507 (Note: change from last time)

• Time: 5-6 pm


Leman

Page 16: Bayes Optimal Classifier

Aarti Singh

Machine Learning 10-701/15-781
Sept 15, 2010

Page 17: Classification

Goal: classify items with features X into labels Y.

[Figure: example with documents as features X and labels Y ∈ {Sports, Science, News}; performance is measured by the probability of error.]

Page 18: Optimal Classification

Optimal predictor (the Bayes classifier):

• Even the optimal classifier makes mistakes: R(f*) > 0.
• The optimal classifier depends on the unknown distribution.

R(f*) is called the Bayes risk.
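In standard notation, the risk of a classifier f and the Bayes-optimal predictor are

$$R(f) = P(f(X) \neq Y), \qquad f^* = \arg\min_{f} R(f),$$

and the Bayes risk R(f*) is the smallest error probability any classifier can achieve on this distribution.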

Page 19: Optimal Classifier

Bayes rule and the optimal classifier are built from two ingredients, the class-conditional density and the class prior (see the formulas below).
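Written out in standard form:

$$P(Y = y \mid X = x) = \frac{\overbrace{P(X = x \mid Y = y)}^{\text{class-conditional density}}\;\overbrace{P(Y = y)}^{\text{class prior}}}{P(X = x)}$$

$$f^*(x) = \arg\max_{y}\; P(X = x \mid Y = y)\, P(Y = y)$$

The denominator P(X = x) does not depend on y, so it drops out of the argmax.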

Page 20: Example Decision Boundaries

• Gaussian class-conditional densities (1 dimension/feature)

[Figure: two 1-D Gaussian class-conditional densities; the decision boundary lies where the two weighted densities are equal.]

Page 21: Example Decision Boundaries

• Gaussian class-conditional densities (2 dimensions/features)

[Figure: two 2-D Gaussian class-conditional densities and the resulting decision boundary in the feature plane.]

Page 22: Learning the Optimal Classifier

Optimal classifier: f*(x) = argmax_y P(X = x | Y = y) P(Y = y)

Need to know:
• the class prior, P(Y = y), for all y
• the class-conditional density (likelihood), P(X = x | Y = y), for all x, y

Page 23: Learning the Optimal Classifier

Task: predict whether or not a picnic spot is enjoyable.

Training data: n rows of features X = (X₁, X₂, X₃, …, X_d) with a label Y.

Let's learn P(Y|X). How many parameters?

• Prior, P(Y = y) for all y: K − 1, if there are K labels
• Likelihood, P(X = x | Y = y) for all x, y: (2^d − 1)K, if there are d binary features
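Adding the two counts gives the total parameter count quoted on the next slide:

$$(K - 1) + (2^d - 1)K = 2^d K - 1$$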

Page 24: Learning the Optimal Classifier

Task: predict whether or not a picnic spot is enjoyable.

Training data: n rows of features X = (X₁, X₂, X₃, …, X_d) with a label Y.

Let's learn P(Y|X). How many parameters?

2^d K − 1 (K classes, d binary features)

We need n ≫ 2^d K − 1 training examples to learn all the parameters.
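To see how quickly this becomes hopeless, a quick count (my own illustration) for K = 2 classes as the number of binary features d grows:

```python
# Number of parameters needed to learn P(Y|X) exactly: 2^d * K - 1
K = 2
for d in (5, 10, 20, 30):
    print(f"d = {d:2d}: {2**d * K - 1:,} parameters")
# d =  5: 63
# d = 10: 2,047
# d = 20: 2,097,151
# d = 30: 2,147,483,647
```

Thirty binary features already demand more training rows than there are people on Earth, which is why exact tabular learning of P(Y|X) is infeasible.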