Naïve Bayes Lecture 17
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri
Bayesian Learning
• Use Bayes’ rule:

  P(θ | D) = P(D | θ) P(θ) / P(D)

  (Posterior = Data Likelihood × Prior / Normalization)

• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)
• For uniform priors, this reduces to maximum likelihood estimation!
Recall the MLE for the coin-flip (Bernoulli) model, with α_H observed heads and α_T observed tails:

θ̂_MLE = argmax_θ ln P(D | θ)

d/dθ ln P(D | θ) = d/dθ [ln θ^α_H (1 − θ)^α_T] = d/dθ [α_H ln θ + α_T ln(1 − θ)] = α_H/θ − α_T/(1 − θ) = 0

Sample complexity (Hoeffding bound): to be within ε of the true θ with probability at least 1 − δ,

δ ≥ 2e^(−2Nε²) ≥ P(mistake)
ln δ ≥ ln 2 − 2Nε²
N ≥ ln(2/δ) / (2ε²)
N ≥ ln(2/0.05) / (2 × 0.1²) ≈ 3.8 / 0.02 = 190

Uniform prior: P(θ) ∝ 1
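To make the coin-flip review concrete, here is a minimal Python sketch (the counts and the ε, δ values are hypothetical, not from the slides) computing the MLE and the Hoeffding-style sample-size bound above:

```python
import math

# Hypothetical observed coin flips: alpha_H heads, alpha_T tails
alpha_H, alpha_T = 72, 28

# Setting the derivative of the log-likelihood to zero gives
# theta_MLE = alpha_H / (alpha_H + alpha_T)
theta_mle = alpha_H / (alpha_H + alpha_T)
print(f"theta_MLE = {theta_mle:.3f}")

# Hoeffding-style sample complexity: to be within eps of the true theta
# with probability at least 1 - delta, we need N >= ln(2/delta) / (2 * eps^2)
def samples_needed(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(eps=0.1, delta=0.05))  # 185 (the slide rounds ln(2/0.05) to 3.8, giving 190)
```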
Prior + Data → Posterior
• Likelihood function: P(D | θ) = θ^α_H (1 − θ)^α_T
• Posterior: P(θ | D) ∝ P(D | θ) P(θ)
• With a Beta(β_H, β_T) prior, P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1), the posterior is again a Beta distribution:

P(θ | D) ∝ θ^α_H (1 − θ)^α_T · θ^(β_H − 1) (1 − θ)^(β_T − 1) = θ^(α_H + β_H − 1) (1 − θ)^(α_T + β_T − 1) = Beta(α_H + β_H, α_T + β_T)
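A minimal sketch of this conjugate update using scipy.stats (the prior pseudo-counts and data below are made-up for illustration):

```python
from scipy.stats import beta

# Hypothetical Beta prior pseudo-counts and observed coin flips
beta_H, beta_T = 3, 3        # Beta(3, 3) prior on theta
alpha_H, alpha_T = 72, 28    # observed heads and tails

# Conjugacy: the posterior is Beta(alpha_H + beta_H, alpha_T + beta_T)
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)
print("posterior mean:", posterior.mean())

# MAP estimate = mode of the Beta posterior, (a - 1) / (a + b - 2)
a, b = alpha_H + beta_H, alpha_T + beta_T
print("MAP estimate:", (a - 1) / (a + b - 2))
```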
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
  – X ~ N(µ, σ²), Y = aX + b  ⟹  Y ~ N(aµ + b, a²σ²)
• Sum of independent Gaussians is Gaussian:
  – X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y  ⟹  Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!
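A quick numerical sanity check of these two closure properties (a numpy sketch; the particular parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X ~ N(mu, sigma^2)
mu, sigma = 2.0, 3.0
X = rng.normal(mu, sigma, n)

# Affine transformation: Y = aX + b should be ~ N(a*mu + b, a^2 * sigma^2)
a, b = 0.5, 1.0
Y = a * X + b
print(Y.mean(), Y.std())   # ~2.0 and ~1.5

# Sum of independent Gaussians: Z = X + W should be
# ~ N(mu_X + mu_W, sigma_X^2 + sigma_W^2)
W = rng.normal(-1.0, 2.0, n)
Z = X + W
print(Z.mean(), Z.var())   # ~1.0 and ~13.0
```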
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: µ
  – Variance: σ²

  i    Exam Score (x_i)
  0    85
  1    95
  2    100
  3    12
  …    …
  99   89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x_1, …, x_N}:
  P(D | µ, σ) = (1 / (σ√(2π)))^N ∏_{i=1}^N e^(−(x_i − µ)² / (2σ²))
• Log-likelihood of data:
  ln P(D | µ, σ) = −N ln(σ√(2π)) − ∑_{i=1}^N (x_i − µ)² / (2σ²)

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)
Your second learning algorithm: MLE for mean of a Gaussian
• What’s MLE for mean?
µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

Setting the derivative of the log-likelihood with respect to µ to zero:

−∑_{i=1}^N (x_i − µ) / σ² = 0
⟹ −∑_{i=1}^N x_i + Nµ = 0
⟹ µ_MLE = (1/N) ∑_{i=1}^N x_i
MLE for variance
• Again, set the derivative to zero:

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dσ ln P(D | µ, σ) = −N/σ + ∑_{i=1}^N (x_i − µ)² / σ³ = 0
⟹ σ²_MLE = (1/N) ∑_{i=1}^N (x_i − µ_MLE)²
Learning Gaussian parameters
• MLE:
  – µ_MLE = (1/N) ∑_{i=1}^N x_i
  – σ²_MLE = (1/N) ∑_{i=1}^N (x_i − µ_MLE)²
• BTW: the MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator: σ²_unbiased = (1/(N − 1)) ∑_{i=1}^N (x_i − µ_MLE)²
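A minimal numpy sketch of these estimators (the sample below is a made-up stand-in for the exam-score data):

```python
import numpy as np

# Hypothetical i.i.d. sample (e.g., exam scores)
x = np.array([85, 95, 100, 12, 89], dtype=float)
N = len(x)

mu_mle = x.sum() / N                                 # (1/N) * sum_i x_i
var_mle = ((x - mu_mle) ** 2).sum() / N              # biased MLE of the variance
var_unbiased = ((x - mu_mle) ** 2).sum() / (N - 1)   # unbiased estimator

# numpy equivalents: ddof=0 gives the MLE, ddof=1 the unbiased version
assert np.isclose(var_mle, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))
print(mu_mle, var_mle, var_unbiased)
```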
Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean: a Gaussian, µ ~ N(η, σ₀²), for some hyperparameters η and σ₀²
Bayesian Prediction
Definition: the expected conditional loss of predicting ŷ ∈ Y is
  L[ŷ | x] = ∑_{y∈Y} L(ŷ, y) Pr[y | x].
Bayesian decision: predict the class minimizing the expected conditional loss, that is
  ŷ* = argmin_ŷ L[ŷ | x] = argmin_ŷ ∑_{y∈Y} L(ŷ, y) Pr[y | x].
• Zero-one loss: ŷ* = argmax_ŷ Pr[ŷ | x] (the Maximum a Posteriori (MAP) principle).
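To illustrate the difference between the general Bayes decision and the zero-one/MAP rule, a small sketch (the posterior and loss matrix are made-up numbers):

```python
import numpy as np

# Hypothetical posterior Pr[y | x] over three classes for a single input x
posterior = np.array([0.5, 0.3, 0.2])

# Hypothetical loss matrix: loss[y_hat, y] = cost of predicting y_hat when the truth is y
loss = np.array([
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])

# Expected conditional loss: L[y_hat | x] = sum_y loss(y_hat, y) * Pr[y | x]
expected_loss = loss @ posterior
bayes_decision = expected_loss.argmin()   # minimizes expected loss -> class 1 here

# Zero-one loss reduces to the MAP rule: predict argmax_y Pr[y | x]
map_decision = posterior.argmax()         # class 0 here

print(expected_loss, bayes_decision, map_decision)
```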
Binary Classification - Illustration
[Figure: the posterior probabilities Pr[y1 | x] and Pr[y2 | x] plotted against x, each ranging between 0 and 1]
Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting according to the rule
  ŷ = argmax_{y∈Y} Pr[y | x].
Equivalently, by the Bayes formula:
  ŷ = argmax_{y∈Y} Pr[x | y] Pr[y] / Pr[x] = argmax_{y∈Y} Pr[x | y] Pr[y].
How do we determine Pr[x | y] and Pr[y]? Density estimation problem.
Density Estimation
Data: sample drawn i.i.d. from set X according to some distribution D,
  x_1, …, x_m ∈ X.
Problem: find distribution p out of a set P that best estimates D.
• Note: we will study density estimation specifically in a future lecture.
[Slide from Mehryar Mohri]
Density estimation
• Can make a parametric assumption, e.g. that Pr(x|y) is a multivariate Gaussian distribution
• When the dimension of x is small enough, can use a non-parametric approach (e.g., kernel density estimation)
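For the non-parametric route, a minimal 1-D kernel density estimation sketch with a Gaussian kernel (the bandwidth and data are arbitrary illustrative choices):

```python
import numpy as np

def kde(query, data, bandwidth=0.5):
    # Gaussian-kernel density estimate: p(x) = (1/m) * sum_i N(x; x_i, bandwidth^2)
    diffs = (query[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical 1-D sample and evaluation grid
data = np.array([1.1, 1.9, 2.2, 2.8, 7.0])
grid = np.linspace(0.0, 9.0, 5)
print(kde(grid, data))
```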
Difficulty of (naively) estimating high-dimensional distributions
• Can we directly estimate the data distribution P(X, Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes → k − 1 parameters
  – Likelihood, P(X|Y):
    • Suppose X is composed of n binary features → k(2^n − 1) parameters
• Complex model → High variance with limited data!!!
Conditional Independence
• X is conditionally independent of Y given Z, if the probability distribution for X is independent of the value of Y, given the value of Z
• e.g., P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
• Equivalent to: P(X | Y, Z) = P(X | Z)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:
    P(X_1, X_2 | Y) = P(X_1 | X_2, Y) P(X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
  – More generally:
    P(X_1, …, X_n | Y) = ∏_{i=1}^n P(X_i | Y)
• How many parameters now?
  • Suppose X is composed of n binary features → only kn parameters, instead of k(2^n − 1)
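A tiny numeric comparison of the two parameter counts (assuming k classes and n binary features, as above):

```python
def full_joint_params(k, n):
    # P(X|Y): 2^n - 1 free parameters per class, plus k - 1 for the prior P(Y)
    return k * (2 ** n - 1) + (k - 1)

def naive_bayes_params(k, n):
    # P(X_i|Y): one parameter per binary feature per class, plus k - 1 for P(Y)
    return k * n + (k - 1)

for n in (10, 20, 30):
    print(n, full_joint_params(2, n), naive_bayes_params(2, n))
# With 2 classes and 30 binary features: ~2.1 billion parameters vs. 61.
```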
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each X_i, we have likelihood P(X_i | Y)
• Decision rule:
  y* = argmax_y P(y) ∏_{i=1}^n P(x_i | y)
• If the conditional independence assumption holds, NB is the optimal classifier! (it typically doesn't)
[Graphical model: class variable Y with children X_1, X_2, …, X_n]
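A minimal sketch of this classifier for binary features, with Laplace smoothing added for numerical stability (a standard trick, not discussed on these slides); the class name and data are hypothetical:

```python
import numpy as np

class BernoulliNaiveBayes:
    def fit(self, X, y, alpha=1.0):
        # X: (m, n) binary feature matrix; y: (m,) labels in {0, ..., k-1}
        self.classes = np.unique(y)
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        # theta[c, i] = P(X_i = 1 | Y = c), with add-alpha (Laplace) smoothing
        self.theta = np.array([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in self.classes
        ])
        return self

    def predict(self, X):
        # Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)
        log_lik = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]

# Toy usage with made-up data
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
print(BernoulliNaiveBayes().fit(X, y).predict(X))
```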