
Page 1:

Naïve Bayes, Lecture 17

David Sontag, New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri

Page 2:

Bayesian Learning

• Use Bayes' rule:

   P(θ | D) = P(D | θ) P(θ) / P(D)

   (posterior = data likelihood × prior / normalization)

• Or equivalently:

   P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation:

   P(θ | D) ∝ P(D | θ)
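A minimal numeric sketch of this update (not from the slides): the posterior over a coin's heads probability is computed on a grid as likelihood × prior, and with a uniform prior its mode coincides with the MLE. The counts αH = 7, αT = 3 are made up.

```python
import numpy as np

# Grid of candidate values for a coin's heads probability theta.
thetas = np.linspace(0.001, 0.999, 999)

# Made-up data: alpha_H heads, alpha_T tails.
alpha_H, alpha_T = 7, 3

# Data likelihood P(D | theta) and uniform prior P(theta) proportional to 1.
likelihood = thetas**alpha_H * (1 - thetas)**alpha_T
prior = np.ones_like(thetas)

# Posterior proportional to likelihood * prior, normalized over the grid.
posterior = likelihood * prior
posterior /= posterior.sum()

print("posterior mode:", thetas[np.argmax(posterior)])   # approx 0.7
print("MLE:", alpha_H / (alpha_H + alpha_T))              # 0.7, the same point
```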

Page 3:

Prior + Data -> Posterior

• Likelihood function (αH heads, αT tails):

   P(D | θ) = θ^αH (1 − θ)^αT

• Prior: Beta(βH, βT), i.e.

   P(θ) ∝ θ^(βH − 1) (1 − θ)^(βT − 1)

• Posterior:

   P(θ | D) ∝ P(D | θ) P(θ)
            ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
            = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
            = Beta(αH + βH, αT + βT)
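A minimal sketch of this conjugate update in Python (not from the slides): the pseudo-counts βH = βT = 2 and the data counts are made up, and scipy is used only to represent the resulting Beta posterior.

```python
from scipy.stats import beta

beta_H, beta_T = 2.0, 2.0      # made-up Beta prior pseudo-counts
alpha_H, alpha_T = 7, 3        # made-up observed heads / tails

# Conjugacy: the posterior is again a Beta distribution.
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)

print("posterior mean:", posterior.mean())
print("MAP estimate  :", (alpha_H + beta_H - 1) /
      (alpha_H + alpha_T + beta_H + beta_T - 2))
```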

Page 4:

What about continuous variables?

• Billionaire says: If I am measuring a continuous variable, what can you do for me?

• You say: Let me tell you about Gaussians…

Page 5:

Some properties of Gaussians

• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
   – X ~ N(µ, σ²)
   – Y = aX + b  ⇒  Y ~ N(aµ + b, a²σ²)

• The sum of independent Gaussians is Gaussian
   – X ~ N(µX, σ²X),  Y ~ N(µY, σ²Y)
   – Z = X + Y  ⇒  Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!
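A quick Monte Carlo check of the first two properties (a sketch, not from the slides; the parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 1.0, 2.0
a, b = 3.0, -1.0

# Affine transformation of a Gaussian.
x = rng.normal(mu, sigma, size=1_000_000)
y = a * x + b
print(y.mean(), y.std())        # approx a*mu + b = 2.0 and |a|*sigma = 6.0

# Sum of two *independent* Gaussians.
x2 = rng.normal(0.5, 1.5, size=1_000_000)
z = x + x2
print(z.mean(), z.var())        # approx 1.5 and sigma^2 + 1.5^2 = 6.25
```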

Page 6:

Learning a Gaussian

• Collect a bunch of data
   – Hopefully, i.i.d. samples
   – e.g., exam scores

• Learn parameters
   – Mean: µ
   – Variance: σ²

   i     Exam Score (xi)
   0     85
   1     95
   2     100
   3     12
   …     …
   99    89

Page 7:

MLE for Gaussian

• Probability of i.i.d. samples D = {x1, …, xN}:

   P(D | µ, σ) = ∏_{i=1}^N (1 / (σ√(2π))) exp(−(xi − µ)² / (2σ²))

• Log-likelihood of data:

   ln P(D | µ, σ) = −N ln(σ√(2π)) − Σ_{i=1}^N (xi − µ)² / (2σ²)

• MLE:

   µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)
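A minimal sketch of this log-likelihood as a function (not from the slides; the sample values are just a few rows of the exam-score table above):

```python
import numpy as np

def gaussian_log_likelihood(D, mu, sigma):
    """ln P(D | mu, sigma) for i.i.d. samples D under N(mu, sigma^2)."""
    D = np.asarray(D, dtype=float)
    N = D.size
    return -N * np.log(sigma * np.sqrt(2 * np.pi)) \
           - np.sum((D - mu) ** 2) / (2 * sigma ** 2)

scores = [85, 95, 100, 12, 89]
print(gaussian_log_likelihood(scores, mu=76, sigma=30))
```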

Page 8:

Your second learning algorithm: MLE for the mean of a Gaussian

• What's the MLE for the mean? Set the derivative of the log-likelihood with respect to µ to zero:

   d/dµ ln P(D | µ, σ) = Σ_{i=1}^N (xi − µ) / σ² = 0

   ⇒ Σ_{i=1}^N xi − Nµ = 0

   ⇒ µ_MLE = (1/N) Σ_{i=1}^N xi

Page 9:

MLE for the variance

• Again, set the derivative to zero:

   d/dσ ln P(D | µ, σ) = −N/σ + Σ_{i=1}^N (xi − µ)² / σ³ = 0

   ⇒ σ²_MLE = (1/N) Σ_{i=1}^N (xi − µ_MLE)²

Page 10:

Learning Gaussian parameters

• MLE:

   µ_MLE = (1/N) Σ_{i=1}^N xi
   σ²_MLE = (1/N) Σ_{i=1}^N (xi − µ_MLE)²

• By the way, the MLE for the variance of a Gaussian is biased
   – The expected result of the estimation is not the true parameter!
   – Unbiased variance estimator:

   σ²_unbiased = (1/(N − 1)) Σ_{i=1}^N (xi − µ_MLE)²
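A minimal sketch of these estimators in numpy (not from the slides; the data are again the visible exam-score rows):

```python
import numpy as np

scores = np.array([85, 95, 100, 12, 89], dtype=float)
N = scores.size

mu_mle = scores.mean()                                    # (1/N) * sum of xi
var_mle = np.mean((scores - mu_mle) ** 2)                 # biased MLE: divides by N
var_unbiased = np.sum((scores - mu_mle) ** 2) / (N - 1)   # divides by N - 1

print(mu_mle, var_mle, var_unbiased)
print(np.var(scores), np.var(scores, ddof=1))             # numpy equivalents
```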

Page 11:

Bayesian learning of Gaussian parameters

• Conjugate priors
   – Mean: Gaussian prior
   – Variance: Wishart distribution

• Prior for the mean: a Gaussian
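The update equations are not shown in this transcript; the following is a minimal sketch of the conjugate posterior for the mean, assuming the variance σ² is known and the prior on the mean is N(µ0, τ²). All symbol names and numeric values below are illustrative, not from the lecture.

```python
import numpy as np

def posterior_for_mean(data, sigma2, mu0, tau2):
    """Posterior N(mu_n, tau_n^2) over the mean, given known variance sigma2
    and a Gaussian prior N(mu0, tau2) on the mean."""
    data = np.asarray(data, dtype=float)
    n = data.size
    tau_n2 = 1.0 / (1.0 / tau2 + n / sigma2)              # posterior variance
    mu_n = tau_n2 * (mu0 / tau2 + data.sum() / sigma2)    # posterior mean
    return mu_n, tau_n2

print(posterior_for_mean([85, 95, 100, 12, 89], sigma2=900.0, mu0=70.0, tau2=400.0))
```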

Page 12:

[Slide from Mehryar Mohri, Introduction to Machine Learning]

Bayesian Prediction

Definition: the expected conditional loss of predicting ŷ ∈ Y is

   L[ŷ | x] = Σ_{y ∈ Y} L(ŷ, y) Pr[y | x].

Bayesian decision: predict the class minimizing the expected conditional loss, that is

   ŷ* = argmin_ŷ L[ŷ | x] = argmin_ŷ Σ_{y ∈ Y} L(ŷ, y) Pr[y | x].

• For the zero-one loss:

   ŷ* = argmax_ŷ Pr[ŷ | x].

   This is the Maximum a Posteriori (MAP) principle.
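A minimal sketch of the Bayesian decision rule on a toy posterior (made-up probabilities, not from the slides); with the zero-one loss, minimizing the expected conditional loss picks the same class as maximizing the posterior.

```python
# Posterior Pr[y | x] at some fixed x (made-up numbers).
posterior = {"y1": 0.5, "y2": 0.3, "y3": 0.2}

# Loss L(y_hat, y); here the zero-one loss, but any loss table works.
def loss(y_hat, y):
    return 0.0 if y_hat == y else 1.0

def expected_loss(y_hat):
    return sum(loss(y_hat, y) * p for y, p in posterior.items())

y_star = min(posterior, key=expected_loss)   # Bayesian decision
print(y_star)                                # "y1", same as argmax Pr[y | x]
```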

Page 13:

[Slide from Mehryar Mohri, Introduction to Machine Learning]

Binary Classification - Illustration

[Figure: the posterior curves Pr[y1 | x] and Pr[y2 | x] plotted against x, with probabilities between 0 and 1]

Page 14:

[Slide from Mehryar Mohri, Introduction to Machine Learning]

Maximum a Posteriori (MAP)

Definition: the MAP principle consists of predicting according to the rule

   ŷ = argmax_{y ∈ Y} Pr[y | x].

Equivalently, by the Bayes formula:

   ŷ = argmax_{y ∈ Y} Pr[x | y] Pr[y] / Pr[x] = argmax_{y ∈ Y} Pr[x | y] Pr[y].

How do we determine Pr[x | y] and Pr[y]? Density estimation problem.

Page 15:

Density Estimation

Data: sample x1, …, xm ∈ X drawn i.i.d. from the set X according to some distribution D.

Problem: find a distribution p out of a set P that best estimates D.

• Note: we will study density estimation specifically in a future lecture.

[Slide from Mehryar Mohri]

Page 16:

Density estimation

• Can make a parametric assumption, e.g. that Pr(x | y) is a multivariate Gaussian distribution

• When the dimension of x is small enough, can use a non-parametric approach (e.g., kernel density estimation)
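A minimal sketch contrasting the two options on made-up one-dimensional data (not from the slides): fit a Gaussian per class by MLE, or use a crude non-parametric (histogram) density estimate.

```python
import numpy as np

rng = np.random.default_rng(1)

# Made-up 1-D samples for two classes; we estimate Pr[x | y] for each class.
x_pos = rng.normal(2.0, 1.0, size=200)
x_neg = rng.normal(-1.0, 1.5, size=300)

# Parametric: assume each class-conditional is Gaussian and fit by MLE.
def fit_gaussian(x):
    return x.mean(), x.std()          # (mu_MLE, sigma_MLE)

print({"+": fit_gaussian(x_pos), "-": fit_gaussian(x_neg)})

# Non-parametric (very crude): a normalized histogram as a density estimate.
density_pos, edges = np.histogram(x_pos, bins=20, density=True)
print(density_pos[:5])
```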

Page 17:

Difficulty of (naively) estimating high-dimensional distributions

• Can we directly estimate the data distribution P(X, Y)?

• How do we represent these? How many parameters?
   – Prior, P(Y):
      • Suppose Y is composed of k classes: k − 1 parameters
   – Likelihood, P(X | Y):
      • Suppose X is composed of n binary features: k(2ⁿ − 1) parameters

• Complex model → high variance with limited data! (See the count sketched below.)
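A few lines of arithmetic (illustrative values of k and n, not from the slides) showing how fast the full joint blows up:

```python
k = 2      # number of classes (illustrative)
n = 30     # number of binary features (illustrative)

params_prior = k - 1                  # P(Y)
params_likelihood = k * (2**n - 1)    # P(X | Y): a full table over 2^n outcomes per class
print(params_prior + params_likelihood)   # 2147483647, over two billion parameters
```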

Page 18:

Conditional Independence

• X is conditionally independent of Y given Z, if the probability distribution for X is independent of the value of Y, given the value of Z

• e.g.,

• Equivalent to: P(X | Y, Z) = P(X | Z)
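A small numeric check of this definition (a sketch, not from the slides): build a joint that factorizes as P(Z) P(X | Z) P(Y | Z), then verify P(X | Y, Z) = P(X | Z) for every (y, z). All probability tables are made up.

```python
import numpy as np

p_z = np.array([0.4, 0.6])               # P(Z)
p_x_given_z = np.array([[0.8, 0.2],      # row z, column x
                        [0.3, 0.7]])
p_y_given_z = np.array([[0.5, 0.5],      # row z, column y
                        [0.9, 0.1]])

# joint[x, y, z] = P(Z=z) P(X=x | Z=z) P(Y=y | Z=z)
joint = np.einsum("z,zx,zy->xyz", p_z, p_x_given_z, p_y_given_z)

# P(X | Y, Z) obtained by normalizing over x; compare with P(X | Z).
p_x_given_yz = joint / joint.sum(axis=0, keepdims=True)
for y in range(2):
    for z in range(2):
        assert np.allclose(p_x_given_yz[:, y, z], p_x_given_z[z])
print("P(X | Y, Z) = P(X | Z) holds for this factorized joint")
```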

Page 19:

Naïve Bayes

• Naïve Bayes assumption:
   – Features are independent given class:

      P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

   – More generally:

      P(X1, …, Xn | Y) = ∏_{i=1}^n P(Xi | Y)

• How many parameters now? Suppose X is composed of n binary features: only kn likelihood parameters (plus k − 1 for the prior), as counted below.
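Continuing the illustrative count from Page 17 (same made-up k and n):

```python
k, n = 2, 30

params_full_joint = (k - 1) + k * (2**n - 1)   # from Page 17
params_naive_bayes = (k - 1) + k * n           # one Bernoulli parameter per feature per class
print(params_full_joint, params_naive_bayes)   # 2147483647 vs. 61
```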

Page 20:

The Naïve Bayes Classifier

• Given:
   – Prior P(Y)
   – n conditionally independent features X given the class Y
   – For each Xi, we have the likelihood P(Xi | Y)

• Decision rule:

   y* = argmax_y P(y) ∏_{i=1}^n P(xi | y)

If the conditional-independence assumption holds, NB is the optimal classifier! (It typically doesn't hold.)

[Graphical model: class node Y with an arrow to each feature X1, X2, …, Xn]
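A minimal sketch of this decision rule for binary features, using log-probabilities (the prior and the table of P(Xi = 1 | Y) values are made up, not from the slides):

```python
import numpy as np

log_prior = np.log(np.array([0.6, 0.4]))     # log P(y) for y in {0, 1}
theta = np.array([[0.9, 0.2, 0.5],           # theta[y, i] = P(X_i = 1 | Y = y)
                  [0.1, 0.8, 0.5]])

def predict(x):
    """Naive Bayes decision rule: argmax_y log P(y) + sum_i log P(x_i | y)."""
    x = np.asarray(x)
    log_lik = (x * np.log(theta) + (1 - x) * np.log(1 - theta)).sum(axis=1)
    return int(np.argmax(log_prior + log_lik))

print(predict([1, 0, 1]))   # -> 0
print(predict([0, 1, 0]))   # -> 1
```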