Naïve Bayes Lecture 17
David Sontag New York University
Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein, and Mehryar Mohri
Bayesian Learning
• Use Bayes’ rule:

  P(θ | D) = P(D | θ) P(θ) / P(D)

  (Posterior = Data Likelihood × Prior / Normalization)

• Or equivalently: P(θ | D) ∝ P(D | θ) P(θ)
• For uniform priors, this reduces to maximum likelihood estimation!
Recall the MLE for the coin-flip (Bernoulli) model, with α_H observed heads and α_T observed tails:

θ̂_MLE = argmax_θ ln P(D | θ)

d/dθ ln P(D | θ) = d/dθ [ln θ^α_H (1 − θ)^α_T] = d/dθ [α_H ln θ + α_T ln(1 − θ)] = α_H/θ − α_T/(1 − θ) = 0

Sample complexity (Hoeffding bound): to be within ε of the true θ with probability at least 1 − δ,

δ ≥ 2e^(−2Nε²) ≥ P(mistake)
ln δ ≥ ln 2 − 2Nε²
N ≥ ln(2/δ) / (2ε²)
N ≥ ln(2/0.05) / (2 × 0.1²) ≈ 3.8 / 0.02 = 190

Uniform prior: P(θ) ∝ 1
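To make the coin-flip review concrete, here is a minimal Python sketch (the counts and the ε, δ values are hypothetical, not from the slides) computing the MLE and the Hoeffding-style sample-size bound above:

```python
import math

# Hypothetical observed coin flips: alpha_H heads, alpha_T tails
alpha_H, alpha_T = 72, 28

# Setting the derivative of the log-likelihood to zero gives
# theta_MLE = alpha_H / (alpha_H + alpha_T)
theta_mle = alpha_H / (alpha_H + alpha_T)
print(f"theta_MLE = {theta_mle:.3f}")

# Hoeffding-style sample complexity: to be within eps of the true theta
# with probability at least 1 - delta, we need N >= ln(2/delta) / (2 * eps^2)
def samples_needed(eps, delta):
    return math.ceil(math.log(2 / delta) / (2 * eps ** 2))

print(samples_needed(eps=0.1, delta=0.05))  # 185 (the slide rounds ln(2/0.05) to 3.8, giving 190)
```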
Prior + Data → Posterior
• Likelihood function: P(D | θ) = θ^α_H (1 − θ)^α_T
• Posterior: P(θ | D) ∝ P(D | θ) P(θ)
• With a Beta(β_H, β_T) prior, P(θ) ∝ θ^(β_H − 1) (1 − θ)^(β_T − 1), the posterior is again a Beta distribution:

P(θ | D) ∝ θ^α_H (1 − θ)^α_T · θ^(β_H − 1) (1 − θ)^(β_T − 1) = θ^(α_H + β_H − 1) (1 − θ)^(α_T + β_T − 1) = Beta(α_H + β_H, α_T + β_T)
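A minimal sketch of this conjugate update using scipy.stats (the prior pseudo-counts and data below are made-up for illustration):

```python
from scipy.stats import beta

# Hypothetical Beta prior pseudo-counts and observed coin flips
beta_H, beta_T = 3, 3        # Beta(3, 3) prior on theta
alpha_H, alpha_T = 72, 28    # observed heads and tails

# Conjugacy: the posterior is Beta(alpha_H + beta_H, alpha_T + beta_T)
posterior = beta(alpha_H + beta_H, alpha_T + beta_T)
print("posterior mean:", posterior.mean())

# MAP estimate = mode of the Beta posterior, (a - 1) / (a + b - 2)
a, b = alpha_H + beta_H, alpha_T + beta_T
print("MAP estimate:", (a - 1) / (a + b - 2))
```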
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian:
  – X ~ N(µ, σ²), Y = aX + b  ⟹  Y ~ N(aµ + b, a²σ²)
• Sum of independent Gaussians is Gaussian:
  – X ~ N(µ_X, σ²_X), Y ~ N(µ_Y, σ²_Y), Z = X + Y  ⟹  Z ~ N(µ_X + µ_Y, σ²_X + σ²_Y)
• Easy to differentiate, as we will see soon!
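A quick numerical sanity check of these two closure properties (a numpy sketch; the particular parameter values are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 1_000_000

# X ~ N(mu, sigma^2)
mu, sigma = 2.0, 3.0
X = rng.normal(mu, sigma, n)

# Affine transformation: Y = aX + b should be ~ N(a*mu + b, a^2 * sigma^2)
a, b = 0.5, 1.0
Y = a * X + b
print(Y.mean(), Y.std())   # ~2.0 and ~1.5

# Sum of independent Gaussians: Z = X + W should be
# ~ N(mu_X + mu_W, sigma_X^2 + sigma_W^2)
W = rng.normal(-1.0, 2.0, n)
Z = X + W
print(Z.mean(), Z.var())   # ~1.0 and ~13.0
```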
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores
• Learn parameters
  – Mean: µ
  – Variance: σ²

  i    Exam Score (x_i)
  0    85
  1    95
  2    100
  3    12
  …    …
  99   89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x_1, …, x_N}:
  P(D | µ, σ) = (1 / (σ√(2π)))^N ∏_{i=1}^N e^(−(x_i − µ)² / (2σ²))
• Log-likelihood of data:
  ln P(D | µ, σ) = −N ln(σ√(2π)) − ∑_{i=1}^N (x_i − µ)² / (2σ²)

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)
Your second learning algorithm: MLE for mean of a Gaussian
• What’s MLE for mean?
µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

Setting the derivative of the log-likelihood with respect to µ to zero:

−∑_{i=1}^N (x_i − µ) / σ² = 0
⟹ −∑_{i=1}^N x_i + Nµ = 0
⟹ µ_MLE = (1/N) ∑_{i=1}^N x_i
MLE for variance
• Again, set the derivative to zero:

µ_MLE, σ_MLE = argmax_{µ,σ} P(D | µ, σ)

d/dσ ln P(D | µ, σ) = −N/σ + ∑_{i=1}^N (x_i − µ)² / σ³ = 0
⟹ σ²_MLE = (1/N) ∑_{i=1}^N (x_i − µ_MLE)²
Learning Gaussian parameters
• MLE:
  – µ_MLE = (1/N) ∑_{i=1}^N x_i
  – σ²_MLE = (1/N) ∑_{i=1}^N (x_i − µ_MLE)²
• BTW: the MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator: σ²_unbiased = (1/(N − 1)) ∑_{i=1}^N (x_i − µ_MLE)²
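A minimal numpy sketch of these estimators (the sample below is a made-up stand-in for the exam-score data):

```python
import numpy as np

# Hypothetical i.i.d. sample (e.g., exam scores)
x = np.array([85, 95, 100, 12, 89], dtype=float)
N = len(x)

mu_mle = x.sum() / N                                 # (1/N) * sum_i x_i
var_mle = ((x - mu_mle) ** 2).sum() / N              # biased MLE of the variance
var_unbiased = ((x - mu_mle) ** 2).sum() / (N - 1)   # unbiased estimator

# numpy equivalents: ddof=0 gives the MLE, ddof=1 the unbiased version
assert np.isclose(var_mle, np.var(x, ddof=0))
assert np.isclose(var_unbiased, np.var(x, ddof=1))
print(mu_mle, var_mle, var_unbiased)
```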
Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution
• Prior for mean: a Gaussian, µ ~ N(η, σ₀²), for some hyperparameters η and σ₀²
Bayesian Prediction
Definition: the expected conditional loss of predicting ŷ ∈ Y is
  L[ŷ | x] = ∑_{y∈Y} L(ŷ, y) Pr[y | x].
Bayesian decision: predict the class minimizing the expected conditional loss, that is
  ŷ* = argmin_ŷ L[ŷ | x] = argmin_ŷ ∑_{y∈Y} L(ŷ, y) Pr[y | x].
• Zero-one loss: ŷ* = argmax_ŷ Pr[ŷ | x] (the Maximum a Posteriori (MAP) principle).
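To illustrate the difference between the general Bayes decision and the zero-one/MAP rule, a small sketch (the posterior and loss matrix are made-up numbers):

```python
import numpy as np

# Hypothetical posterior Pr[y | x] over three classes for a single input x
posterior = np.array([0.5, 0.3, 0.2])

# Hypothetical loss matrix: loss[y_hat, y] = cost of predicting y_hat when the truth is y
loss = np.array([
    [0.0, 1.0, 5.0],
    [1.0, 0.0, 1.0],
    [2.0, 1.0, 0.0],
])

# Expected conditional loss: L[y_hat | x] = sum_y loss(y_hat, y) * Pr[y | x]
expected_loss = loss @ posterior
bayes_decision = expected_loss.argmin()   # minimizes expected loss -> class 1 here

# Zero-one loss reduces to the MAP rule: predict argmax_y Pr[y | x]
map_decision = posterior.argmax()         # class 0 here

print(expected_loss, bayes_decision, map_decision)
```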
Binary Classification - Illustration
[Figure: the posterior probabilities Pr[y1 | x] and Pr[y2 | x] plotted against x, each ranging between 0 and 1]
Maximum a Posteriori (MAP)
Definition: the MAP principle consists of predicting according to the rule
  ŷ = argmax_{y∈Y} Pr[y | x].
Equivalently, by the Bayes formula:
  ŷ = argmax_{y∈Y} Pr[x | y] Pr[y] / Pr[x] = argmax_{y∈Y} Pr[x | y] Pr[y].
How do we determine Pr[x | y] and Pr[y]? Density estimation problem.
Density Estimation
Data: sample drawn i.i.d. from set X according to some distribution D,
  x_1, …, x_m ∈ X.
Problem: find distribution p out of a set P that best estimates D.
• Note: we will study density estimation specifically in a future lecture.
[Slide from Mehryar Mohri]
Density estimation
• Can make a parametric assumption, e.g. that Pr(x|y) is a multivariate Gaussian distribution
• When the dimension of x is small enough, can use a non-parametric approach (e.g., kernel density estimation)
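For the non-parametric route, a minimal 1-D kernel density estimation sketch with a Gaussian kernel (the bandwidth and data are arbitrary illustrative choices):

```python
import numpy as np

def kde(query, data, bandwidth=0.5):
    # Gaussian-kernel density estimate: p(x) = (1/m) * sum_i N(x; x_i, bandwidth^2)
    diffs = (query[:, None] - data[None, :]) / bandwidth
    kernels = np.exp(-0.5 * diffs ** 2) / (bandwidth * np.sqrt(2 * np.pi))
    return kernels.mean(axis=1)

# Hypothetical 1-D sample and evaluation grid
data = np.array([1.1, 1.9, 2.2, 2.8, 7.0])
grid = np.linspace(0.0, 9.0, 5)
print(kde(grid, data))
```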
Difficulty of (naively) estimating high-dimensional distributions
• Can we directly estimate the data distribution P(X, Y)?
• How do we represent these? How many parameters?
  – Prior, P(Y):
    • Suppose Y is composed of k classes → k − 1 parameters
  – Likelihood, P(X|Y):
    • Suppose X is composed of n binary features → k(2^n − 1) parameters
• Complex model → High variance with limited data!!!
Conditional Independence
• X is conditionally independent of Y given Z, if the probability distribution for X is independent of the value of Y, given the value of Z
• e.g., P(X = x | Y = y, Z = z) = P(X = x | Z = z) for all values x, y, z
• Equivalent to: P(X | Y, Z) = P(X | Z)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:
    P(X_1, X_2 | Y) = P(X_1 | X_2, Y) P(X_2 | Y) = P(X_1 | Y) P(X_2 | Y)
  – More generally:
    P(X_1, …, X_n | Y) = ∏_{i=1}^n P(X_i | Y)
• How many parameters now?
  • Suppose X is composed of n binary features → only kn parameters, instead of k(2^n − 1)
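A tiny numeric comparison of the two parameter counts (assuming k classes and n binary features, as above):

```python
def full_joint_params(k, n):
    # P(X|Y): 2^n - 1 free parameters per class, plus k - 1 for the prior P(Y)
    return k * (2 ** n - 1) + (k - 1)

def naive_bayes_params(k, n):
    # P(X_i|Y): one parameter per binary feature per class, plus k - 1 for P(Y)
    return k * n + (k - 1)

for n in (10, 20, 30):
    print(n, full_joint_params(2, n), naive_bayes_params(2, n))
# With 2 classes and 30 binary features: ~2.1 billion parameters vs. 61.
```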
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each X_i, we have likelihood P(X_i | Y)
• Decision rule:
  y* = argmax_y P(y) ∏_{i=1}^n P(x_i | y)
• If the conditional independence assumption holds, NB is the optimal classifier! (it typically doesn't)
[Graphical model: class variable Y with children X_1, X_2, …, X_n]
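A minimal sketch of this classifier for binary features, with Laplace smoothing added for numerical stability (a standard trick, not discussed on these slides); the class name and data are hypothetical:

```python
import numpy as np

class BernoulliNaiveBayes:
    def fit(self, X, y, alpha=1.0):
        # X: (m, n) binary feature matrix; y: (m,) labels in {0, ..., k-1}
        self.classes = np.unique(y)
        self.log_prior = np.log(np.array([(y == c).mean() for c in self.classes]))
        # theta[c, i] = P(X_i = 1 | Y = c), with add-alpha (Laplace) smoothing
        self.theta = np.array([
            (X[y == c].sum(axis=0) + alpha) / ((y == c).sum() + 2 * alpha)
            for c in self.classes
        ])
        return self

    def predict(self, X):
        # Decision rule: argmax_y  log P(y) + sum_i log P(x_i | y)
        log_lik = X @ np.log(self.theta).T + (1 - X) @ np.log(1 - self.theta).T
        return self.classes[np.argmax(self.log_prior + log_lik, axis=1)]

# Toy usage with made-up data
X = np.array([[1, 0, 1], [1, 1, 1], [0, 0, 1], [0, 1, 0]])
y = np.array([1, 1, 0, 0])
print(BernoulliNaiveBayes().fit(X, y).predict(X))
```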