lecture 4 - generative models for discrete data - part...
TRANSCRIPT
![Page 1: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/1.jpg)
Lecture 4Generative Models for Discrete Data - Part 3
Luigi Freda
ALCOR LabDIAG
University of Rome ”La Sapienza”
October 6, 2017
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 1 / 46
![Page 2: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/2.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 2 / 46
![Page 3: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/3.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 3 / 46
![Page 4: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/4.jpg)
Generative Classifiers vs Discriminative Classifiers
probabilistic classifier
we are given a dataset D = {(xi , yi )}Ni=1
the goal is to compute the class posterior p(y = c|x) which models the mappingy = f (x)
generative classifiers
p(y = c|x) is computed starting from the class-conditional density p(x |y = c,θ)and the class prior p(y = c|θ) given that
p(y = c|x ,θ) ∝ p(x |y = c,θ)p(y = c|θ) (= p(y = c, x |θ))
this is called a generative classifier since it specifies how to generate the featurevector x for each class y = c (by using p(x |y = c,θ))
the model is usually fit by maximizing the joint log-likelihood, i.e. one computesθ∗ = arg max
θ
∑i log p(yi , xi |θ)
discriminative classifiers
the model p(y = c|x) is directly fit to the data
the model is usually fit by maximizing the conditional log-likelihood, i.e. onecomputes θ∗ = arg max
θ
∑i log p(yi |xi ,θ)
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 4 / 46
![Page 5: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/5.jpg)
Naive Bayes ClassifiersBasic Concepts
a Naive Bayes Classifier (NBC) uses a generative approach
let x = [x1, ..., xD ]T be our feature vector with D components1
let y ∈ {1, ...,C} where C is the number of classes
assumption: the D features are assumed to be conditionally independent giventhe class label, i.e.
p(x |y = c,θ) =D∏j=1
p(xj |y = c, θjc)
this is the simplest approach to specify a class-conditional density
it is called ”naive” since we do not actually expect the features to be conditionallyindependent, even conditional to the class label y = c
even if the naive assumption is not true, NBC often works well given that themodel is quite simple and depends on O(CD) parameters and hence is relativelyimmune to overfitting
1one can have x ∈ RD or x ∈ {1, 2, ...,K}D or x ∈ {0, 1}DLuigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 5 / 46
![Page 6: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/6.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 6 / 46
![Page 7: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/7.jpg)
Naive Bayes ClassifiersClass-Conditional Distributions
the form of the class-conditional density depends on the type of each feature
if xj ∈ R we can use the Gaussian distribution
p(x |y = c,θ) =D∏j=1
N (xj |µjc , σ2jc)
where for each class c we specify the mean µjc of feature j and its variance σjc
if xj ∈ {0, 1} we can use the Bernoulli distribution
p(x |y = c,θ) =D∏j=1
Ber(xj |µjc)
where for each class c we specify the probability µjc = p(xj = 1|y = c), i.e. theprobability that feature j occurs
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 7 / 46
![Page 8: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/8.jpg)
Naive Bayes ClassifiersClass-Conditional Distributions
if xj ∈ {1, ...,K} we can use the categorical distribution
p(x |y = c,θ) =D∏j=1
Cat(xj |µjc)
where for each class c we specify the histogram
µjc = [p(xj = 1|y = c), ..., p(xj = K |y = c)]
other kind of features can be conceived and we can mix different kind of features
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 8 / 46
![Page 9: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/9.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 9 / 46
![Page 10: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/10.jpg)
Naive Bayes ClassifiersLikelihood
probability for single data case
p(xi , yi |θ) = p(yi |π)p(xi |yi ,θ) = (NBC assumption) = p(yi |π)∏j
p(xij |yi ,θj)
where θ is a compound vector parameter containing π and θj
since yi ∼ Cat(π)
p(yi |π) =∏c
πI(yi=c)c
for each class c we allocate a specific set of parameters θjc
p(xij |yi ,θj) =∏c
p(xij |θjc)I(yi=c)
hencep(xi , yi |θ) =
∏c
πI(yi=c)c
∏j
∏c
p(xij |θjc)I(yi=c)
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 10 / 46
![Page 11: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/11.jpg)
Naive Bayes ClassifiersLikelihood
the log-likelihood is given by
log p(D|θ) =N∑i=1
log p(xi , yi |θ) =N∑i=1
C∑c=1
log πI(yi=c)c +
N∑i=1
D∑j=1
C∑c=1
log p(xij |θjc)I(yi=c)
=C∑
c=1
Nc log πc +D∑j=1
C∑c=1
∑i :yi=c
log p(xij |θjc)
where Nc ,∑
i I(yi = c) and we assumed as usual that the pairs (xi , yi ) are iid
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 11 / 46
![Page 12: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/12.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 12 / 46
![Page 13: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/13.jpg)
Naive Bayes ClassifiersMLE
the log-likelihood is
log p(D|θ) =C∑
c=1
Nc log πc +D∑j=1
C∑c=1
∑i :yi=c
log p(xij |θjc)
here we have the sum of two terms, the first concerning π = [π1, ..., πC ] and thesecond concerning DC set of parameters θjc
in order to compute the MLE we can optimize the two group of parameters π andθjc separately
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 13 / 46
![Page 14: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/14.jpg)
Naive Bayes ClassifiersMLE
the log-likelihood is
log p(D|θ) =C∑
c=1
Nc log πc +D∑j=1
C∑c=1
∑i :yi=c
log p(xij |θjc)
the first term concerns the labels yi ∼ Cat(π), recall how we computed the MLEof the Dirichlet-multinomial model
the MLE can be computed by optimizing the Lagrangian
l(π, λ) =∑c
Nc log πc + λ
(1−
∑c
πc
)where we enforce the constraint
∑c πc = 1
we impose ∂l∂πc
= 0, ∂l∂λ
= 0 and we obtain the MLE estimation
πc =Nc
N
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 14 / 46
![Page 15: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/15.jpg)
Naive Bayes ClassifiersMLE
the log-likelihood is
log p(D|θ) =C∑
c=1
Nc log πc +D∑j=1
C∑c=1
∑i :yi=c
log p(xij |θjc)
as for the second term optimization, we assume the features xij are binary, i.e.xij ∈ {0, 1}, and xij |y = c ∼ Ber(θjc), hence θjc = θjc ∈ [0, 1]
in this case, we could compute the MLE by using the analysis which wasperformed with the beta-binomial model
doing the math again, we have to optimize the function
J =D∑j=1
C∑c=1
∑i :yi=c
log p(xij |θjc) =D∑j=1
C∑c=1
∑i :yi=c
(I(xij = 1) log θjc+I(xij = 0) log(1−θjc)
)=
=D∑j=1
C∑c=1
Njc log θjc +D∑j=1
C∑c=1
(Nc − Njc) log(1− θjc)
where Njc ,∑
i I(xij = 1, yi = c) and Nc ,∑
i I(yi = c)
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 15 / 46
![Page 16: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/16.jpg)
Naive Bayes ClassifiersMLE
we have to optimize the function
J =D∑j=1
C∑c=1
Njc log θjc +D∑j=1
C∑c=1
(Nc − Njc) log(1− θjc)
where Njc ,∑
i I(xij = 1, yi = c) and Nc ,∑
i I(yi = c)
by imposing∂J
∂θjc= 0 one obtains the MLE estimate
θjc =Njc
Nc
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 16 / 46
![Page 17: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/17.jpg)
Naive Bayes ClassifiersModel Fitting
algorithm: MLE fitting a naive Bayes classifier to binary features (i.e. xi ∈ {0, 1}D)
Nc = 0, Njc = 0 ;for i = 1 : N do
c := yi ; // get the class label of the i-th sample
Nc := Nc + 1;for j = 1 : D do
if xij = 1 thenNjc := Njc + 1
end
end
end
πc = NcN
, θjc =Njc
Nc;
see the naiveBayesFit script for some Matlab code
the algorithm takes O(ND) time
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 17 / 46
![Page 18: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/18.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 18 / 46
![Page 19: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/19.jpg)
Naive Bayes ClassifiersBayesian Reasoning
as we know the MLE estimates can overfit
recall the black swan paradox and the issue of using empiricalfractions Ni/N
a simple solution to overfitting is to be Bayesian
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 19 / 46
![Page 20: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/20.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 20 / 46
![Page 21: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/21.jpg)
The Beta-Binomial ModelPrior
for simplicity we use a factored prior
p(θ) = p(π)D∏j=1
C∏c=1
p(θjc)
where θ is a compound vector parameter containing π, θjc
as for the prior of π we usep(π) = Dir(π|α)
which is a conjugate prior w.r.t. the multinomial part
as for the prior of each θjc we use
p(θjc) = Beta(θjc |β0, β1)
which is a conjugate prior w.r.t. the binomial part
we can obtain a uniform prior by setting α = 1 and β0 = β1 = 1
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 21 / 46
![Page 22: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/22.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 22 / 46
![Page 23: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/23.jpg)
The Beta-Binomial ModelPosterior
factored likelihood
log p(D|θ) = log Cat(y |π) +D∑j=1
C∑c=1
∑i :yi=c
log Ber(xij |θjc)
factored prior
p(θ) = Dir(π|α)D∏j=1
C∏c=1
Beta(θjc |β0, β1)
factored posterior
p(θ|D) = p(π|D)D∏j=1
C∏c=1
p(θjc |D)
p(π|D) = Dir(π|N1 + α1, ...,NC + αC )
p(θjc |D) = Beta(θjc |Njc + β1, (Nc − Njc) + β0)
to compute the posterior we just updates the empirical counts of the likelihoodwith the prior counts
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 23 / 46
![Page 24: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/24.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 24 / 46
![Page 25: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/25.jpg)
The Beta-Binomial ModelMAP
factored posterior
p(θ|D) = p(π|D)D∏j=1
C∏c=1
p(θjc |D)
p(π|D) = Dir(π|N1 + α1, ...,NC + αC )
p(θjc |D) = Beta(θjc |Njc + β1, (Nc − Njc) + β0)
MAP estimate of π = [π1, ..., πC ]
π = arg maxπ
Dir(π|N1 + α1, ...,NC + αC ) =⇒ πc =Nc + αc − 1
N + α0 − C
MAP estimate of θjc for j ∈ {1, ...,D}, c ∈ {1, ...,C}
θjc = arg maxθjc
Beta(θjc |Njc + β1, (Nc − Njc) + β0) =⇒ θjc =Njc + β1 − 1
Nc + β1 + β0 − 2
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 25 / 46
![Page 26: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/26.jpg)
Naive Bayes ClassifiersMAP Model Fitting
algorithm: MAP fitting a naive Bayes classifier to binary features (i.e. xi ∈ {0, 1}D)
Nc = 0, Njc = 0 ;for i = 1 : N do
c := yi ; // get the class label of the i-th sample
Nc := Nc + 1;for j = 1 : D do
if xij = 1 thenNjc := Njc + 1
end
end
end
πc = Nc+αc−1N+α0−C
, θjc =Njc+β1−1
Nc+β1+β0−2;
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 26 / 46
![Page 27: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/27.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 27 / 46
![Page 28: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/28.jpg)
Naive Bayes ClassifiersPosterior Predictive
if we are given a new sample x the posterior predictive is
p(y = c|x ,D) ∝ p(x |y = c,D)p(y = c|D)
with a NBC the class conditional density can factorized as
p(x |y = c,D) =D∏j=1
p(xj |y = c,D)
(since features are assumed to be conditionally independent given the class label)
combining the two above equations returns
p(y = c|x ,D) ∝ p(y = c|D)D∏j=1
p(xj |y = c,D)
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 28 / 46
![Page 29: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/29.jpg)
Naive Bayes ClassifiersPosterior Predictive
we start from this factorization and we apply the Bayesian procedure
p(y = c|x ,D) ∝ p(y = c|D)D∏j=1
p(xj |y = c,D)
first step, we integrate out the unknown π on the first factor
p(y = c|D) =
∫p(y = c,π|D)dπ =
∫p(y = c|π,D)p(π|D)dπ =
(π gives enough information to compute p(y = c)
)=
∫p(y = c|π)p(π|D)dπ
second step, we integrate out the unknowns θjc on each remaining factor
p(xj |y = c,D) =
∫p(xj , θjc |y = c,D)dθjc =
∫p(xj |θjc , y = c,D)p(θjc |y = c,D)dθjc =
(the new x is independent from D
)=
∫p(xj |θjc , y = c)p(θjc |D)dθjc
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 29 / 46
![Page 30: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/30.jpg)
Naive Bayes ClassifiersPosterior Predictive
recollecting everything together returns
p(y = c|x ,D) ∝[ ∫
p(y = c|π)p(π|D)dπ
] D∏j=1
[ ∫p(xj |θjc , y = c)p(θjc |D)dθjc
]and plugging-in the model PDFs/PMFs we adopted
p(y = c|x ,D) ∝[ ∫
Cat(y = c|π)Dir(π|N1 + α1, ...,NC + αC )dπ
]×
D∏j=1
[ ∫Ber(xj |θjc , y = c)Beta(θjc |Njc + β1, (Nc − Njc) + β0)dθjc
]=
the first part is a Dirichlet-multinomial model
the second part is a product of beta-binomial models
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 30 / 46
![Page 31: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/31.jpg)
Naive Bayes ClassifiersPosterior Predictive
doing the math again for the first part∫Cat(y = c|π)Dir(π|N1 + α1, ...,NC + αC )dπ =∫
πc Dir(π|N1 + α1, ...,NC + αC )dπ = E[πc |D] =Nc + αc
N + α0
where α0 =∑
c αc
this is exactly how we computed the posterior mean for the Dirichlet-multinomialmodel
πc = E[πc |D] =Nc + αc
N + α0
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 31 / 46
![Page 32: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/32.jpg)
Naive Bayes ClassifiersPosterior Predictive
doing the math again for the second part∫Ber(xj |θjc , y = c)Beta(θjc |Njc + β1, (Nc − Njc) + β0)dθjc =
=
∫θI(xj=1)
jc (1− θjc)I(xj=0)Beta(θjc |Njc + β1, (Nc − Njc) + β0)dθjc =
= (θjc)I(xj=1)(1− θjc)I(xj=0)
where
θjc = E[θjc |D] =Njc + β1
Nc + β0 + β1
in the above equations we first worked on xj = 1 and then on xj = 0
this is exactly how we computed the posterior mean for the beta-binomial model
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 32 / 46
![Page 33: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/33.jpg)
Naive Bayes ClassifiersPosterior Predictive
the final posterior predictive is
p(y = c|x ,D) ∝ πc
D∏j=1
(θjc)I(xj=1)(1− θjc)I(xj=0)
with the posterior means
θjc = E[θjc |D] =Njc + β1
Nc + β0 + β1
and
πc = E[πc |D] =Nc + αc
N + α0
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 33 / 46
![Page 34: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/34.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 34 / 46
![Page 35: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/35.jpg)
Naive Bayes ClassifiersPlug-in Approximation
we can approximate the posterior with a single point, i.e. p(θ|D) ≈ δθ(θ) where θcan be the MAP or the MLE
we obtain in this case a plug-in approximation
p(y = c|x ,D) ∝ πc
D∏j=1
(θjc)I(xj=1)(1− θjc)I(xj=0)
the plug-in approximation is obviously more prone to overfitting
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 35 / 46
![Page 36: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/36.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 36 / 46
![Page 37: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/37.jpg)
Naive Bayes ClassifiersLog-Sum-Exp Trick
the posterior predictive has the following form
p(y = c|x) =p(x |y = c)p(y = c)
p(x)=
p(x |y = c)p(y = c)∑c′ p(x |y = c ′)p(y = c ′)
p(x |y = c) is often a very small number, especially if x is a high-dimensionalvector, since we have to enforce
∑x′ p(x ′|y = c) = 1
this entails that a naive implementation of the posterior predictive can fail due tonumerical underflow
the obvious solution is to use logs
log p(y = c|x) = log p(x |y = c) + log p(y = c)− log p(x)
and if we define bc , log p(x |y = c) + log p(y = c), one has
log p(y = c|x) = bc − log
[∑c′
ebc′]
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 37 / 46
![Page 38: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/38.jpg)
Naive Bayes ClassifiersLog-Sum-Exp Trick
with bc , log p(x |y = c) + log p(y = c) we have
log p(y = c|x) = bc − log
[∑c′
ebc′]
now we have the problem that computing ebc′ can cause an overflow2
we can use the log-sum-exp trick in order to avoid this problem
log
[∑c
ebc]
= log
[(∑c
ebc−B
)eB]
= log
[∑c
ebc−B
]+ B
where B , maxc bc
with this trick the biggest term ebc−B equals zero
2since bc′ can be a big numberLuigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 38 / 46
![Page 39: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/39.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 39 / 46
![Page 40: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/40.jpg)
Naive Bayes ClassifiersPosterior Predictive Algorithm
the computed posterior predictive is
p(y = c|x ,D) ∝ πc
D∏j=1
(θjc)I(xj=1)(1− θjc)I(xj=0)
if we apply the log we obtain
log p(y = c|x ,D) ∝ log πc +D∑j=1
I(xj = 1) log(θjc) + I(xj = 0) log(1− θjc)
the above log-posterior is the basis for the next algorithm
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 40 / 46
![Page 41: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/41.jpg)
Naive Bayes ClassifiersPosterior Predictive Algorithm
algorithm: predicting with a naive Bayes classifier for binary features (i.e. xi ∈ {0, 1}D)
for c = 1 : C doLc := log πc ;for j = 1 : D do
if xj = 1 then
Lc := Lc + log θjcelse
Lc := Lc + log(1− θjc)end
endpc := exp(Lc − logsumexp(L1:C )); // compute p(y = c|x ,D)
endy := arg max
cpc ;
the above algorithm computes y = arg maxc
p(y = c|x ,D)
the used parameter estimate θ can be obviously best replaced with the posteriormean θ as shown in the computation of the full posterior predictive
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 41 / 46
![Page 42: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/42.jpg)
Outline
1 Naive Bayes ClassifiersBasic ConceptsClass-Conditional DistributionsLikelihoodMLEBayesian Naive BayesPriorPosteriorMAPPosterior PredictivePlug-in ApproximationLog-Sum-Exp TrickPosterior Predictive AlgorithmFeature Selection
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 42 / 46
![Page 43: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/43.jpg)
Feature SelectionBy using Mutual Information
an NBC is commonly used to fit a joint distribution over potentially many features
the NBC fitting algorithm is O(ND) where N is the dataset size and D is the sizeof x
problems: D can be very high and NBC may suffer from overfitting
a common approach to reduce these problems is to perform feature selection:
1 evaluate the relevance of each feature2 hold only the K most relevant features (K is chosen based on some tradeoff
accuracy-complexity)
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 43 / 46
![Page 44: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/44.jpg)
Feature SelectionMutual Information
correlation is a very limited measure of dependence; revise the slides aboutcorrelation and independence (lecture 3 part 2)
a more general approach is to determine how similar is a joint distribution p(X ,Y )to p(X )p(Y ) (recall the definition X ⊥ Y )
mutual information (MI)
I[X ;Y ] , KL[p(X ,Y )||p(X )p(Y )] =∑x
∑y
p(x , y) logp(x , y)
p(x)p(y)
one has I[X ;Y ] ≥ 0 with equality iff p(X ,Y ) = p(X )p(Y )
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 44 / 46
![Page 45: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/45.jpg)
Feature SelectionMutual Information
we want to measure the relevance between feature Xj and the class label Y
I[Xj ;Y ] =∑xj
∑y
p(xj , y) logp(xj , y)
p(xj)p(y)
for an NBC classifier with binary features one has (homework ex 3.21)
Ij , I[Xj ;Y ] =∑c
[θjcπc log
θjcθj
+ (1− θjc)πc log1− θjc1− θc
]where the following quantities are computed by the NBC fitting algorithm:πc = p(y = c), θjc = p(xj = 1|y = c) and θj = p(xj = 1) =
∑c πcθjc
the top K features with the highest Ij can then be selected and used
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 45 / 46
![Page 46: Lecture 4 - Generative Models for Discrete Data - Part 3nlp.jbnu.ac.kr/BML/slides_freda/lec4-part3.pdf · Generative Models for Discrete Data - Part 3 Luigi Freda ALCOR Lab DIAG University](https://reader035.vdocument.in/reader035/viewer/2022071218/604f70257185146b241f383e/html5/thumbnails/46.jpg)
Credits
Kevin Murphy’s book
Luigi Freda (”La Sapienza” University) Lecture 4 October 6, 2017 46 / 46