MAP for Gaussian mean and variance
• Conjugate priors
– Mean: Gaussian prior
– Variance: Wishart Distribution
• Prior for mean: P(μ) = N(η, λ²)
MAP for Gaussian Mean
(Assuming known variance σ²)
The MAP estimate is independent of σ² if λ² is a fixed multiple of σ², e.g. λ² = σ²/a, where the constant a then acts like a number of virtual samples.
MAP under the full Gauss-Wishart prior – Homework
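For the known-variance case, the closed form of the MAP estimate is the standard textbook result sketched below. The symbol a is only my stand-in for the constant relating λ² and σ²; the slide's own symbol is not legible in this transcript.

```latex
% MAP estimate of a Gaussian mean: data x_1,\dots,x_n \sim N(\mu,\sigma^2) with
% \sigma^2 known and prior \mu \sim N(\eta,\lambda^2).
\[
\hat{\mu}_{\mathrm{MAP}}
  = \frac{\frac{1}{\sigma^2}\sum_{i=1}^{n} x_i + \frac{1}{\lambda^2}\,\eta}
         {\frac{n}{\sigma^2} + \frac{1}{\lambda^2}}
  \;=\; \frac{\sum_{i=1}^{n} x_i + a\,\eta}{n + a}
  \quad \text{if } \lambda^2 = \sigma^2/a .
\]
```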
Bayes Optimal Classifier
Aarti Singh
Machine Learning 10-701/15-781, Sept 15, 2010
Goal: Classification
Features, X → Labels, Y (e.g., Sports, Science, News)
Probability of error: P(f(X) ≠ Y)
Optimal Classification
Optimal predictor (Bayes classifier): f* = arg min_f P(f(X) ≠ Y)
• Even the optimal classifier makes mistakes: the Bayes risk R(f*) > 0
• The optimal classifier depends on the unknown distribution P(X, Y)
Optimal Classifier
Bayes Rule:
  P(Y=y | X=x) = P(X=x | Y=y) P(Y=y) / P(X=x)
Optimal classifier:
  f*(x) = arg max_y P(X=x | Y=y) P(Y=y)
Here P(X=x | Y=y) is the class conditional density and P(Y=y) is the class prior.
Example Decision Boundaries
• Gaussian class conditional densities (1-dimension/feature)
Decision Boundary
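A small numerical sketch of such a decision boundary. The class priors, means, and variances below are made up for illustration; none of these numbers come from the slide.

```python
import numpy as np
from scipy.stats import norm

# Hypothetical 1-D example: two classes with Gaussian class-conditional
# densities and known priors. The Bayes classifier picks the class with
# the larger P(X=x | Y=y) * P(Y=y).
priors = {0: 0.5, 1: 0.5}                 # class priors P(Y=y)
params = {0: (-1.0, 1.0), 1: (2.0, 1.5)}  # (mean, std) of p(x | Y=y)

def bayes_classify(x):
    """Return the class maximizing class-conditional density times prior."""
    scores = {y: norm.pdf(x, loc=mu, scale=sd) * priors[y]
              for y, (mu, sd) in params.items()}
    return max(scores, key=scores.get)

# The decision boundary is where the two weighted densities are equal;
# locate it numerically by scanning for label changes.
xs = np.linspace(-6, 8, 2001)
labels = np.array([bayes_classify(x) for x in xs])
boundary = xs[np.where(np.diff(labels) != 0)[0]]
print("approximate decision boundary(ies):", boundary)
```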
Example Decision Boundaries
• Gaussian class conditional densities (2-dimensions/features)
Decision Boundary
Learning the Optimal Classifier
Optimal classifier: f*(x) = arg max_y P(X=x | Y=y) P(Y=y)
Need to know:
– the class prior P(Y = y) for all y
– the class conditional density (likelihood) P(X=x | Y = y) for all x, y
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable
Training Data: n rows, each X = (X1, X2, X3, …, Xd) with a label Y
Let's learn P(Y|X) – how many parameters?
– Prior: P(Y = y) for all y → K − 1 parameters if K labels
– Likelihood: P(X=x | Y = y) for all x, y → (2^d − 1)K parameters if d binary features
Learning the Optimal Classifier
Task: Predict whether or not a picnic spot is enjoyable
Training Data: n rows, each X = (X1, X2, X3, …, Xd) with a label Y
Let's learn P(Y|X) – how many parameters?
2^d K − 1 in total (K classes, d binary features)
Need n ≫ 2^d K − 1 training examples to learn all the parameters – e.g., with d = 30 binary features and K = 2 classes that is already over two billion.
Conditional Independence
• X is conditionally independent of Y given Z: the probability distribution governing X is independent of the value of Y, given the value of Z, i.e.
  P(X=x | Y=y, Z=z) = P(X=x | Z=z) for all x, y, z
• Equivalent to: P(X, Y | Z) = P(X | Z) P(Y | Z)
• e.g., P(Thunder | Rain, Lightning) = P(Thunder | Lightning)
  Note: this does NOT mean Thunder is independent of Rain
Conditional vs. Marginal Independence
• C calls A and B separately and tells them a number n ∈ {1,…,10}
• Due to noise in the phone, A and B each imperfectly (and independently) draw a conclusion about what the number was.
• A thinks the number was na and B thinks it was nb.
• Are na and nb marginally independent?
– No, we expect e.g. P(na = 1 | nb = 1) > P(na = 1)
• Are na and nb conditionally independent given n?
– Yes, because if we know the true number, the outcomes na and nb are determined purely by the noise in each phone:
  P(na = 1 | nb = 1, n = 2) = P(na = 1 | n = 2)
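A quick simulation sketch of this example. The uniform ±1 phone noise is an assumption made only for illustration; the slide does not specify a noise model.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulate the phone example: C picks n uniformly in {1,...,10}; A and B each
# hear it through independent noisy channels (assumed here: uniform +/-1 noise).
N = 200_000
n = rng.integers(1, 11, size=N)
noise_a = rng.integers(-1, 2, size=N)
noise_b = rng.integers(-1, 2, size=N)
na = n + noise_a
nb = n + noise_b

# Marginally dependent: knowing nb shifts our belief about na.
p_na1 = np.mean(na == 1)
p_na1_given_nb1 = np.mean(na[nb == 1] == 1)
print(f"P(na=1)            ~ {p_na1:.3f}")
print(f"P(na=1 | nb=1)     ~ {p_na1_given_nb1:.3f}   (noticeably larger)")

# Conditionally independent given n: once n is fixed, nb tells us nothing new.
mask = n == 2
p_na1_given_n2 = np.mean(na[mask] == 1)
p_na1_given_n2_nb1 = np.mean(na[mask & (nb == 1)] == 1)
print(f"P(na=1 | n=2)       ~ {p_na1_given_n2:.3f}")
print(f"P(na=1 | n=2, nb=1) ~ {p_na1_given_n2_nb1:.3f}   (approximately equal)")
```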
Prediction using Conditional Independence
• Predict Lightning from two conditionally independent features:
  – Thunder
  – Rain
• Number of parameters needed to learn the likelihood P(T, R | L):
  (2² − 1)·2 = 6
• With the conditional independence assumption, P(T, R | L) = P(T | L) P(R | L):
  (2 − 1)·2 + (2 − 1)·2 = 4
Naïve Bayes Assumption
• Naïve Bayes assumption:
  – Features are independent given the class: P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)
  – More generally: P(X1, …, Xd | Y) = ∏i P(Xi | Y)
• How many parameters now?
  – Suppose X is composed of d binary features: (2 − 1)dK vs. (2^d − 1)K
Naïve Bayes Classifier
• Given:
  – Class prior P(Y)
  – d conditionally independent features X given the class Y
  – For each Xi, the likelihood P(Xi | Y)
• Decision rule: pick the class that maximizes the prior times the product of the per-feature likelihoods (written out below)
• If the conditional independence assumption holds, NB is the optimal classifier; otherwise it can do worse.
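Written out, the decision rule referred to above is the standard Naïve Bayes rule; the log form on the second line is a common implementation detail for numerical stability, not something stated on the slide.

```latex
% Naive Bayes decision rule; the second equality is the equivalent log-space
% form commonly used in implementations to avoid floating-point underflow.
\[
f_{\mathrm{NB}}(x)
  = \arg\max_{y}\; P(Y=y) \prod_{i=1}^{d} P(X_i = x_i \mid Y = y)
  = \arg\max_{y}\; \Big[ \log P(Y=y) + \sum_{i=1}^{d} \log P(X_i = x_i \mid Y = y) \Big].
\]
```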
Naïve Bayes Algo – Discrete features
• Training Data
• Maximum Likelihood Estimates
  – Class prior: P̂(Y = b) = (# training examples with Y = b) / n
  – Likelihood: P̂(Xi = a | Y = b) = (# training examples with Xi = a and Y = b) / (# training examples with Y = b)
• NB prediction for test data: apply the decision rule with these plug-in estimates
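A minimal sketch of this algorithm for discrete features. The data format (a list of (feature-tuple, label) pairs) and the toy corpus are assumptions made for illustration only.

```python
from collections import Counter, defaultdict

def train_nb_mle(data):
    """Maximum likelihood estimates for discrete-feature Naive Bayes."""
    n = len(data)
    class_counts = Counter(y for _, y in data)
    prior = {y: c / n for y, c in class_counts.items()}     # P(Y=y)
    feat_counts = defaultdict(Counter)                       # (i, y) -> value counts
    for x, y in data:
        for i, v in enumerate(x):
            feat_counts[(i, y)][v] += 1
    def likelihood(i, v, y):                                  # P(Xi=v | Y=y)
        return feat_counts[(i, y)][v] / class_counts[y]
    return prior, likelihood

def predict_nb(prior, likelihood, x):
    # argmax_y P(y) * prod_i P(x_i | y); a zero count zeroes the whole product,
    # which is exactly the problem the MAP/smoothing slide fixes.
    def score(y):
        s = prior[y]
        for i, v in enumerate(x):
            s *= likelihood(i, v, y)
        return s
    return max(prior, key=score)

# toy usage with two binary features
data = [((1, 0), "spam"), ((1, 1), "spam"), ((0, 0), "ham"), ((0, 1), "ham")]
prior, lik = train_nb_mle(data)
print(predict_nb(prior, lik, (1, 0)))   # -> "spam"
```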
Subtlety 1 – Violation of NB Assumption
• Usually, features are not conditionally independent: P(X1, …, Xd | Y) ≠ ∏i P(Xi | Y)
• The actual probabilities P(Y|X) are then often biased towards 0 or 1 (why?)
• Nonetheless, NB is the single most used classifier out there
  – NB often performs well, even when the assumption is violated
  – [Domingos & Pazzani ’96] discuss some conditions for good performance
Subtlety 2 – Insufficient training data
• What if you never see a training instance where X1 = a when Y = b?
  – e.g., Y = SpamEmail, X1 = ‘Earn’
  – Then the MLE gives P(X1 = a | Y = b) = 0
• Thus, no matter what values X2, …, Xd take:
  – P(Y = b | X1 = a, X2, …, Xd) = 0
• What now???
MLE vs. MAP
What if we toss the coin too few times?
• You say: probability the next toss is a head = 0
• Billionaire says: you're fired! …with probability 1
• A Beta prior is equivalent to extra coin flips
• As N → ∞, the prior is “forgotten”
• But for small sample sizes, the prior is important!
Naïve Bayes Algo – Discrete features
• Training Data
• Maximum A Posteriori estimates – add m “virtual” examples
  – Assume priors over the parameters; m plays the role of the number of virtual examples with Y = b
  – MAP estimate: sketched below
• Now, even if you never observe a feature value together with a class, the estimated probability is never zero.
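A sketch of the resulting smoothed estimate in its usual m-estimate form; the slide's exact prior notation is not reproduced in this transcript, so m and p below are generic symbols.

```latex
% m-estimate / additive smoothing form of the MAP likelihood estimate:
% m is the number of virtual examples and p the prior probability of X_i = a
% (uniform p together with m = number of values of X_i gives Laplace add-one smoothing).
\[
\hat{P}(X_i = a \mid Y = b)
  = \frac{\#\{X_i = a,\, Y = b\} + m\,p}{\#\{Y = b\} + m}.
\]
```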
Case Study: Text Classification
• Classify e-mails
– Y = {Spam,NotSpam}
• Classify news articles
– Y = {what is the topic of the article?}
• Classify webpages
– Y = {Student, professor, project, …}
• What about the features X?
– The text!
Features X are the entire document – Xi is the ith word in the article.
NB for Text Classification
• P(X|Y) is huge!
  – An article has at least 1000 words, X = {X1, …, X1000}
  – Xi represents the ith word in the document, i.e., the domain of Xi is the entire vocabulary, e.g., Webster’s Dictionary (or more), 10,000 words, etc.
• The NB assumption helps a lot!
  – P(Xi=xi | Y=y) is just the probability of observing word xi at the ith position in a document on topic y
Bag of words model
• Typical additional assumption – position in the document doesn’t matter: P(Xi=xi | Y=y) = P(Xk=xi | Y=y)
  – “Bag of words” model – the order of words on the page is ignored
  – Sounds really silly, but often works very well!
When the lecture is over, remember to wake up the person sitting next to you in the lecture room.
Bag of words model
• Typical additional assumption – position in the document doesn’t matter: P(Xi=xi | Y=y) = P(Xk=xi | Y=y)
  – “Bag of words” model – the order of words on the page is ignored
  – Sounds really silly, but often works very well!
in is lecture lecture next over person remember room sitting the the the to to up wake when you
Bag of words approach
aardvark 0
about 2
all 2
Africa 1
apple 0
anxious 0
...
gas 1
...
oil 1
…
Zaire 0
NB with Bag of Words for text classification
• Learning phase:
– Class Prior P(Y)
– P(Xi|Y)
• Test phase:
– For each document
• Use naïve Bayes decision rule
Explore in HW
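A minimal bag-of-words NB sketch along these lines, with add-1 smoothing and a tiny made-up corpus; the HW presumably uses real data, so this is only illustrative.

```python
import math
from collections import Counter, defaultdict

# Bag-of-words Naive Bayes for text with add-1 (Laplace) smoothing.
# Data format assumed here: docs is a list of (text, label) pairs.
def train_bow_nb(docs):
    class_doc_counts = Counter(label for _, label in docs)
    word_counts = defaultdict(Counter)      # label -> word -> count
    total_words = Counter()                 # label -> total word count
    vocab = set()
    for text, label in docs:
        for w in text.lower().split():
            word_counts[label][w] += 1
            total_words[label] += 1
            vocab.add(w)
    n_docs = len(docs)
    log_prior = {y: math.log(c / n_docs) for y, c in class_doc_counts.items()}
    return log_prior, word_counts, total_words, vocab

def classify_bow_nb(model, text):
    log_prior, word_counts, total_words, vocab = model
    V = len(vocab)
    def log_score(y):
        s = log_prior[y]
        for w in text.lower().split():
            # add-1 smoothing so unseen words don't zero out the product
            s += math.log((word_counts[y][w] + 1) / (total_words[y] + V))
        return s
    return max(log_prior, key=log_score)

docs = [("win money now", "spam"), ("meeting agenda today", "ham"),
        ("cheap money offer", "spam"), ("project meeting notes", "ham")]
model = train_bow_nb(docs)
print(classify_bow_nb(model, "money offer today"))   # -> "spam"
```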
Twenty news groups results
Learning curve for twenty news groups
What if features are continuous?
E.g., character recognition: Xi is the intensity at the ith pixel
Gaussian Naïve Bayes (GNB): each class-conditional feature distribution is a Gaussian, with a different mean and variance for each class k and each pixel i (see below).
Sometimes the variance is assumed to be
• independent of Y (i.e., σi),
• independent of Xi (i.e., σk),
• or both (i.e., σ)
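The slide's formula is not reproduced in this transcript; the standard per-pixel, per-class Gaussian likelihood it refers to is:

```latex
% Per-pixel, per-class Gaussian likelihood assumed by GNB.
\[
P(X_i = x \mid Y = k)
  = \frac{1}{\sqrt{2\pi}\,\sigma_{ik}}
    \exp\!\left( -\frac{(x - \mu_{ik})^2}{2\sigma_{ik}^2} \right).
\]
```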
Estimating parameters: Y discrete, Xi continuous
Maximum likelihood estimates: for each class k, the mean and variance of the ith pixel are estimated from the jth training images belonging to that class (sketched below).
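A sketch of the standard MLE formulas these labels refer to, where x_{ij} denotes the ith pixel of the jth training image, y_j its class label, and n_k the number of training images of class k.

```latex
% MLE for the GNB parameters (per pixel i and class k).
\[
\hat{\mu}_{ik} = \frac{1}{n_k} \sum_{j:\, y_j = k} x_{ij},
\qquad
\hat{\sigma}_{ik}^{2} = \frac{1}{n_k} \sum_{j:\, y_j = k} \left( x_{ij} - \hat{\mu}_{ik} \right)^{2}.
\]
```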
Example: GNB for classifying mental states
~1 mm resolution
~2 images per sec.
15,000 voxels/image
non-invasive, safe
measures Blood Oxygen Level Dependent (BOLD) response
[Mitchell et al.]
Gaussian Naïve Bayes: learned means μ_{voxel,word}
[Mitchell et al.]
15,000 voxels or features
10 training examples or subjects per class
Learned Naïve Bayes Models – Means for P(BrainActivity | WordCategory)
Animal words / People words
Pairwise classification accuracy: 85% [Mitchell et al.]
What you should know…
• Optimal decision using Bayes Classifier
• Naïve Bayes classifier
  – What’s the assumption
  – Why we use it
  – How do we learn it
  – Why is Bayesian estimation important
• Text classification
  – Bag of words model
• Gaussian NB
  – Features are still conditionally independent
  – Each feature has a Gaussian distribution given class