bayesian learning
![Page 1: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/1.jpg)
Bayesian Learning
![Page 2: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/2.jpg)
Conditional Probability
• Probability of an event given the occurrence of some other event.
• E.g., consider choosing a card from a well-shuffled standard deck of 52 playing cards. Given that the first card chosen is an ace, what is the probability that the second card chosen will be an ace?
![Page 3: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/3.jpg)
Event space = all possible pairs of cards
[Diagram: events X and Y within the event space]
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X, Y)}{P(Y)}$$
![Page 4: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/4.jpg)
Event space = all possible pairs of cards
Y = First card is Ace
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X, Y)}{P(Y)}$$
![Page 5: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/5.jpg)
Event space = all possible pairs of cards
Y = First card is Ace
X = Second card is Ace
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X, Y)}{P(Y)}$$
![Page 6: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/6.jpg)
P(Y) = 4 / 52
P(X, Y) = (# possible ordered pairs of aces) / (total # of ordered pairs)
= (4 × 3) / (52 × 51) = 12/2652.
P(X | Y) = (12/2652) / (4 / 52) = 3/51.
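The calculation above can be checked by brute-force enumeration. The following is only an illustrative sketch, not part of the original slides: the deck encoding (rank 1 standing for the ace) and the variable names are assumptions. It enumerates every ordered pair of distinct cards and counts with exact fractions.

```python
from fractions import Fraction
from itertools import permutations

# Encode the deck as (rank, suit) pairs; rank 1 stands for the ace (assumed encoding).
deck = [(rank, suit) for rank in range(1, 14) for suit in range(4)]

# Event space: all ordered pairs of distinct cards (first draw, second draw).
pairs = list(permutations(deck, 2))

first_is_ace = [p for p in pairs if p[0][0] == 1]        # event Y
both_aces = [p for p in first_is_ace if p[1][0] == 1]    # event X and Y

p_y = Fraction(len(first_is_ace), len(pairs))
p_xy = Fraction(len(both_aces), len(pairs))
p_x_given_y = p_xy / p_y

print(p_y, p_xy, p_x_given_y)   # 1/13 1/221 1/17, i.e. 4/52, 12/2652, 3/51
```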
![Page 7: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/7.jpg)
Deriving Bayes Rule
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)} = \frac{P(X, Y)}{P(Y)}$$
$$P(X, Y) = P(X \mid Y)\,P(Y) = P(Y \mid X)\,P(X)$$
Bayes rule:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
![Page 8: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/8.jpg)
Bayesian Learning
![Page 9: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/9.jpg)
Application to Machine Learning
• In machine learning we have a space H of hypotheses: h1 , h2 , ... , hn
• We also have a set D of data
• We want to calculate P(h | D)
![Page 10: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/10.jpg)
Terminology
– Prior probability of h:
  • P(h): Probability that hypothesis h is true given our prior knowledge.
  • If there is no prior knowledge, all h ∈ H are equally probable.
– Posterior probability of h:
  • P(h | D): Probability that hypothesis h is true, given the data D.
– Likelihood of D:
  • P(D | h): Probability that we will see data D, given that hypothesis h is true.
![Page 11: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/11.jpg)
Bayes Rule: Machine Learning Formulation
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
![Page 12: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/12.jpg)
Example
![Page 13: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/13.jpg)
The Monty Hall Problem
You are a contestant on a game show. There are 3 doors, A, B, and C. There is a new car behind one of them and goats behind the other two.
Monty Hall, the host, asks you to pick a door, any door. You pick door A.
Monty tells you he will open a door, different from A, that has a goat behind it. He opens door B: behind it there is a goat.
Monty now gives you a choice: Stick with your original choice A or switch to C. Should you switch?
http://math.ucsd.edu/~crypto/Monty/monty.html
![Page 14: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/14.jpg)
Bayesian probability formulation
Hypothesis space H:
h1 = Car is behind door A
h2 = Car is behind door C
h3 = Car is behind door B
Data D = Monty opened B
What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
![Page 15: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/15.jpg)
Event space
Event space = All possible configurations of cars and goats behind doors A, B, C
Y = Goat behind door B
X = Car behind door A
![Page 16: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/16.jpg)
Event space
Y = Goat behind door B
X = Car behind door A
$$P(X \mid Y) = \frac{P(X \cap Y)}{P(Y)}$$
Bayes Rule:
$$P(X \mid Y) = \frac{P(Y \mid X)\,P(X)}{P(Y)}$$
![Page 17: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/17.jpg)
Using Bayes’ Rule to solve the Monty Hall problem
You pick door A. Data D = Monty opened door B.
Hypothesis space H:
h1 = Car is behind door A
h2 = Car is behind door C
h3 = Car is behind door B
What is P(h1 | D)? What is P(h2 | D)? What is P(h3 | D)?
Prior probability: P(h1) = 1/3, P(h2) = 1/3, P(h3) = 1/3
Likelihood: P(D | h1) = 1/2, P(D | h2) = 1, P(D | h3) = 0
P(D) = P(D | h1) P(h1) + P(D | h2) P(h2) + P(D | h3) P(h3) = 1/6 + 1/3 + 0 = 1/2
By Bayes rule:
P(h1 | D) = P(D | h1) P(h1) / P(D) = (1/2 × 1/3) / (1/2) = 1/3
P(h2 | D) = P(D | h2) P(h2) / P(D) = (1 × 1/3) / (1/2) = 2/3
P(h3 | D) = P(D | h3) P(h3) / P(D) = (0 × 1/3) / (1/2) = 0
So you should switch!
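The same posterior calculation can be written as a few lines of code. This is a minimal sketch under the slide's assumptions (you picked A, Monty opened B, and Monty chooses at random when both remaining doors hide goats); the function name `posterior` and the dictionary layout are illustrative choices, not from the slides.

```python
from fractions import Fraction

# Hypotheses: which door hides the car (you picked A, Monty opened B).
priors = {"A": Fraction(1, 3), "C": Fraction(1, 3), "B": Fraction(1, 3)}

# Likelihood of D = "Monty opened B" under each hypothesis.
likelihoods = {"A": Fraction(1, 2), "C": Fraction(1), "B": Fraction(0)}

def posterior(priors, likelihoods):
    """Return P(h | D) for every hypothesis h via Bayes' rule."""
    evidence = sum(likelihoods[h] * priors[h] for h in priors)  # P(D)
    return {h: likelihoods[h] * priors[h] / evidence for h in priors}

post = posterior(priors, likelihoods)
for door, p in post.items():
    print(f"P(car behind {door} | D) = {p}")
```

Switching to C wins with probability 2/3, matching the hand calculation.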
![Page 18: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/18.jpg)
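MAP (“maximum a posteriori”) Learning
Bayes rule:
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
Goal of learning: Find maximum a posteriori hypothesis hMAP:
$$h_{MAP} = \arg\max_{h \in H} P(h \mid D) = \arg\max_{h \in H} \frac{P(D \mid h)\,P(h)}{P(D)} = \arg\max_{h \in H} P(D \mid h)\,P(h)$$
because P(D) is a constant independent of h.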
![Page 19: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/19.jpg)
Note: If every h ∈ H is equally probable, then
$$h_{MAP} = \arg\max_{h \in H} P(D \mid h)$$
This is called the “maximum likelihood hypothesis”.
![Page 20: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/20.jpg)
A Medical Example
Toby takes a test for leukemia. The test has two outcomes: positive and negative. It is known that if the patient has leukemia, the test is positive 98% of the time. If the patient does not have leukemia, the test is positive 3% of the time. It is also known that 0.008 of the population has leukemia.
Toby’s test is positive.
Which is more likely: Toby has leukemia or Toby does not have leukemia?
![Page 21: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/21.jpg)
• Hypothesis space:
  h1 = T. has leukemia
  h2 = T. does not have leukemia
• Prior: 0.008 of the population has leukemia. Thus
  P(h1) = 0.008
  P(h2) = 0.992
• Likelihood:
  P(+ | h1) = 0.98, P(− | h1) = 0.02
  P(+ | h2) = 0.03, P(− | h2) = 0.97
• Posterior knowledge: Blood test is + for this patient.
![Page 22: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/22.jpg)
• In summary
P(h1) = 0.008, P(h2) = 0.992
P(+ | h1) = 0.98, P(− | h1) = 0.02
P(+ | h2) = 0.03, P(− | h2) = 0.97
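• Thus:
P(+ | h1) P(h1) = 0.98 × 0.008 = 0.00784
P(+ | h2) P(h2) = 0.03 × 0.992 = 0.02976
hMAP = h2 (Toby does not have leukemia)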
![Page 23: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/23.jpg)
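• What is P(leukemia | +)?
$$P(h \mid D) = \frac{P(D \mid h)\,P(h)}{P(D)}$$
So,
P(leukemia | +) = (0.98 × 0.008) / (0.98 × 0.008 + 0.03 × 0.992) = 0.00784 / 0.0376 ≈ 0.21
P(¬leukemia | +) = (0.03 × 0.992) / (0.98 × 0.008 + 0.03 × 0.992) = 0.02976 / 0.0376 ≈ 0.79
These are called the “posterior” probabilities.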
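The medical example also shows how the MAP and maximum likelihood hypotheses can disagree. The sketch below uses only the numbers given in the slides; the dictionary names are illustrative assumptions.

```python
# Numbers from the slides: prior and likelihood of a positive test.
prior = {"leukemia": 0.008, "no leukemia": 0.992}
p_positive_given = {"leukemia": 0.98, "no leukemia": 0.03}

# Unnormalized posterior scores P(+ | h) P(h).
score = {h: p_positive_given[h] * prior[h] for h in prior}
evidence = sum(score.values())                            # P(+)
posterior = {h: s / evidence for h, s in score.items()}

h_map = max(score, key=score.get)                         # uses the prior
h_ml = max(p_positive_given, key=p_positive_given.get)    # ignores the prior

print(posterior)
print("MAP:", h_map, "| ML:", h_ml)
```

The maximum likelihood hypothesis is "leukemia" (0.98 > 0.03), but because the disease is rare, the MAP hypothesis is "no leukemia".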
![Page 24: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/24.jpg)
In-Class Exercise
Suppose you receive an e-mail message with the subject “Hi”. You have been keeping statistics on your e-mail, and have found that while only 10% of the total e-mail messages you receive are spam, 50% of the spam messages have the subject “Hi” and 2% of the non-spam messages have the subject “Hi”. What is the probability that the message is spam?
![Page 25: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/25.jpg)
Bayesianism vs. Frequentism
• Classical probability: Frequentists
  – Probability of a particular event is defined relative to its frequency in a sample space of events.
  – E.g., probability of “the coin will come up heads on the next trial” is defined relative to the frequency of heads in a sample space of coin tosses.
• Bayesian probability:
  – Combine measure of “prior” belief you have in a proposition with your subsequent observations of events.
• Example: A Bayesian can assign a probability to the statement “There was life on Mars a billion years ago” but a frequentist cannot.
![Page 26: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/26.jpg)
Independence and Conditional Independence
• Two random variables, X and Y, are independent if
$$P(X, Y) = P(X)\,P(Y)$$
• Two random variables, X and Y, are conditionally independent given C if
$$P(X, Y \mid C) = P(X \mid C)\,P(Y \mid C)$$
• Examples?
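One possible example, sketched in code: two variables that are conditionally independent given a third but not independent on their own. The model and its numbers are hypothetical, chosen only to illustrate the two definitions. C is a fair coin, and X and Y are each 1 with probability 1/4 when C = 0 and 3/4 when C = 1.

```python
from fractions import Fraction as F
from itertools import product

# Hypothetical model: C is a fair coin; X and Y are drawn independently given C.
p_c = {0: F(1, 2), 1: F(1, 2)}
p_one_given_c = {0: F(1, 4), 1: F(3, 4)}   # P(X=1 | C) and P(Y=1 | C), assumed equal

def bern(p_one, value):
    """Probability that a 0/1 variable with P(1) = p_one takes `value`."""
    return p_one if value == 1 else 1 - p_one

# Full joint distribution P(C, X, Y).
joint = {
    (c, x, y): p_c[c] * bern(p_one_given_c[c], x) * bern(p_one_given_c[c], y)
    for c, x, y in product((0, 1), repeat=3)
}

def prob(pred):
    """Sum the joint probability of all outcomes (c, x, y) satisfying pred."""
    return sum(p for (c, x, y), p in joint.items() if pred(c, x, y))

def independent():
    """Check P(X, Y) = P(X) P(Y) for every value pair."""
    return all(
        prob(lambda c, x, y: (x, y) == (xv, yv))
        == prob(lambda c, x, y: x == xv) * prob(lambda c, x, y: y == yv)
        for xv, yv in product((0, 1), repeat=2)
    )

def conditionally_independent():
    """Check P(X, Y | C) = P(X | C) P(Y | C) for every value triple."""
    for cv, xv, yv in product((0, 1), repeat=3):
        pc = prob(lambda c, x, y: c == cv)
        pxy = prob(lambda c, x, y: (c, x, y) == (cv, xv, yv)) / pc
        px = prob(lambda c, x, y: (c, x) == (cv, xv)) / pc
        py = prob(lambda c, x, y: (c, y) == (cv, yv)) / pc
        if pxy != px * py:
            return False
    return True

print(independent(), conditionally_independent())   # False True
```

Knowing X tells you something about C, and therefore about Y; once C is known, X carries no further information about Y.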
![Page 27: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/27.jpg)
Naive Bayes Classifier
Let f(x) be a target function for classification: f(x) ∈ {+1, −1}.
Let x = <x1, x2, ..., xn>
We want to find the most probable class value, hMAP, given the data x:
$$h_{MAP} = \arg\max_{\text{class} \in \{+1, -1\}} P(\text{class} \mid x_1, x_2, \ldots, x_n)$$
![Page 28: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/28.jpg)
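By Bayes Theorem:
$$h_{MAP} = \arg\max_{\text{class}} \frac{P(x_1, x_2, \ldots, x_n \mid \text{class})\,P(\text{class})}{P(x_1, x_2, \ldots, x_n)} = \arg\max_{\text{class}} P(x_1, x_2, \ldots, x_n \mid \text{class})\,P(\text{class})$$
P(class) can be estimated from the training data. How?
However, in general, it is not practical to use training data to estimate P(x1, x2, ..., xn | class). Why not?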
![Page 29: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/29.jpg)
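• Naive Bayes classifier: Assume
$$P(x_1, x_2, \ldots, x_n \mid \text{class}) = P(x_1 \mid \text{class})\,P(x_2 \mid \text{class}) \cdots P(x_n \mid \text{class})$$
Is this a good assumption?
Given this assumption, here’s how to classify an instance x = <x1, x2, ..., xn>:
Naive Bayes classifier:
$$\text{class}_{NB}(x) = \arg\max_{\text{class}} P(\text{class}) \prod_{i} P(x_i \mid \text{class})$$
Estimate the values of these various probabilities over the training set.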
![Page 30: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/30.jpg)
Training data:

| Day | Outlook | Temp | Humidity | Wind | PlayTennis |
|---|---|---|---|---|---|
| D1 | Sunny | Hot | High | Weak | No |
| D2 | Sunny | Hot | High | Strong | No |
| D3 | Overcast | Hot | High | Weak | Yes |
| D4 | Rain | Mild | High | Weak | Yes |
| D5 | Rain | Cool | Normal | Weak | Yes |
| D6 | Rain | Cool | Normal | Strong | No |
| D7 | Overcast | Cool | Normal | Strong | Yes |
| D8 | Sunny | Mild | High | Weak | No |
| D9 | Sunny | Cool | Normal | Weak | Yes |
| D10 | Rain | Mild | Normal | Weak | Yes |
| D11 | Sunny | Mild | Normal | Strong | Yes |
| D12 | Overcast | Mild | High | Strong | Yes |
| D13 | Overcast | Hot | Normal | Weak | Yes |
| D14 | Rain | Mild | High | Strong | No |

Test data: D15 Sunny Cool High Strong ?
![Page 31: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/31.jpg)
In practice, use training data to compute a probabilistic model:
![Page 32: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/32.jpg)
Estimating probabilities
• Recap: In previous example, we had a training set and a new example,
<Outlook=sunny, Temperature=cool, Humidity=high, Wind=strong>
• We asked: What classification is given by a naive Bayes classifier?
• Let n(c) be the number of training instances with class c, and n(xi = ai, c) be the number of training instances with attribute value xi = ai and class c. Then
$$P(x_i = a_i \mid c) = \frac{n(x_i = a_i, c)}{n(c)}$$
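The counting estimate can be turned into a small classifier for the PlayTennis data. This is a minimal sketch, not the slides' own code: the names `class_count`, `pair_count`, and `score` are illustrative, and the rows are the 14 training examples D1 to D14 in order.

```python
from collections import Counter
from fractions import Fraction

ATTRS = ("Outlook", "Temp", "Humidity", "Wind")
ROWS = """
Sunny Hot High Weak No
Sunny Hot High Strong No
Overcast Hot High Weak Yes
Rain Mild High Weak Yes
Rain Cool Normal Weak Yes
Rain Cool Normal Strong No
Overcast Cool Normal Strong Yes
Sunny Mild High Weak No
Sunny Cool Normal Weak Yes
Rain Mild Normal Weak Yes
Sunny Mild Normal Strong Yes
Overcast Mild High Strong Yes
Overcast Hot Normal Weak Yes
Rain Mild High Strong No
"""
data = [line.split() for line in ROWS.strip().splitlines()]

class_count = Counter(row[-1] for row in data)            # n(c)
pair_count = Counter(                                     # n(x_i = a_i, c)
    (i, row[i], row[-1]) for row in data for i in range(len(ATTRS))
)

def score(x, c):
    """P(c) * prod_i P(x_i | c), with probabilities estimated by counting."""
    s = Fraction(class_count[c], len(data))
    for i, value in enumerate(x):
        s *= Fraction(pair_count[(i, value, c)], class_count[c])
    return s

x = ("Sunny", "Cool", "High", "Strong")                   # test instance D15
scores = {c: score(x, c) for c in class_count}
print(scores, max(scores, key=scores.get))
```

For D15 the scores are 1/189 (about 0.0053) for Yes and 18/875 (about 0.0206) for No, so the classifier predicts No.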
![Page 33: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/33.jpg)
• Problem with this method: If n(c) is very small, this gives a poor estimate.
• E.g., P(Outlook = Overcast | no) = 0.
![Page 34: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/34.jpg)
• Now suppose we want to classify a new instance: <Outlook=overcast, Temperature=cool, Humidity=high, Wind=strong>. Then:
P(no) P(overcast | no) P(cool | no) P(high | no) P(strong | no) = (5/14)(0)(1/5)(4/5)(3/5) = 0
This incorrectly gives us zero probability due to small sample.
![Page 35: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/35.jpg)
One solution: Laplace smoothing (also called “add-one” smoothing)
For each class cj and attribute xi with value ai, add one “virtual” instance.
That is, recalculate:
$$P(x_i = a_i \mid c_j) = \frac{n(x_i = a_i, c_j) + 1}{n(c_j) + k}$$
where k is the number of possible values of attribute xi.
![Page 36: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/36.jpg)
Add virtual instances for Outlook:
Outlook=Sunny: Yes   Outlook=Overcast: Yes   Outlook=Rain: Yes
Outlook=Sunny: No    Outlook=Overcast: No    Outlook=Rain: No
P(Outlook=Overcast | No) = 0/5 → (0 + 1) / (5 + 3) = 1/8
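The add-one estimate is easy to compute once the counts are tabulated. In this sketch the Outlook counts are read off the 14 training examples; the `counts` layout and the function name `smoothed` are illustrative assumptions.

```python
from fractions import Fraction

# Outlook counts per class, read off the 14 training examples.
counts = {
    "Yes": {"Sunny": 2, "Overcast": 4, "Rain": 3},
    "No": {"Sunny": 3, "Overcast": 0, "Rain": 2},
}

def smoothed(value, c, counts):
    """Add-one estimate: (n(x=value, c) + 1) / (n(c) + k)."""
    n_c = sum(counts[c].values())      # n(c)
    k = len(counts[c])                 # number of possible attribute values
    return Fraction(counts[c][value] + 1, n_c + k)

print(smoothed("Overcast", "No", counts))   # 1/8
```

The smoothed estimates for each class still sum to 1 over the three Outlook values, so they remain a valid distribution.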
![Page 37: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/37.jpg)
Etc.
![Page 38: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/38.jpg)
In-class exercise 2
![Page 39: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/39.jpg)
Naive Bayes on continuous-valued attributes
• How to deal with continuous-valued attributes?
Two possible solutions:
– Discretize
– Assume particular probability distribution of classes over values (estimate parameters from training data)
![Page 40: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/40.jpg)
Simplest discretization method
For each attribute xi , create k equal-size bins in interval from min(xi ) to max(xi).
Choose thresholds in between bins.
Humidity: 25, 38, 50, 80, 90, 92, 96, 99
Thresholds: 40, 80, 120
P(Humidity < 40 | yes)   P(40 <= Humidity < 80 | yes)   P(80 <= Humidity < 120 | yes)
P(Humidity < 40 | no)    P(40 <= Humidity < 80 | no)    P(80 <= Humidity < 120 | no)
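Assigning values to threshold-delimited bins can be sketched with a binary search. The thresholds and humidity values are taken from the slide; the helper name `bin_index` is an illustrative assumption.

```python
from bisect import bisect_right
from collections import Counter

humidity = [25, 38, 50, 80, 90, 92, 96, 99]
thresholds = [40, 80, 120]          # upper edges of the three bins on the slide

def bin_index(value, thresholds):
    """Index of the half-open bin [t_{i-1}, t_i) that contains value."""
    return bisect_right(thresholds, value)

bins = Counter(bin_index(v, thresholds) for v in humidity)
print(dict(bins))                   # {0: 2, 1: 1, 2: 5}
```

Five of the eight values land in the last bin and only one in the middle bin, which illustrates how equal-width bins can end up with very few instances.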
![Page 41: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/41.jpg)
Questions: What should k be? What if some bins have very few instances?
Problem with balance between discretization bias and variance.
The more bins, the lower the bias, but the higher the variance, due to small sample size.
![Page 42: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/42.jpg)
Alternative simple (but effective) discretization method
(Yang & Webb, 2001)
Let n = number of training examples. For each attribute Ai, create √n bins. Sort values of Ai in ascending order, and put √n of them in each bin.
Don’t need add-one smoothing of probabilities.
This gives good balance between discretization bias and variance.
![Page 43: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/43.jpg)
Alternative simple (but effective) discretization method
(Yang & Webb, 2001)
Humidity: 25, 38, 50, 80, 90, 92, 96, 99
Let n = number of training examples. For each attribute Ai, create √n bins. Sort values of Ai in ascending order, and put √n of them in each bin.
Don’t need add-one smoothing of probabilities.
This gives good balance between discretization bias and variance.
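A rough sketch of this equal-frequency scheme, reading the bin count and bin size as √n each. Since √n is rarely an integer, the code rounds it to the nearest whole number; that rounding rule is an assumption made here, not something the slide specifies.

```python
import math

def sqrt_n_bins(values):
    """Sort values and cut them into consecutive bins of about sqrt(n) items."""
    ordered = sorted(values)
    size = max(1, round(math.sqrt(len(ordered))))   # rounding is an assumption
    return [ordered[i:i + size] for i in range(0, len(ordered), size)]

humidity = [25, 38, 50, 80, 90, 92, 96, 99]
print(sqrt_n_bins(humidity))   # [[25, 38, 50], [80, 90, 92], [96, 99]]
```

With n = 8 this yields three bins of at most three values each, so no bin is nearly empty in the way the middle equal-width bin was.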
![Page 44: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/44.jpg)
Beyond Independence: Conditions for the Optimality of the Simple Bayesian Classifier (P. Domingos and M. Pazzani)
Naive Bayes classifier is called “naive” because it assumes attributes are independent of one another given the class.
![Page 45: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/45.jpg)
• This paper asks: why does the naive (“simple”) Bayes classifier, SBC, do so well in domains with clearly dependent attributes?
![Page 46: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/46.jpg)
Experiments
• Compare five classification methods on 30 data sets from the UCI ML database.
SBC = Simple Bayesian Classifier
Default = “Choose class with most representatives in data”
C4.5 = Quinlan’s decision tree induction system
PEBLS = An instance-based learning system
CN2 = A rule-induction system
![Page 47: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/47.jpg)
• For SBC, numeric values were discretized into ten equal-length intervals.
![Page 48: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/48.jpg)
![Page 49: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/49.jpg)
Number of domains in which SBC was more accurate versus less accurate than corresponding classifier
Same as line 1, but significant at 95% confidence
Average rank over all domains (1 is best in each domain)
![Page 50: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/50.jpg)
Measuring Attribute Dependence
They used a simple, pairwise mutual information measure:
For attributes Am and An , dependence is defined as
where AmAn is a “derived attribute”, whose values consist of the possible combinations of values of Am and An
Note: If Am and An are independent given the class C, then D(Am, An | C) = 0.
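One way to compute a measure of this kind is the conditional mutual information H(Am | C) + H(An | C) − H(AmAn | C), where H is the standard conditional entropy. Treat the sketch below as an assumption-laden illustration: the paper's exact entropy definition may differ, and the toy data is hypothetical.

```python
import math
from collections import Counter

def cond_entropy(values, classes):
    """H(V | C) in bits, estimated from paired samples."""
    n = len(classes)
    joint = Counter(zip(values, classes))
    class_n = Counter(classes)
    return -sum(
        (cnt / n) * math.log2(cnt / class_n[c]) for (_, c), cnt in joint.items()
    )

def dependence(a_m, a_n, classes):
    """H(Am | C) + H(An | C) - H(AmAn | C); AmAn pairs up the two attributes."""
    derived = list(zip(a_m, a_n))
    return (cond_entropy(a_m, classes) + cond_entropy(a_n, classes)
            - cond_entropy(derived, classes))

# Toy data (hypothetical): within each class all four value pairs occur equally often.
classes = ["+"] * 4 + ["-"] * 4
a_m = [0, 0, 1, 1] * 2
a_n = [0, 1, 0, 1] * 2
print(dependence(a_m, a_n, classes))   # attributes independent given the class
print(dependence(a_m, a_m, classes))   # an attribute paired with a copy of itself
```

The first call gives 0 bits and the second gives 1 bit, the full entropy of the duplicated binary attribute.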
![Page 51: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/51.jpg)
Results:
(1) SBC is more successful than more complex methods, even when there is substantial dependence among attributes.
(2) No correlation between degree of attribute dependence and SBC’s rank.
But why????
![Page 52: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/52.jpg)
• Explanation:
Suppose C = {+1, −1} are the possible classes. Let x be a new example with attributes <a1, a2, ..., an>.
What the naive Bayes classifier does is calculate two probabilities,
$$P(+1 \mid x) \propto P(+1) \prod_{i} P(a_i \mid +1)$$
$$P(-1 \mid x) \propto P(-1) \prod_{i} P(a_i \mid -1)$$
and return the class that has the maximum probability given x.
![Page 53: Bayesian Learning](https://reader036.vdocument.in/reader036/viewer/2022081520/56816552550346895dd7cb1f/html5/thumbnails/53.jpg)
• The probability calculations are correct only if the independence assumption is correct.
• However, the classification is correct in all cases in which the relative ranking of the two probabilities, as calculated by the SBC, is correct!
• The latter covers a lot more cases than the former.
• Thus, the SBC is effective in many cases in which the independence assumption does not hold.
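A small numeric illustration of this point, using hypothetical numbers: take one binary attribute and feed it to naive Bayes twice, so the two "attributes" are perfectly dependent. The computed posterior is wrong, but the ranking of the classes is unchanged.

```python
from fractions import Fraction

# Hypothetical single binary attribute a with equal class priors.
prior = {+1: Fraction(1, 2), -1: Fraction(1, 2)}
p_a_given = {+1: Fraction(3, 5), -1: Fraction(2, 5)}   # P(a = 1 | class)

def posterior(copies):
    """Naive Bayes posterior for a = 1 when the attribute is repeated `copies` times."""
    score = {c: prior[c] * p_a_given[c] ** copies for c in prior}
    total = sum(score.values())
    return {c: s / total for c, s in score.items()}

true_post = posterior(1)      # one copy: the independence assumption holds trivially
naive_post = posterior(2)     # two perfectly dependent copies, treated as independent

print(true_post[+1], naive_post[+1])   # 3/5 9/13
print(max(true_post, key=true_post.get) == max(naive_post, key=naive_post.get))
```

The duplicated attribute inflates P(+1 | x) from 3/5 to 9/13, yet both versions choose class +1.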