Probabilistic Calculus to the Rescue
Probabilistic Calculus to the Rescue
Suppose we know the likelihood of each of the (propositional) worlds (aka the Joint Probability Distribution).
Then we can use standard rules of probability to compute the likelihood of all queries (as I will remind you).
So, the Joint Probability Distribution is all that you ever need!
In the case of the Pearl example, we just need the joint probability distribution over B, E, A, J, M (32 numbers) -- in general, 2^n separate numbers (which should add up to 1).
If the Joint Distribution is sufficient for reasoning, what is domain knowledge supposed to help us with?
-- Answer: Indirectly, by helping us specify the joint probability distribution with fewer than 2^n numbers
-- The local relations between propositions can be seen as “constraining” the form the joint probability distribution can take!
Burglary => Alarm
Earthquake => Alarm
Alarm => John-calls
Alarm => Mary-calls
Only 10 (instead of 32) numbers to specify!
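To make “10 numbers instead of 32” concrete, here is a minimal sketch; the CPT values below are hypothetical placeholders (only their count matters), showing that the 10 local numbers determine all 32 joint entries:

```python
from itertools import product

# Hypothetical CPT values -- the 10 numbers. Only the structure is the point.
p_b = 0.001                      # P(Burglary)
p_e = 0.002                      # P(Earthquake)
p_a = {(True, True): 0.95, (True, False): 0.94,   # P(Alarm | B, E): 4 numbers
       (False, True): 0.29, (False, False): 0.001}
p_j = {True: 0.90, False: 0.05}  # P(John-calls | Alarm): 2 numbers
p_m = {True: 0.70, False: 0.01}  # P(Mary-calls | Alarm): 2 numbers

def joint(b, e, a, j, m):
    """One entry of the full joint, assembled from the local distributions."""
    pb = p_b if b else 1 - p_b
    pe = p_e if e else 1 - p_e
    pa = p_a[(b, e)] if a else 1 - p_a[(b, e)]
    pj = p_j[a] if j else 1 - p_j[a]
    pm = p_m[a] if m else 1 - p_m[a]
    return pb * pe * pa * pj * pm

# All 32 joint entries are recovered, and they sum to 1 as required.
total = sum(joint(*world) for world in product([True, False], repeat=5))
```

The local relations (B and E are the only parents of A, and so on) are exactly what lets the joint factor into these small tables.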
How do we learn Bayes nets?
• We assumed that both the topology and the CPTs for Bayes nets are given by experts
• What if we want to learn them from data?
– And use them to predict other data..
Statistics Probability
[Diagram: hypothesis H with prior P(H); i.i.d. data items D1, D2, …, DN, each with likelihood P(d|H)]
The true hypothesis eventually dominates… the probability of indefinitely producing uncharacteristic data tends to 0
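A sketch of this dominance, with two hypothetical coin hypotheses under a uniform prior and data whose head frequency matches the “biased” one:

```python
# Two candidate hypotheses about a coin, with a uniform prior over them.
hypotheses = {"fair": 0.5, "biased": 0.8}   # each value is P(heads | H)
posterior = {"fair": 0.5, "biased": 0.5}    # starts at the prior

# Data actually characteristic of the biased coin: 8 heads per 10 tosses,
# repeated to give 300 i.i.d. tosses.
data = (["H"] * 8 + ["T"] * 2) * 30

for outcome in data:
    # Bayesian update: multiply by the likelihood, then renormalize.
    for h, theta in hypotheses.items():
        posterior[h] *= theta if outcome == "H" else 1 - theta
    z = sum(posterior.values())
    posterior = {h: p / z for h, p in posterior.items()}

# The true hypothesis eventually dominates the posterior.
assert posterior["biased"] > 0.999
```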
Bayesian prediction is optimal (Given the hypothesis prior, all other predictions are less likely)
So, BN learning is just probability estimation! (as long as data is complete, and topology is given..)
Works for any topology
[Diagram: the burglary network — B and E are parents of A; A is the parent of J and M]
So, BN learning is just probability estimation?
Data:
B=T, E=T, A=F, J=T, M=F
…
B=F, E=T, A=T, J=F, M=T

P(J|A) = (#data items where J and A are true) / (#data items where A is true)
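The counting estimate can be sketched directly; the four records below are made up for illustration:

```python
# Complete data: each record assigns a truth value to every variable.
data = [
    {"B": True,  "E": True,  "A": False, "J": True,  "M": False},
    {"B": False, "E": True,  "A": True,  "J": False, "M": True},
    {"B": False, "E": False, "A": True,  "J": True,  "M": True},
    {"B": False, "E": False, "A": True,  "J": True,  "M": False},
]

def estimate(child, parent, records):
    """Empirical P(child=T | parent=T): a ratio of two counts."""
    parent_true = [r for r in records if r[parent]]
    both_true = [r for r in parent_true if r[child]]
    return len(both_true) / len(parent_true)

# 2 of the 3 records with A=T also have J=T.
p_j_given_a = estimate("J", "A", data)
```

With complete data and a known topology, every CPT entry is estimated this way, independently of the others.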
Steps in ML-based learning
1. Write down an expression for the likelihood of the data as a function of the parameter(s)
   -- Assume an i.i.d. distribution
2. Write down the derivative of the log likelihood with respect to each parameter
3. Find the parameter values such that the derivatives are zero

There are two ways this step can become complex:
-- Individual (partial) derivatives lead to non-linear functions (depends on the type of distribution the parameters are controlling; binomial is a very easy case)
-- Individual (partial) derivatives involve more than one parameter (thus leading to simultaneous equations)

In general, we will need to use continuous function optimization techniques. One idea is to use gradient descent to find the point where the derivative goes to zero. But for gradient descent to find the global optimum, we need to know for sure that the function we are optimizing has a single optimum (this is why convex functions are important: if the negative log likelihood is a convex function, then gradient descent is guaranteed to find its global minimum).
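For the binomial case (the easy one mentioned above), the three steps work out in closed form, with h heads and t tails:

```latex
% Step 1: likelihood of h heads and t tails under parameter \theta (i.i.d.)
L(\theta) = \theta^{h}(1-\theta)^{t},
\qquad
\log L(\theta) = h\log\theta + t\log(1-\theta)

% Step 2: derivative of the log likelihood
\frac{d}{d\theta}\log L(\theta) = \frac{h}{\theta} - \frac{t}{1-\theta}

% Step 3: set the derivative to zero and solve
\frac{h}{\theta} = \frac{t}{1-\theta}
\;\Longrightarrow\;
\hat{\theta} = \frac{h}{h+t}
```

So the maximum-likelihood estimate is just the empirical frequency of heads.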
Continuous Function Optimization
• Function optimization involves finding the zeroes of the gradient
• We can use the Newton-Raphson method
• ..but we will need the second derivative…
• ..for a function of n variables, the second derivative is an n×n matrix (called the Hessian)
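A minimal one-parameter sketch: Newton-Raphson applied to the gradient of the binomial log likelihood (the counts h and t below are made up):

```python
# We seek a zero of the gradient g(theta) = h/theta - t/(1 - theta)
# of the binomial log likelihood; Newton-Raphson needs g and g'.
h, t = 7, 3

def g(theta):
    """First derivative of the log likelihood."""
    return h / theta - t / (1 - theta)

def g_prime(theta):
    """Second derivative (the 1x1 'Hessian' in this one-parameter case)."""
    return -h / theta**2 - t / (1 - theta)**2

theta = 0.5                      # initial guess
for _ in range(20):
    theta -= g(theta) / g_prime(theta)   # Newton-Raphson update

# The iterates converge to the closed-form MLE h / (h + t) = 0.7.
assert abs(theta - 0.7) < 1e-9
```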
Beyond Known Topology & Complete Data!
• So we just noted that if we know the topology of the Bayes net, and we have complete data, then the parameters are un-entangled and can be learned separately from just data counts.
• Questions: How big a deal is this?
– Can we have known topology?
– Can we have complete data?
• What if there are hidden nodes?
Sometimes you don’t really know the topology: Russell’s restaurant waiting habits.
Classification as a special case of data modeling
• Until now, we were interested in learning the model of the entire data (i.e., we want to be able to predict each of the attribute values of the data)
• Sometimes, we are most interested in predicting just a subset (or even one) of the attributes of the data
– This will be a “classification” task
Structure (Topology) Learning
• Search over different network topologies
• Question: How do we decide which topology is better?
– Idea 1: Check if the independence relations posited by the topology actually hold
– Idea 2: Consider which topology agrees with the data more (i.e., provides higher likelihood)
• But we need to be careful -- increasing the edges in a network cannot reduce likelihood
– Idea 3: Penalize the complexity of the network (either using a prior on network topologies, or using syntactic complexity measures)
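One common instantiation of Idea 3 is a BIC-style score: log likelihood minus a per-parameter penalty. The sketch below, on a made-up dataset over two correlated binary variables, shows the edge model winning on raw likelihood (as it always will) and, here, still winning after the penalty:

```python
import math

# Toy dataset over two binary variables X, Y -- clearly correlated.
data = [(0, 0)] * 40 + [(1, 1)] * 40 + [(0, 1)] * 10 + [(1, 0)] * 10
N = len(data)

def log_lik_independent(data):
    # Topology with no edge: P(x, y) = P(x) P(y); 2 free parameters.
    px = sum(x for x, _ in data) / N
    py = sum(y for _, y in data) / N
    ll = sum(math.log((px if x else 1 - px) * (py if y else 1 - py))
             for x, y in data)
    return ll, 2

def log_lik_edge(data):
    # Topology X -> Y: P(x, y) = P(x) P(y | x); 3 free parameters.
    px = sum(x for x, _ in data) / N
    py_given = {}
    for xv in (0, 1):
        rows = [y for x, y in data if x == xv]
        py_given[xv] = sum(rows) / len(rows)
    ll = sum(math.log((px if x else 1 - px)
                      * (py_given[x] if y else 1 - py_given[x]))
             for x, y in data)
    return ll, 3

def bic(ll, k):
    # Penalize each free parameter; higher scores are better.
    return ll - 0.5 * k * math.log(N)

ll0, k0 = log_lik_independent(data)
ll1, k1 = log_lik_edge(data)
assert ll1 >= ll0                    # adding an edge never reduces likelihood
assert bic(ll1, k1) > bic(ll0, k0)   # here the edge also survives the penalty
```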
Naïve Bayes Models: The Jean Harlow of Bayesnet Learners..
[Naive Bayes topology: the class WillWait is the sole parent of each attribute — Alt, Bar, Est, …]
Example

P(willwait=yes) = 6/12 = 0.5
P(Patrons=“full” | willwait=yes) = 2/6 = 0.333
P(Patrons=“some” | willwait=yes) = 4/6 = 0.666
Similarly, we can show that P(Patrons=“full” | willwait=no) = 0.666

P(willwait=yes | Patrons=full) = P(Patrons=full | willwait=yes) * P(willwait=yes) / P(Patrons=full) = k * 0.333 * 0.5
P(willwait=no | Patrons=full) = k * 0.666 * 0.5
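Plugging the slide’s numbers into Bayes’ rule and normalizing away the constant k:

```python
# The slide's numbers for the class WillWait and attribute Patrons.
p_yes = 0.5
p_full_given_yes = 1 / 3     # P(Patrons=full | willwait=yes)
p_full_given_no = 2 / 3      # P(Patrons=full | willwait=no)

# Unnormalized posteriors; k absorbs the shared denominator P(Patrons=full).
score_yes = p_full_given_yes * p_yes
score_no = p_full_given_no * (1 - p_yes)
k = 1 / (score_yes + score_no)

p_yes_given_full = k * score_yes   # = 1/3, so "no" is the more likely class
```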
Need for Smoothing..
• Suppose I toss a coin twice, and it comes up heads both times
– What is the empirical probability of Rao’s coin coming up tails?
• Suppose I continue to toss the coin another 3000 times, and it comes up heads all these times
– What is the empirical probability of Rao’s coin coming up tails?

What is happening? We have a “prior” on the coin tosses. We slowly modify that prior in light of evidence. How do we get NBC to do it?

“I beseech you, in the bowels of Christ, think it possible you may be mistaken.” -- Cromwell to the synod of the Church of Scotland, 1650 (aka Cromwell’s Rule)
Using M-estimates to improve probability estimates
• The simple frequency-based estimation of P(Ai=vj | Ck) can be inaccurate, especially when the true value is close to zero and the number of training examples is small (so the probability that your examples don’t contain rare cases is quite high)
• Solution: Use the M-estimate
  P(Ai=vj | Ck) = [#(Ck, Ai=vj) + m·p] / [#(Ck) + m]
– m virtual samples, with p being the probability that each of those samples has Ai=vj
– p is the prior probability of Ai taking the value vj; if we don’t have any background information, assume a uniform probability (that is, 1/d if Ai can take d values)
• If we believe that our sample set is large enough, we can keep m small. Otherwise, keep it large.
• Essentially, we are augmenting the #(Ck) normal samples with m more virtual samples drawn according to the prior probability of how Ai takes values

Also, to avoid floating-point underflow, add logarithms of probabilities instead of multiplying the probabilities themselves.
Zero is FOREVER
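A sketch of the M-estimate (the m and p values below are illustrative), together with the log-space trick for avoiding underflow:

```python
import math

def m_estimate(count_joint, count_class, m, p):
    """M-estimate: blend the raw count ratio with m virtual samples of prior p."""
    return (count_joint + m * p) / (count_class + m)

# A value never seen in class Ck: the raw estimate is an irreversible zero.
raw = 0 / 10
smoothed = m_estimate(0, 10, m=2, p=0.5)   # = 1/12: zero is no longer forever
assert raw == 0.0 and smoothed > 0.0

# Multiplying many small probabilities underflows to exactly 0.0 in floats;
# summing their logarithms keeps the result representable.
probs = [1e-5] * 100
naive = math.prod(probs)                   # underflows to 0.0
log_total = sum(math.log(x) for x in probs)
```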
Beyond Known Topology & Complete Data!
Missing Data
What should we do?
-- Idea: Just consider the complete data as the training data, and go ahead and learn the parameters
-- But wait: now that we have parameters, we can infer the missing value! (suppose we infer B to be 1 with probability 0.7 and 0 with probability 0.3)
-- But wait, wait: now that we have inferred the missing value, we can re-estimate the parameters..

Infinite regress? No.. Expectation Maximization

Fractional samples:
1 1 0 (0.7)
1 0 0 (0.3)
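A minimal EM loop in the spirit of the fractional-samples idea, for a single binary variable with some entries missing (the data are made up; with no other variables involved, the fixed point is simply the observed frequency):

```python
# EM for a single Bernoulli parameter theta = P(B=1); None marks missing B.
observed = [1, 1, 0, 1, None, None, None]

theta = 0.5                                  # initial parameter guess
for _ in range(50):
    # E-step: each missing record becomes two fractional samples,
    # B=1 with weight theta and B=0 with weight 1 - theta.
    ones = sum(1 for b in observed if b == 1)
    zeros = sum(1 for b in observed if b == 0)
    missing = sum(1 for b in observed if b is None)
    expected_ones = ones + missing * theta
    # M-step: re-estimate theta from the fractional counts.
    theta = expected_ones / (ones + zeros + missing)

# No infinite regress: the updates converge, here to the observed frequency 3/4.
assert abs(theta - 0.75) < 1e-6
```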
The E-step involves Bayes net inference; we can get by with approximate inference.
The M-step involves maximization; we can get away with just improvement (i.e., a few steps of gradient ascent).
Candy Example
Start with 1000 samples
Initialize the parameters as shown on the slide
The “size of the step” is determined adaptively by where the max of the lower bound is..
-- In contrast, gradient descent requires a step-size parameter
-- Newton-Raphson requires the second derivative..

Why does EM work? The log of a sum does not have an easy closed-form optimum; use Jensen’s inequality and focus on the sum of logs, which is a lower bound. Here F_t(J) is an arbitrary probability distribution over J, and the bound follows by Jensen’s inequality.
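The bound can be written out as follows (X is the observed data, J the hidden variables, and F_t(J) an arbitrary distribution over J):

```latex
% The log of a sum has no easy closed-form optimum:
\log P(X \mid \theta)
  = \log \sum_{J} P(X, J \mid \theta)
  = \log \sum_{J} F_t(J)\,\frac{P(X, J \mid \theta)}{F_t(J)}

% Jensen's inequality (log is concave) turns it into a sum of logs,
% which lower-bounds the log likelihood:
\log P(X \mid \theta) \;\ge\; \sum_{J} F_t(J)\,\log \frac{P(X, J \mid \theta)}{F_t(J)}
```

Maximizing this lower bound over F_t (the E-step) and over θ (the M-step) is what makes each EM iteration an improvement.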