p(b|a) * p(a) p(b) prior · 2019-12-10 · towards solving a problem in the doctrine of chances....
TRANSCRIPT
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B)}$$
Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418
…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…
Bayes’ rule
We call P(A) the “prior” and P(A|B) the “posterior”
Other Forms of Bayes Rule
$$P(A \mid B) = \frac{P(B \mid A)\,P(A)}{P(B \mid A)\,P(A) + P(B \mid {\sim}A)\,P({\sim}A)}$$
$$P(A \mid B \land X) = \frac{P(B \mid A \land X)\,P(A \land X)}{P(B \land X)}$$
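A minimal sketch of the first form, where P(B) is expanded with the law of total probability. The function name and the numbers are hypothetical and chosen only for illustration.

```python
def posterior(prior_a, p_b_given_a, p_b_given_not_a):
    """Return P(A|B) from P(A), P(B|A) and P(B|~A)."""
    # Denominator: P(B) = P(B|A)P(A) + P(B|~A)P(~A)
    evidence = p_b_given_a * prior_a + p_b_given_not_a * (1.0 - prior_a)
    return p_b_given_a * prior_a / evidence


# Hypothetical inputs: P(A) = 0.01, P(B|A) = 0.9, P(B|~A) = 0.05
p = posterior(prior_a=0.01, p_b_given_a=0.9, p_b_given_not_a=0.05)
print(round(p, 4))  # 0.009 / 0.0585, roughly 0.1538
```

Even with a strong P(B|A), a small prior keeps the posterior modest in this made-up case.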
Parameter Estimation
• How to estimate parameters from data?
Maximum Likelihood Principle:
Choose the parameters that maximize the probability of the observed data!
Maximum Likelihood Estimation Recipe
1. Use the log-likelihood
2. Differentiate with respect to the parameters
3. Equate to zero and solve*
*Often requires numerical approximation (no closed-form solution)
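The recipe can be sketched numerically as follows, assuming a Bernoulli model with hypothetical counts (3 successes, 7 failures). The helper names are illustrative. Bisection stands in for "numerical approximation", and it relies on the derivative of the log-likelihood decreasing in θ, which holds here when both counts are positive.

```python
def log_likelihood_gradient(theta, successes, failures):
    # Steps 1 and 2: derivative of successes*log(theta) + failures*log(1 - theta)
    return successes / theta - failures / (1.0 - theta)


def solve_by_bisection(successes, failures, tol=1e-10):
    # Step 3: find the theta where the derivative crosses zero.
    # Assumes successes > 0 and failures > 0 so the root lies inside (0, 1).
    lo, hi = 1e-9, 1.0 - 1e-9
    while hi - lo > tol:
        mid = (lo + hi) / 2.0
        if log_likelihood_gradient(mid, successes, failures) > 0:
            lo = mid  # still climbing, the maximum is to the right
        else:
            hi = mid  # past the peak, the maximum is to the left
    return (lo + hi) / 2.0


theta_hat = solve_by_bisection(successes=3, failures=7)
print(round(theta_hat, 6))
```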
An Example
• Let’s start with the simplest possible case
  – Single observed variable
  – Flipping a bent coin
• We observe:
  – Sequence of heads or tails
  – HTTTTTHTHT
• Goal:
  – Estimate the probability that the next flip comes up heads
Assumptions
• Fixed parameter
  – Probability that a flip comes up heads
• Each flip is independent
  – Doesn’t affect the outcome of other flips
• (IID) Independent and Identically Distributed
Example
• Let’s assume we observe the sequence:
  – HTTTTTHTHT
• What is the best value of θ?
  – Probability of heads
• Intuition: θ should be 0.3 (3 heads out of 10 flips)
• Question: how do we justify this?
Maximum Likelihood Principle
• The value of θ that maximizes the probability of the observed data is best!
• Based on our assumptions, the probability of “HTTTTTHTHT” is:
$$P(\text{HTTTTTHTHT} \mid \theta) = \theta\,(1-\theta)^{5}\,\theta\,(1-\theta)\,\theta\,(1-\theta) = \theta^{3}(1-\theta)^{7}$$
This is the Likelihood Function
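As a rough illustration, the likelihood function can be evaluated directly under the IID assumption by multiplying one factor per flip. The function name and the grid resolution are assumptions made for this sketch.

```python
def likelihood(theta, flips):
    """Probability of the observed flip sequence for a given theta (IID flips)."""
    p = 1.0
    for flip in flips:
        # Heads contributes theta, tails contributes (1 - theta)
        p *= theta if flip == "H" else (1.0 - theta)
    return p


flips = "HTTTTTHTHT"
grid = [i / 100 for i in range(101)]  # candidate values 0.00, 0.01, ..., 1.00
best = max(grid, key=lambda t: likelihood(t, flips))
print(best, likelihood(best, flips))
```

On this grid the largest likelihood falls at θ = 0.3, matching the 3-out-of-10 intuition.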
Maximum Likelihood Principle
• Probability of “HTTTTTHTHT” as a function of θ
• [Plot: the likelihood curve peaks at θ = 0.3]
Maximum Likelihood value of θ
• Log Identities
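The derivation for the coin example can be worked through as below. This is a reconstruction under the IID assumptions, writing #H and #T for the observed head and tail counts, and using the identities log(ab) = log a + log b and log(a^n) = n log a.

```latex
\begin{align*}
L(\theta) &= \theta^{\#H}(1-\theta)^{\#T} \\
\log L(\theta) &= \#H \log\theta + \#T \log(1-\theta) \\
\frac{d}{d\theta}\log L(\theta) &= \frac{\#H}{\theta} - \frac{\#T}{1-\theta} = 0 \\
\#H\,(1-\theta) &= \#T\,\theta \\
\theta_{ML} &= \frac{\#H}{\#H + \#T} = \frac{3}{3+7} = 0.3
\end{align*}
```

Because the logarithm is monotonic, maximizing the log-likelihood gives the same θ as maximizing the likelihood itself.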
The problem with Maximum Likelihood
• What if the coin doesn’t look very bent?
  – Should be somewhere around 0.5?
• What if we saw 3,000 heads and 7,000 tails?
  – Should this really be the same as 3 out of 10?
• Maximum Likelihood
  – No way to quantify our uncertainty.
  – No way to incorporate our prior knowledge!
Q: how to deal with this problem?
Bayesian Parameter Estimation
• Let’s just treat θ like any other variable
• Put a prior on it!
  – Encode our prior knowledge about possible values of θ using a probability distribution
• Now consider two probability distributions:
Posterior over θ
$$P(\theta \mid D) = \frac{P(D \mid \theta)\,P(\theta)}{P(D)}$$
My rule is so cool!
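One way to see the posterior concretely is a grid approximation: multiply likelihood by prior at each candidate θ, then normalize so the values sum to one. This is only a sketch. The uniform prior, the grid of 200 midpoints, and all function names are assumptions made for illustration.

```python
def grid_posterior(flips, prior, grid):
    """Approximate P(theta | D) on a grid via likelihood * prior, then normalize."""
    unnormalized = []
    for theta in grid:
        like = 1.0
        for flip in flips:
            like *= theta if flip == "H" else (1.0 - theta)
        unnormalized.append(like * prior(theta))
    evidence = sum(unnormalized)  # plays the role of P(D) on the grid
    return [u / evidence for u in unnormalized]


def uniform_prior(theta):
    # Assumption: every value of theta is equally plausible beforehand
    return 1.0


grid = [(i + 0.5) / 200 for i in range(200)]  # midpoints of 200 bins on (0, 1)
post = grid_posterior("HTTTTTHTHT", uniform_prior, grid)
posterior_mean = sum(t * p for t, p in zip(grid, post))
print(round(posterior_mean, 3))
```

Unlike a single point estimate, the list `post` describes how plausible each value of θ is after seeing the data.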
How can we encode prior knowledge?
• Example: The coin doesn’t look very bent
  – Assign higher probability to values of θ near 0.5
• Solution: The Beta Distribution
$$\mathrm{Beta}(\theta \mid \alpha, \beta) = \frac{\Gamma(\alpha+\beta)}{\Gamma(\alpha)\,\Gamma(\beta)}\,\theta^{\alpha-1}(1-\theta)^{\beta-1}$$
  – α and β are the hyper-parameters
  – Gamma (Γ) is a continuous generalization of the factorial function
Beta Distribution
• [Plots: densities of Beta(5,5), Beta(100,100), Beta(0.5,0.5), and Beta(1,1)]
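To get a feel for these four shapes without the plots, the Beta density can be evaluated at a few points. This is a minimal sketch using only the standard library. `beta_pdf` is a hypothetical helper, and it works in log space with `math.lgamma` because Γ(200) is too large for a float.

```python
import math


def beta_pdf(theta, alpha, beta):
    """Beta density at theta in (0, 1), computed in log space for stability."""
    log_norm = math.lgamma(alpha + beta) - math.lgamma(alpha) - math.lgamma(beta)
    log_density = (log_norm
                   + (alpha - 1) * math.log(theta)
                   + (beta - 1) * math.log(1.0 - theta))
    return math.exp(log_density)


for a, b in [(5, 5), (100, 100), (0.5, 0.5), (1, 1)]:
    values = [round(beta_pdf(t, a, b), 3) for t in (0.1, 0.5, 0.9)]
    print(f"Beta({a},{b}) at theta = 0.1, 0.5, 0.9: {values}")
```

Beta(1,1) is flat, Beta(5,5) and Beta(100,100) concentrate around 0.5 with increasing sharpness, and Beta(0.5,0.5) puts more density near 0 and 1 than in the middle.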
MAP Estimate
$$\theta_{MAP} = \arg\max_{\theta} P(\theta \mid D) = \frac{\#H + \alpha - 1}{\#T + \#H + \alpha + \beta - 2}$$
• Add-N smoothing
• Pseudo-counts
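The closed form can be tried on the counts from the earlier slides. This is a sketch, assuming a Beta(α, β) prior with α, β ≥ 1 so that the formula gives a valid mode. The function name and the chosen hyper-parameters are illustrative.

```python
def map_estimate(heads, tails, alpha, beta):
    """MAP estimate of theta under a Beta(alpha, beta) prior (assumes alpha, beta >= 1)."""
    # alpha - 1 and beta - 1 act as pseudo-counts of heads and tails
    return (heads + alpha - 1) / (heads + tails + alpha + beta - 2)


print(map_estimate(3, 7, 1, 1))            # 3/10: a flat prior recovers the ML estimate
print(map_estimate(3, 7, 5, 5))            # 7/18: pulled toward 0.5
print(map_estimate(3, 7, 100, 100))        # 102/208: the prior dominates 10 flips
print(map_estimate(3000, 7000, 100, 100))  # 3099/10198: the data dominate the prior
```

This addresses both earlier objections: the prior encodes the belief that the coin is not very bent, and 3,000 heads out of 10,000 flips is no longer treated the same as 3 out of 10.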