p(b|a) * p(a) p(b) prior · 2019-12-10 · towards solving a problem in the doctrine of chances....

P(B|A) * P(A)

P(B) P(A|B) =

Bayes, Thomas (1763) An essay towards solving a problem in the doctrine of chances. Philosophical Transactions of the Royal Society of London, 53:370-418

…by no means merely a curious speculation in the doctrine of chances, but necessary to be solved in order to a sure foundation for all our reasonings concerning past facts, and what is likely to be hereafter…. necessary to be considered by any that would give a clear account of the strength of analogical or inductive reasoning…

Bayes� rule

we call P(A) the �prior� and P(A|B) the �posterior�

Other Forms of Bayes Rule

)(~)|~()()|()()|()|(

APABPAPABPAPABPBAP

+=

)()()|()|(

XBPXAPXABPXBAP

∧

∧∧=∧

Parameter Estimation

• How to estimate parameters from data?

39

Maximum Likelihood Principle:

Choose the parameters that maximize the probability of the observed data!

Maximum Likelihood Estimation Recipe

1. Use the log-likelihood 2. Differentiate with

respect to the parameters

3. *Equate to zero and solve

40

*Often requires numerical approximation (no closed form solution)

An Example

• Let’s start with the simplest possible case – Single observed variable – Flipping a bent coin

41

An Example

• Let’s start with the simplest possible case – Single observed variable – Flipping a bent coin

41

• We Observe: – Sequence of heads or tails – HTTTTTHTHT

• Goal: – Estimate the probability that

the next flip comes up heads

Assumptions

• Fixed parameter – Probability that a flip comes up heads

• Each flip is independent – Doesn’t affect the outcome of other flips

• (IID) Independent and Identically Distributed

42

Example

• Let’s assume we observe the sequence:– HTTTTTHTHT

• What is the best value of ?– Probability of heads

43

Example

• Let’s assume we observe the sequence:– HTTTTTHTHT

• What is the best value of ?– Probability of heads

• Intuition: should be 0.3 (3 out of 10)• Question: how do we justify this?

43

Maximum Likelihood Principle

• The value of which maximizes the probability of the observed data is best!

• Based on our assumptions, the probability of “HTTTTTHTHT” is:

44

Maximum Likelihood Principle

• The value of which maximizes the probability of the observed data is best!

• Based on our assumptions, the probability of “HTTTTTHTHT” is:

44

This is the Likelihood Function

Maximum Likelihood Principle• Probability of “HTTTTTHTHT” as a function of

45

Maximum Likelihood Principle• Probability of “HTTTTTHTHT” as a function of

45

Θ=0.3

Maximum Likelihood Principle• Probability of “HTTTTTHTHT” as a function

of

46

Θ=0.3

Maximum Likelihood value of

47

Log Identities

Maximum Likelihood value of

48

The problem with Maximum Likelihood

• What if the coin doesn’t look very bent?– Should be somewhere around 0.5?

• What if we saw 3,000 heads and 7,000 tails?– Should this really be the same as 3 out of 10?

49




• Maximum Likelihood

49




• Maximum Likelihood– No way to quantify our uncertainty.

49




• Maximum Likelihood– No way to quantify our uncertainty.– No way to incorporate our prior knowledge!

49




• Maximum Likelihood– No way to quantify our uncertainty.– No way to incorporate our prior knowledge!

49

Q: how to deal with this problem?

Bayesian Parameter Estimation

• Let’s just treat like any other variable

• Put a prior on it! – Encode our prior knowledge about possible

values of using a probability distribution

• Now consider two probability distributions:

50

Posterior Over

51

Posterior Over

51

My rule is so cool! !

How can we encode prior knowledge?

• Example: The coin doesn’t look very bent – Assign higher probability to values of near 0.5

• Solution: The Beta Distribution

52




52

Hyper-Parameters




52

Gamma is a continuous

generalization of the Factorial

Function

Hyper-Parameters

Beta Distribution

53

Beta(5,5) Beta(100,100)

Beta Distribution

53

Beta(5,5) Beta(100,100)

Beta(0.5,0.5) Beta(1,1)

MAP Estimate

54

=#H + ↵� 1

#T +#H + ↵+ � � 2

✓MAP = argmax✓

P (✓|D)

MAP Estimate

54

=#H + ↵� 1

#T +#H + ↵+ � � 2

✓MAP = argmax✓

P (✓|D) -Add-N smoothing -Pseudo-counts

p(b|a) * p(a) p(b) prior · 2019-12-10 · towards solving a problem in the doctrine of chances....

Documents