22: MAP
Lisa Yan, CS109, May 27, 2020
web.stanford.edu/class/cs109/lectures/22_map.pdf

TRANSCRIPT
Quick slide reference
Maximum a Posteriori Estimator (22a_map)
Bernoulli MAP: Choosing a prior (22b_bernoulli_any_prior)
Bernoulli MAP: Conjugate prior (22c_bernoulli_conjugate)
Common conjugate distributions (LIVE)
Choosing hyperparameters for conjugate prior (LIVE)
Bayesian Envelope demo (LIVE)
Maximum a Posteriori Estimator
22a_map
Maximum Likelihood Estimator (Review)

Consider a sample of n i.i.d. random variables $X_1, X_2, \ldots, X_n$ (data).

Maximum Likelihood Estimator (MLE): what is the parameter $\theta$ that maximizes the likelihood of our observed data $x_1, x_2, \ldots, x_n$?

$\theta_{MLE} = \arg\max_\theta f(x_1, x_2, \ldots, x_n \mid \theta)$

$L(\theta) = f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta)$   (likelihood of data)

Observations:
• MLE maximizes the probability of observing the data given a parameter $\theta$.
• If we are estimating $\theta$, shouldn't we maximize the probability of $\theta$ directly?

Today: Bayesian estimation, using the Bayesian definition of probability!
Maximum a Posteriori (MAP) Estimator

Consider a sample of n i.i.d. random variables $X_1, X_2, \ldots, X_n$ (data).

MLE: what is the parameter $\theta$ that maximizes the likelihood of our observed data $x_1, x_2, \ldots, x_n$?

$\theta_{MLE} = \arg\max_\theta f(x_1, x_2, \ldots, x_n \mid \theta)$   (likelihood of data)

MAP: given our observed data $x_1, x_2, \ldots, x_n$, what is the most likely parameter $\theta$?

$\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, x_2, \ldots, x_n)$   (posterior distribution of $\theta$)
Maximum a Posteriori (MAP) Estimator

Consider a sample of n i.i.d. random variables $X_1, X_2, \ldots, X_n$ (data).

def: The Maximum a Posteriori (MAP) Estimator of $\theta$ is the value of $\theta$ that maximizes the posterior distribution of $\theta$:

$\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, x_2, \ldots, x_n)$

Intuition with Bayes' Theorem:

$P(\theta \mid \text{data}) = \dfrac{P(\text{data} \mid \theta)\, P(\theta)}{P(\text{data})}$   (posterior $\propto$ likelihood $\times$ prior)

• Likelihood $L(\theta)$: probability of the data given the parameter $\theta$.
• Prior: belief about $\theta$ before seeing data.
• Posterior: belief about $\theta$ after seeing data.
Solving for $\theta_{MAP}$
• Observe data $x_1, x_2, \ldots, x_n$, all i.i.d.
• Let the likelihood be the same as in MLE: $f(x_1, x_2, \ldots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta)$.
• Let the prior distribution of $\theta$ be $g(\theta)$.

$\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, x_2, \ldots, x_n)$
$= \arg\max_\theta \dfrac{f(x_1, x_2, \ldots, x_n \mid \theta)\, g(\theta)}{h(x_1, x_2, \ldots, x_n)}$   (Bayes' Theorem)
$= \arg\max_\theta \dfrac{g(\theta) \prod_{i=1}^n f(x_i \mid \theta)}{h(x_1, x_2, \ldots, x_n)}$   (independence)
$= \arg\max_\theta g(\theta) \prod_{i=1}^n f(x_i \mid \theta)$   ($1/h(x_1, x_2, \ldots, x_n)$ is a positive constant w.r.t. $\theta$)
$= \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right]$
$\theta_{MAP}$: Interpretation 1

(Same setup and derivation as the previous slide.)

$\theta_{MAP}$ maximizes log prior + log-likelihood.
$\theta_{MAP}$: Interpretation 2

(Same setup and derivation as the previous slide.)

$\theta_{MAP}$ is the mode of the posterior distribution of $\theta$. (It also maximizes log prior + log-likelihood, as in Interpretation 1.)
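The log prior + log-likelihood form lends itself to direct numerical optimization when no closed form is handy. Below is a minimal sketch (my addition, not lecture code), assuming SciPy, a Beta(2, 2) prior, and the 7-heads/1-tail data used later in this lecture:

```python
import numpy as np
from scipy.optimize import minimize_scalar
from scipy.stats import beta, bernoulli

# Observed coin flips: 7 heads, 1 tail (the example used later in this lecture).
data = np.array([1, 1, 1, 1, 1, 1, 1, 0])

def neg_log_posterior(theta):
    # Negative of [log prior + log-likelihood]; negated because we minimize.
    log_prior = beta.logpdf(theta, a=2, b=2)           # prior: theta ~ Beta(2, 2)
    log_lik = bernoulli.logpmf(data, theta).sum()      # sum of log f(x_i | theta)
    return -(log_prior + log_lik)

res = minimize_scalar(neg_log_posterior, bounds=(1e-6, 1 - 1e-6), method="bounded")
print(res.x)  # ~0.8, matching the closed-form MAP (2 + 7 - 1)/(2 + 7 + 2 + 1 - 2)
```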
Mode: a statistic of a random variable

The mode of a random variable X is defined as:
• $\arg\max_x p(x)$   (X discrete, with PMF $p(x)$)
• $\arg\max_x f(x)$   (X continuous, with PDF $f(x)$)

Intuitively: the value of X that is "most likely." Note that some distributions may not have a unique mode (e.g., the Uniform distribution, or Bernoulli(0.5)).

So $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, x_2, \ldots, x_n)$ is the most likely $\theta$ given the data $x_1, x_2, \ldots, x_n$.
Bernoulli MAP: Choosing a prior
22b_bernoulli_any_prior
How does MAP work? (for Bernoulli)

1. Choose a model: Bernoulli($\theta$).
2. Observe data: n heads, m tails.
3. Choose a prior on $\theta$: some $g(\theta)$.
4. Find $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$: maximize log prior + log-likelihood, $\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$:
   • Differentiate, set to 0
   • Solve

A lot of our effort in MAP depends on the $g(\theta)$ we choose.
MAP for Bernoulli
• Flip a coin 8 times. Observe n = 7 heads and m = 1 tail.
• Choose a prior on $\theta$. What is $\theta_{MAP}$?

Suppose we pick a prior $\theta \sim N(0.5, 1^2)$, i.e., $g(\theta) = \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2}$.

1. Determine log prior + log-likelihood:
$\log g(\theta) + \log f(x_1, \ldots, x_n \mid \theta) = \log\left[ \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2} \right] + \log\left[ \binom{n+m}{n} \theta^n (1-\theta)^m \right]$
$= -\log\sqrt{2\pi} - \frac{(\theta - 0.5)^2}{2} + \log\binom{n+m}{n} + n \log\theta + m \log(1-\theta)$

2. Differentiate w.r.t. $\theta$, set to 0:
$-(\theta - 0.5) + \frac{n}{\theta} - \frac{m}{1-\theta} = 0$

3. Solve the resulting equation: this is a cubic equation in $\theta$. Why?!

We should choose an "easier" prior. This one is hard!
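Rather than solving the cubic by hand, one can find the zero of the derivative above numerically. A small sketch (my addition, not from the slides), assuming SciPy:

```python
from scipy.optimize import brentq

n, m = 7, 1  # heads, tails

def d_log_posterior(theta):
    # Derivative of the log posterior under the N(0.5, 1) prior:
    # -(theta - 0.5) + n/theta - m/(1 - theta)
    return -(theta - 0.5) + n / theta - m / (1 - theta)

# Positive near 0, negative near 1, so a root finder brackets the maximum.
theta_map = brentq(d_log_posterior, 1e-6, 1 - 1e-6)
print(theta_map)  # ~0.87: just below the MLE 7/8, pulled toward the prior mean 0.5
```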
A better approach: use conjugate distributions

1. Choose a model: Bernoulli($\theta$).
2. Observe data: n heads, m tails.
3. Choose a prior on $\theta$: some $g(\theta)$ (⭐ choose a conjugate distribution).
4. Find $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$: maximize log prior + log-likelihood, $\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$ (differentiate, set to 0, solve).

Up next: conjugate priors are great for MAP!
Bernoulli MAP: Conjugate prior
22c_bernoulli_conjugate
Beta is a conjugate distribution for Bernoulli (Review)

Beta is a conjugate distribution for Bernoulli, meaning:
• The prior and posterior have the same parametric form.
• Practically, conjugacy means an easy update: add the numbers of "successes" and "failures" seen to the Beta parameters.
• You can set the prior to reflect how fair/biased you think the experiment is a priori.

Prior: Beta($a = n_{prior} + 1$, $b = m_{prior} + 1$)
Experiment: observe n successes and m failures
Posterior: Beta($a = n_{prior} + n + 1$, $b = m_{prior} + m + 1$)

Mode of Beta(a, b): $\frac{a-1}{a+b-2}$   (we'll prove this in a few minutes)

The Beta parameters a, b are called hyperparameters. Interpret Beta(a, b) as $a + b - 2$ imaginary trials, of which $a - 1$ are successes.
How does MAP work? (for Bernoulli)

Same pipeline as before: choose a model Bernoulli($\theta$), observe n heads and m tails, choose a prior on $\theta$ (⭐ a conjugate distribution), and find $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$.

Because the posterior is also conjugate, $\theta_{MAP}$ is the mode of the posterior distribution of $\theta$: no need to maximize log prior + log-likelihood by differentiating and solving.
Conjugate strategy: MAP for Bernoulli
• Flip a coin 8 times. Observe n = 7 heads and m = 1 tail. (Define this as the data, D.)
• Choose a prior on $\theta$. What is $\theta_{MAP}$?

1. Choose a prior: suppose we pick $\theta \sim \text{Beta}(a, b)$.
2. Determine the posterior: because Beta is a conjugate distribution for Bernoulli, the posterior distribution is $\theta \mid D \sim \text{Beta}(a + n, b + m)$.
3. Compute MAP: $\theta_{MAP} = \frac{a + n - 1}{a + n + b + m - 2}$   (mode of Beta($a + n$, $b + m$))
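Since the conjugate update is just parameter addition, the whole MAP computation fits in a few lines. A minimal sketch (the helper name is mine, not the lecture's):

```python
def beta_bernoulli_map(a, b, n, m):
    """MAP estimate of a Bernoulli parameter with a Beta(a, b) prior,
    after observing n successes and m failures."""
    a_post, b_post = a + n, b + m                 # conjugate update
    return (a_post - 1) / (a_post + b_post - 2)   # mode of Beta(a_post, b_post)

# 7 heads, 1 tail with a Beta(2, 2) prior -> posterior Beta(9, 3), MAP = 8/10
print(beta_bernoulli_map(2, 2, 7, 1))  # 0.8
```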
MAP in practice
• Flip a coin 8 times. Observe n = 7 heads and m = 1 tail.
• What is the MAP estimator of the Bernoulli parameter $\theta$, if we assume a prior on $\theta$ of Beta(2, 2)?
(Answer)
1. Choose a prior: $\theta \sim \text{Beta}(2, 2)$.
2. Determine the posterior: given the observed data, $\theta \mid D \sim \text{Beta}(2 + 7, 2 + 1) = \text{Beta}(9, 3)$.
3. Compute MAP: $\theta_{MAP} = \frac{9 - 1}{9 + 3 - 2} = \frac{8}{10}$.

Before flipping the coin, we imagined 2 trials: 1 imaginary head, 1 imaginary tail. After the coin flips, we have seen 10 trials: 8 heads (imaginary and real) and 2 tails (imaginary and real).
Proving the mode of Beta

Recall the pipeline: choose a model Bernoulli($\theta$), observe n heads and m tails, choose the conjugate prior Beta(a, b), and find $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$.

These are equivalent interpretations of $\theta_{MAP}$:
• Maximize log prior + log-likelihood, $\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$ (differentiate, set to 0, solve).
• Take the mode of the posterior distribution of $\theta$ (the posterior is also conjugate).

We'll use this equivalence to prove the mode of Beta.
From first principles: MAP for Bernoulli, conjugate prior
• Flip a coin 8 times. Observe n = 7 heads and m = 1 tail.
• Choose a prior on $\theta$. What is $\theta_{MAP}$?

Suppose we pick a prior $\theta \sim \text{Beta}(a, b)$: $g(\theta) = \frac{1}{B(a, b)} \theta^{a-1} (1-\theta)^{b-1}$, where $B(a, b)$ is the normalizing constant.

1. Determine log prior + log-likelihood:
$\log g(\theta) + \log f(x_1, \ldots, x_n \mid \theta) = \log\left[ \frac{1}{B(a,b)} \theta^{a-1} (1-\theta)^{b-1} \right] + \log\left[ \binom{n+m}{n} \theta^n (1-\theta)^m \right]$
$= \log\frac{1}{B(a,b)} + (a-1)\log\theta + (b-1)\log(1-\theta) + \log\binom{n+m}{n} + n\log\theta + m\log(1-\theta)$

2. Differentiate w.r.t. $\theta$, set to 0:
$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0$

3. Solve: (next slide)
From first principles: MAP for Bernoulli, conjugate prior (continued)

3. Solve for $\theta$ (from the previous slide):
$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0 \;\Longrightarrow\; \frac{a+n-1}{\theta} - \frac{b+m-1}{1-\theta} = 0$

$\theta_{MAP} = \frac{a + n - 1}{a + n + b + m - 2}$, the mode of the posterior, Beta($a + n$, $b + m$)!

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.
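As a sanity check (my addition), we can confirm numerically that this closed-form mode matches the argmax of the Beta(9, 3) posterior density from the earlier example:

```python
import numpy as np
from scipy.stats import beta

a_post, b_post = 9, 3  # posterior from 7 heads, 1 tail with a Beta(2, 2) prior
thetas = np.linspace(0.001, 0.999, 10_000)
numeric_mode = thetas[np.argmax(beta.pdf(thetas, a_post, b_post))]
closed_form = (a_post - 1) / (a_post + b_post - 2)
print(numeric_mode, closed_form)  # both ~0.8
```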
(LIVE) 22: MAP
How does MAP work? (Review)

Choose a model with parameter $\theta$, observe data, choose a prior on $\theta$, and find $\theta_{MAP}$.

Two valid interpretations of $\theta_{MAP}$:
• The mode of the posterior distribution of $\theta$: $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$, or
• $\theta_{MAP} = \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right]$ (maximize log prior + log-likelihood).

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.
Conjugate distributions
LIVE
Quick MAP for Bernoulli and Binomial (Review)

Beta(a, b) is a conjugate prior for the probability of success in the Bernoulli and Binomial distributions:
$f(p) = \frac{1}{B(a, b)} p^{a-1} (1-p)^{b-1}$

Prior: Beta(a, b). Saw $a + b - 2$ imaginary trials: $a - 1$ successes, $b - 1$ failures.
Experiment: observe $n + m$ new trials: n successes, m failures.
Posterior: Beta($a + n$, $b + m$)

MAP: $p = \frac{a + n - 1}{a + b + n + m - 2}$
Conjugate distributions

MAP estimator: $\theta_{MAP} = \arg\max_\theta f(\theta \mid x_1, \ldots, x_n)$, the mode of the posterior distribution of $\theta$.

Distribution    Parameter     Conjugate distribution
Bernoulli       $p$           Beta
Binomial        $p$           Beta
Multinomial     $p_i$         Dirichlet
Poisson         $\lambda$     Gamma
Exponential     $\lambda$     Gamma
Normal          $\mu$         Normal
Normal          $\sigma^2$    Inverse Gamma

You don't need to know the Inverse Gamma… but it will know you 🙂

CS109: we'll only focus on MAP for the Bernoulli/Binomial $p$, the Multinomial $p_i$, and the Poisson $\lambda$.
Multinomial is Multiple times the fun

Dirichlet($a_1, a_2, \ldots, a_k$) is a conjugate for the Multinomial. It generalizes the Beta in the same way the Multinomial generalizes the Bernoulli/Binomial:
$f(p_1, p_2, \ldots, p_k) = \frac{1}{B(a_1, a_2, \ldots, a_k)} \prod_{i=1}^{k} p_i^{a_i - 1}$

Prior: Dirichlet($a_1, a_2, \ldots, a_k$). Saw $\sum_{i=1}^{k} a_i - k$ imaginary trials, with $a_i - 1$ of outcome i.
Experiment: observe $n_1 + n_2 + \cdots + n_k$ new trials, with $n_i$ of outcome i.
Posterior: Dirichlet($a_1 + n_1, a_2 + n_2, \ldots, a_k + n_k$)

MAP: $p_i = \frac{a_i + n_i - 1}{\sum_{j=1}^{k} a_j + \sum_{j=1}^{k} n_j - k}$
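The Dirichlet-Multinomial MAP update is again just addition. A minimal sketch (the helper name is mine); note that a Dirichlet(2, …, 2) prior encodes one imagined observation of each outcome, which previews the Laplace smoothing example later in this lecture:

```python
import numpy as np

def dirichlet_multinomial_map(alphas, counts):
    """MAP estimates p_i for a Multinomial with a Dirichlet(alphas) prior,
    after observing counts[i] outcomes of type i."""
    alphas = np.asarray(alphas, dtype=float)
    counts = np.asarray(counts, dtype=float)
    post = alphas + counts                        # conjugate update
    return (post - 1) / (post.sum() - len(post))  # mode of the Dirichlet posterior

# Six-sided die, Dirichlet(2, ..., 2) prior: one imagined roll of each face.
print(dirichlet_multinomial_map([2] * 6, [3, 2, 0, 3, 1, 3]))
# the fractions 4/18, 3/18, 1/18, 4/18, 2/18, 4/18 (printed as decimals)
```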
Good times with Gamma

Gamma($\alpha$, $\beta$) is a conjugate for the Poisson.
• It is also conjugate for the Exponential, but we won't delve into that.
• Mode of the Gamma: $\frac{\alpha - 1}{\beta}$

Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$. Saw $\alpha - 1$ total imaginary events during $\beta$ prior time periods.
Experiment: observe n events during the next t time periods.
Posterior: $\lambda \mid \text{n events in t periods} \sim \text{Gamma}(\alpha + n, \beta + t)$

MAP: $\lambda_{MAP} = \frac{\alpha + n - 1}{\beta + t}$
MAP for Poisson

Let $\lambda$ be the average number of successes in a time period. (Gamma($\alpha$, $\beta$) is conjugate for the Poisson; its mode is $\frac{\alpha - 1}{\beta}$.)

1. What does it mean to have a prior of $\lambda \sim \text{Gamma}(11, 5)$?
   It means we observed 10 imaginary events in 5 time periods, i.e., we observe at Poisson rate $\lambda = 2$.

Now perform the experiment and see 11 events in the next 2 time periods.
2. Given your prior, what is the posterior distribution?
3. What is $\lambda_{MAP}$?
(Answers)
2. Posterior: $\lambda \mid \text{11 events in 2 periods} \sim \text{Gamma}(11 + 11, 5 + 2) = \text{Gamma}(22, 7)$.
3. $\lambda_{MAP} = \frac{22 - 1}{7} = 3$, the updated Poisson rate.
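The same worked example in a few lines of Python (a sketch; the helper name is mine):

```python
def gamma_poisson_map(alpha, beta_, n_events, t_periods):
    """MAP estimate of a Poisson rate with a Gamma(alpha, beta_) prior,
    after observing n_events over t_periods time periods."""
    alpha_post = alpha + n_events        # conjugate update
    beta_post = beta_ + t_periods
    return (alpha_post - 1) / beta_post  # mode of Gamma(alpha_post, beta_post)

# Prior Gamma(11, 5) (rate ~2), then 11 events in 2 periods -> Gamma(22, 7)
print(gamma_poisson_map(11, 5, 11, 2))  # 3.0, the updated Poisson rate
```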
Interlude for jokes/announcements
Announcements
• Problem Set 6: no late days or on-time bonus.
• Quiz 2 errata: Question 1(d), full credit for everyone.
• Final course grade progress: to be released after class.
Interesting probability news
CS109 Current Events Spreadsheet
What Role Should Employers Play in Testing Workers?
https://www.nytimes.com/2020/05/22/business/employers-coronavirus-testing.html
One nascent strategy circulating among public health experts is running “pooled” coronavirus tests, in which a workplace could combine multiple saliva or nasal swabs into one larger sample representing dozens of employees.
Remember Quiz 1??
Choosing hyperparameters for conjugate prior
LIVE
Where'd you get them priors?
• Let $\theta$ be the probability a coin turns up heads.
• Model $\theta$ with 2 different priors:
  ◦ Prior 1: Beta(3, 8): 2 imaginary heads, 7 imaginary tails (prior mode: 2/9)
  ◦ Prior 2: Beta(7, 4): 6 imaginary heads, 3 imaginary tails (prior mode: 6/9)

Now flip 100 coins and get 58 heads and 42 tails.
1. What are the two posterior distributions?
2. What are the modes of the two posterior distributions?
(Answers)
Posterior 1: Beta(3 + 58, 8 + 42) = Beta(61, 50), mode: 60/109
Posterior 2: Beta(7 + 58, 4 + 42) = Beta(65, 46), mode: 64/109

As long as we collect enough data, the posteriors will converge to the true value.
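A small sketch (my addition) making the same point in code: very different priors end up with nearby posterior modes once 100 flips arrive:

```python
def beta_mode(a, b):
    return (a - 1) / (a + b - 2)

priors = {"Beta(3, 8)": (3, 8), "Beta(7, 4)": (7, 4)}
heads, tails = 58, 42
for name, (a, b) in priors.items():
    print(name, "prior mode:", round(beta_mode(a, b), 3),
          "-> posterior mode:", round(beta_mode(a + heads, b + tails), 3))
# Prior modes 0.222 and 0.667 move to posterior modes ~0.55 and ~0.59.
```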
Laplace smoothing

MAP with Laplace smoothing uses a prior which represents k imagined observations of each outcome.
• For categorical data (i.e., Multinomial, Bernoulli/Binomial).
• Also known as additive smoothing.

Laplace estimate: imagine k = 1 of each outcome (follows from Laplace's "law of succession").

Example: Laplace estimates for the coin probabilities from the aforementioned experiment (100 coins: 58 heads, 42 tails):
heads: 59/102, tails: 43/102

Laplace smoothing:
• Easy to implement/remember
Back to our happy Laplace

Consider our previous 6-sided die.
• Roll the die n = 12 times.
• Observe: 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.

Recall the MLE: $p_1 = 3/12$, $p_2 = 2/12$, $p_3 = 0/12$, $p_4 = 3/12$, $p_5 = 1/12$, $p_6 = 3/12$.
⚠ Note $p_3 = 0/12$: the MLE says rolling a three is impossible.

What are your Laplace estimates for each roll outcome?
(Answer)
The Laplace estimates add k = 1 imagined observation of each outcome:
$p_i = \frac{n_i + 1}{n + 6}$

$p_1 = 4/18$, $p_2 = 3/18$, $p_3 = 1/18$, $p_4 = 4/18$, $p_5 = 2/18$, $p_6 = 4/18$

Laplace smoothing:
• Easy to implement/remember
• Avoids estimating any parameter as 0 (see the sketch below)
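In code, Laplace smoothing is a one-line change from the MLE. A minimal sketch, assuming the die counts above:

```python
counts = [3, 2, 0, 3, 1, 3]  # die rolls: n = 12 total
n, k = sum(counts), len(counts)

mle = [c / n for c in counts]                  # p_3 = 0: threes "impossible"
laplace = [(c + 1) / (n + k) for c in counts]  # add one imagined roll per face
print(laplace)  # 4/18, 3/18, 1/18, 4/18, 2/18, 4/18 (printed as decimals)
```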
Bayesian Envelope Demo
LIVE
Two envelopes

Two envelopes: one contains $x, the other contains $2x.
• Select an envelope and open it.
• Before opening the envelope, you think either choice is equally good.

Is the following reasoning valid?
• Let X = the $ in the envelope you selected.
• Let Y = the $ in the other envelope.

$E[Y \mid X] = \frac{1}{2} \cdot \frac{X}{2} + \frac{1}{2} \cdot 2X = \frac{5}{4} X$

Follow-up: what happened by opening the envelope?
(Answer) The reasoning is not valid:
• It assumes all values of x (where $0 < x < \infty$) are equally likely.
• There are infinitely many values of x.
• So it is not a true probability distribution over x (it does not integrate to 1).
Are all values equally likely?

[Figure: a candidate density $f(x)$ plotted against x from 0 to 100.] Infinite powers of two…
Two envelopes: the subjectivity of probability

Your belief about the contents of the envelopes: since the implied distribution over x is not a true probability distribution, what is our distribution over x?

Frequentist: play the game infinitely many times and see how often different values come up. Problem: you can only play the game once.

Bayesian: have a prior belief about the distribution of x.
• A prior belief is a subjective probability (as are all probabilities).
• Can answer questions when there is no/limited data.
• As we get more data, the prior belief is "swamped" by the data.
Two envelopes: the subjectivity of probability

[Figure: one possible subjective prior density $f(x)$ over x, 0 to 100.]
The envelope, please

Bayesian: have a prior distribution $f(x)$ over x.
• Let X = the $ in the envelope you selected. Open the envelope to determine X.
• Let Y = the $ in the other envelope.
• Opening the envelope provides data to compute $f(x \mid X)$…
• …which allows you to compute $E[Y \mid X]$.

If $X > E[Y \mid X]$, keep your envelope; otherwise switch. No inconsistency!

Of course, you need to think about your prior distribution over x… Imagine the envelope you opened contained $20.01. Should you switch?

Bayesian probability: it doesn't matter how you determine your prior, but you must have one (whatever it is). (A sketch of this switching rule follows below.)
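The lecture leaves the prior open. As an illustration only, here is a sketch of the switching rule with an assumed exponential prior over the smaller amount x; the prior, the helper names, and the change-of-variables weighting are my own construction, not the lecture's:

```python
import math

def prior(x, rate=0.05):
    """ASSUMED prior density over the smaller amount x: Exponential(rate), mean $20."""
    return rate * math.exp(-rate * x) if x > 0 else 0.0

def expected_other(v):
    """E[Y | your opened envelope shows v], under the assumed prior.
    If you hold the smaller amount, x = v, with weight f(v); if you hold the
    larger, x = v/2, with weight f(v/2)/2 (the extra 1/2 is the
    change-of-variables factor for the density of 2X at v)."""
    w_small = prior(v)           # other envelope holds 2v
    w_large = prior(v / 2) / 2   # other envelope holds v/2
    return (w_small * 2 * v + w_large * (v / 2)) / (w_small + w_large)

v = 20.0
e = expected_other(v)
print(e, "-> switch" if e > v else "-> keep")  # ~26.4 with this prior: switch
```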
How much is a half cent?
Have a wonderful Wednesday!