22: MAP
Lisa Yan, CS109, May 27, 2020
web.stanford.edu/class/cs109/lectures/22_map.pdf

Quick slide reference

 3  Maximum a Posteriori Estimator                22a_map
11  Bernoulli MAP: Choosing a prior               22b_bernoulli_any_prior
15  Bernoulli MAP: Conjugate prior                22c_bernoulli_conjugate
24  Common conjugate distributions                LIVE
36  Choosing hyperparameters for conjugate prior  LIVE
43  Bayesian Envelope demo                        LIVE

Maximum a Posteriori Estimator (22a_map)

Maximum Likelihood Estimator (Review)

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

Maximum Likelihood Estimator (MLE): What is the parameter $\theta$ that maximizes the likelihood of our observed data $x_1, x_2, \dots, x_n$?

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta f(x_1, x_2, \dots, x_n \mid \theta)$$

$$L(\theta) = f(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) \quad \text{(likelihood of data)}$$

Observations:
• The MLE maximizes the probability of observing the data given a parameter $\theta$.
• If we are estimating $\theta$, shouldn't we maximize the probability of $\theta$ directly?

Today: Bayesian estimation, using the Bayesian definition of probability!

Maximum a Posteriori (MAP) Estimator

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

MLE: What is the parameter $\theta$ that maximizes the likelihood $f(x_1, \dots, x_n \mid \theta)$ of our observed data? (maximizes the likelihood of the data)

MAP: Given our observed data $x_1, x_2, \dots, x_n$, what is the most likely parameter $\theta$?

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, x_2, \dots, x_n) \quad \text{(maximizes the posterior distribution of } \theta\text{)}$$

Maximum a Posteriori (MAP) Estimator

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

def: The Maximum a Posteriori (MAP) Estimator of $\theta$ is the value of $\theta$ that maximizes the posterior distribution of $\theta$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, x_2, \dots, x_n)$$

Intuition with Bayes' Theorem:

$$\underbrace{P(\theta \mid \text{data})}_{\text{posterior}} = \frac{\overbrace{P(\text{data} \mid \theta)}^{\text{likelihood}} \; \overbrace{P(\theta)}^{\text{prior}}}{P(\text{data})}$$

• Likelihood $L(\theta)$: probability of the data given parameter $\theta$.
• Prior: belief about $\theta$ before seeing data.
• Posterior: belief about $\theta$ after seeing data.

Solving for $\hat{\theta}_{\text{MAP}}$

• Observe data: $x_1, x_2, \dots, x_n$, all i.i.d.
• Let the likelihood be the same as for MLE: $f(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta)$.
• Let the prior distribution on $\theta$ be $g(\theta)$.

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \arg\max_\theta f(\theta \mid x_1, \dots, x_n) \\
&= \arg\max_\theta \frac{f(x_1, \dots, x_n \mid \theta)\, g(\theta)}{h(x_1, \dots, x_n)} && \text{(Bayes' Theorem)} \\
&= \arg\max_\theta \frac{g(\theta) \prod_{i=1}^n f(x_i \mid \theta)}{h(x_1, \dots, x_n)} && \text{(independence)} \\
&= \arg\max_\theta\, g(\theta) \prod_{i=1}^n f(x_i \mid \theta) && \text{($1/h(x_1, \dots, x_n)$ is a positive constant w.r.t. $\theta$)} \\
&= \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right] && \text{(log is monotone increasing)}
\end{aligned}$$

Two valid interpretations of $\hat{\theta}_{\text{MAP}}$ (from the same derivation):
1. $\hat{\theta}_{\text{MAP}}$ maximizes log prior + log-likelihood.
2. $\hat{\theta}_{\text{MAP}}$ is the mode of the posterior distribution of $\theta$.
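Before moving on, here is a minimal numeric sketch of interpretation 1 (my addition, not from the slides): evaluate $\log g(\theta) + \sum_i \log f(x_i \mid \theta)$ on a fine grid and take the argmax. The data and prior below (7 heads, 1 tail; a Normal(0.5, 1) prior) anticipate the example solved analytically on a later slide.

```python
import numpy as np

# A minimal sketch (not from the slides): compute theta_MAP numerically by
# evaluating log prior + log-likelihood on a fine grid and taking the argmax.
def map_estimate(log_prior, log_likelihood, grid):
    objective = log_prior(grid) + log_likelihood(grid)  # log g(theta) + sum_i log f(x_i | theta)
    return grid[np.argmax(objective)]

# Illustrative assumption: Bernoulli data (7 heads, 1 tail) with a
# Normal(0.5, 1) prior on theta -- the setup on a later slide.
thetas  = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_g   = lambda t: -0.5 * (t - 0.5) ** 2           # log N(0.5, 1) density, up to a constant
log_lik = lambda t: 7 * np.log(t) + np.log(1 - t)   # log [theta^7 (1-theta)^1], up to a constant
print(map_estimate(log_g, log_lik, thetas))         # ~0.87
```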

Mode: a statistic of a random variable

The mode of a random variable $X$ is defined as $\arg\max_x p(x)$ ($X$ discrete, with PMF $p(x)$), or $\arg\max_x f(x)$ ($X$ continuous, with PDF $f(x)$).

• Intuitively: the value of $X$ that is "most likely."
• Note that some distributions may not have a unique mode (e.g., the Uniform distribution, or Bernoulli(0.5)).

So $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$ is the most likely $\theta$ given the data $x_1, x_2, \dots, x_n$.

Bernoulli MAP: Choosing a prior (22b_bernoulli_any_prior)

How does MAP work? (for Bernoulli)

1. Choose a model: Bernoulli($p$).
2. Observe data: $n$ heads, $m$ tails.
3. Choose a prior on $\theta$: some $g(\theta)$.
4. Find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$, i.e., maximize log prior + log-likelihood,
   $$\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$$
   • Differentiate, set to 0
   • Solve

A lot of our effort in MAP depends on the $g(\theta)$ we choose.

MAP for Bernoulli

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

Suppose we pick a prior $\theta \sim \mathcal{N}(0.5, 1^2)$:  $g(\theta) = \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2}$.

1. Determine log prior + log-likelihood:
$$\log g(\theta) + \log f(x_1, \dots, x_n \mid \theta) = \log \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2} + \log \binom{n+m}{n} \theta^n (1-\theta)^m$$
$$= -\log \sqrt{2\pi} - (\theta - 0.5)^2/2 + \log \binom{n+m}{n} + n \log \theta + m \log(1 - \theta)$$

2. Differentiate w.r.t. $\theta$, set to 0:
$$-(\theta - 0.5) + \frac{n}{\theta} - \frac{m}{1 - \theta} = 0$$

3. Solve the resulting equations: cubic equations... why 😭

We should choose an "easier" prior. This one is hard!
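For the curious, the cubic is solvable numerically. A sketch (my addition, not from the slides): multiplying the stationarity condition above through by $\theta(1-\theta)$ with $n = 7$, $m = 1$ gives $\theta^3 - 1.5\theta^2 - 7.5\theta + 7 = 0$, and we keep the root in $(0, 1)$:

```python
import numpy as np

# Sketch (my derivation): multiplying
#   -(theta - 0.5) + 7/theta - 1/(1 - theta) = 0
# by theta * (1 - theta) gives the cubic
#   theta^3 - 1.5 theta^2 - 7.5 theta + 7 = 0.
roots = np.roots([1, -1.5, -7.5, 7])
theta_map = next(r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1)
print(theta_map)   # ~0.87 -- solvable, but clearly harder than a conjugate prior
```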

A better approach: use conjugate distributions

Same pipeline: choose the model Bernoulli($p$); observe data ($n$ heads, $m$ tails); choose a prior $g(\theta)$ on $\theta$; find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$ by maximizing $\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$ (differentiate, set to 0, solve).

⭐ This time, choose $g(\theta)$ to be a conjugate distribution. Up next: conjugate priors are great for MAP!

Bernoulli MAP: Conjugate prior (22c_bernoulli_conjugate)

Beta is a conjugate distribution for Bernoulli (Review)

Beta is a conjugate distribution for Bernoulli, meaning:
• The prior and posterior parametric forms are the same.
• Practically, conjugacy means an easy update: add the numbers of "successes" and "failures" seen to the Beta parameters.
• You can set the prior to reflect how fair/biased you think the experiment is a priori.

Prior: Beta($a = n_{\text{imaginary}} + 1$, $b = m_{\text{imaginary}} + 1$)
Experiment: observe $n$ successes and $m$ failures.
Posterior: Beta($a = n_{\text{imaginary}} + n + 1$, $b = m_{\text{imaginary}} + m + 1$)

The Beta parameters $a, b$ are called hyperparameters. Interpret Beta($a, b$) as $a + b - 2$ imaginary trials, of which $a - 1$ are successes.

Mode of Beta($a, b$): $\dfrac{a - 1}{a + b - 2}$ (we'll prove this in a few minutes)
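As a quick sanity check of that mode formula (my sketch, assuming scipy is available; not part of the lecture), compare it against a numeric argmax of the Beta density:

```python
import numpy as np
from scipy.stats import beta

# Sanity check: the mode formula (a-1)/(a+b-2) vs. a numeric argmax of the
# Beta(a, b) density, for a couple of hyperparameter choices.
grid = np.linspace(0.0005, 0.9995, 99_999)
for a, b in [(2, 2), (9, 3)]:
    numeric = grid[np.argmax(beta.pdf(grid, a, b))]
    formula = (a - 1) / (a + b - 2)
    print(f"Beta({a},{b}): numeric mode ~ {numeric:.4f}, formula {formula:.4f}")
```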

How does MAP work? (for Bernoulli, with a conjugate prior)

The pipeline is unchanged: choose the model Bernoulli($p$), observe data ($n$ heads, $m$ tails), choose a conjugate prior on $\theta$, and find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$. But instead of differentiating log prior + log-likelihood, we can read off the mode of the posterior distribution of $\theta$ (the posterior is also conjugate).

Conjugate strategy: MAP for Bernoulli

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

1. Choose a prior: suppose we pick $\theta \sim \text{Beta}(a, b)$.
2. Determine the posterior: because Beta is a conjugate distribution for Bernoulli, the posterior distribution is $\theta \mid D \sim \text{Beta}(a + n, b + m)$, where $D$ denotes the data.
3. Compute MAP: $\hat{\theta}_{\text{MAP}} = \dfrac{a + n - 1}{a + n + b + m - 2}$ (the mode of Beta($a + n$, $b + m$)).

MAP in practice

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• What is the MAP estimator of the Bernoulli parameter $p$ if we assume a prior on $p$ of Beta(2, 2)?

1. Choose a prior: $\theta \sim \text{Beta}(2, 2)$.
2. Determine the posterior: the posterior distribution of $\theta$ given the observed data is Beta(9, 3).
3. Compute MAP: $\hat{\theta}_{\text{MAP}} = \dfrac{9 - 1}{9 + 3 - 2} = \dfrac{8}{10}$.

Before flipping the coin, we imagined 2 trials: 1 imaginary head and 1 imaginary tail. After the coin flips, we have seen 10 trials: 8 heads (imaginary and real) and 2 tails (imaginary and real).
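The same computation as a few lines of plain Python (a sketch of the arithmetic above):

```python
# Conjugate-prior MAP for the example above: Beta(2, 2) prior, 7 heads, 1 tail.
a, b = 2, 2            # prior hyperparameters: 1 imaginary head, 1 imaginary tail
n, m = 7, 1            # observed heads and tails
a_post, b_post = a + n, b + m                     # posterior: Beta(9, 3)
theta_map = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(9, 3)
print(theta_map)       # 0.8, i.e., 8/10
```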

Proving the mode of Beta

The two interpretations of $\hat{\theta}_{\text{MAP}}$ are equivalent: maximizing log prior + log-likelihood (differentiate, set to 0, solve) gives the mode of the posterior distribution of $\theta$. We'll use this equivalence, with a Beta($a, b$) prior for the Bernoulli model, to prove the mode of Beta.

From first principles: MAP for Bernoulli, conjugate prior

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

Suppose we pick a prior $\theta \sim \text{Beta}(a, b)$:  $g(\theta) = \frac{1}{K}\, \theta^{a-1} (1 - \theta)^{b-1}$, where $K$ is the normalizing constant.

1. Determine log prior + log-likelihood:
$$\log g(\theta) + \log f(x_1, \dots, x_n \mid \theta) = \log \frac{1}{K}\, \theta^{a-1}(1-\theta)^{b-1} + \log \binom{n+m}{n} \theta^n (1-\theta)^m$$
$$= \log \frac{1}{K} + (a-1)\log\theta + (b-1)\log(1-\theta) + \log\binom{n+m}{n} + n\log\theta + m\log(1-\theta)$$

2. Differentiate w.r.t. $\theta$, set to 0:
$$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0$$

3. Solve (next slide).

3. Solve for $\theta$ (from the previous slide):

$$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0 \;\implies\; \frac{a+n-1}{\theta} - \frac{b+m-1}{1-\theta} = 0$$

$$\hat{\theta}_{\text{MAP}} = \frac{a + n - 1}{a + n + b + m - 2}$$

This is exactly the mode of the posterior, Beta($a + n$, $b + m$)!

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.


How does MAP work? (Review)

Choose a model with parameter $\theta$; observe data; choose a prior on $\theta$; find

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n) = \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right]$$

Two valid interpretations of $\hat{\theta}_{\text{MAP}}$: maximize log prior + log-likelihood, or take the mode of the posterior distribution of $\theta$.

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.

Conjugate distributions (LIVE)

Quick MAP for Bernoulli and Binomial (Review)

Beta($a, b$) is a conjugate prior for the probability of success in the Bernoulli and Binomial distributions:
$$f(p) = \frac{1}{B(a, b)}\, p^{a-1} (1 - p)^{b-1}$$

Prior: Beta($a, b$): saw $a + b - 2$ imaginary trials, with $a - 1$ successes and $b - 1$ failures.
Experiment: observe $n + m$ new trials: $n$ successes, $m$ failures.
Posterior: Beta($a + n$, $b + m$)

MAP: $\hat{p} = \dfrac{a + n - 1}{a + b + n + m - 2}$

Conjugate distributions

MAP estimator: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$, the mode of the posterior distribution of $\theta$.

Distribution (parameter)      Conjugate distribution
Bernoulli ($p$)               Beta
Binomial ($p$)                Beta
Multinomial ($p_i$)           Dirichlet
Poisson ($\lambda$)           Gamma
Exponential ($\lambda$)       Gamma
Normal ($\mu$)                Normal
Normal ($\sigma^2$)           Inverse Gamma

You don't need to know Inverse Gamma... but it will know you 🙂

CS109: We'll only focus on MAP for Bernoulli/Binomial $p$, Multinomial $p_i$, and Poisson $\lambda$.

Multinomial is multiple times the fun

Dirichlet($a_1, a_2, \dots, a_k$) is a conjugate for Multinomial:
$$f(p_1, p_2, \dots, p_k) = \frac{1}{B(a_1, a_2, \dots, a_k)} \prod_{i=1}^k p_i^{a_i - 1}$$

It generalizes Beta in the same way Multinomial generalizes Bernoulli/Binomial.

Prior: Dirichlet($a_1, a_2, \dots, a_k$): saw $\sum_{i=1}^k a_i - k$ imaginary trials, with $a_i - 1$ of outcome $i$.
Experiment: observe $n_1 + n_2 + \dots + n_k$ new trials, with $n_i$ of outcome $i$.
Posterior: Dirichlet($a_1 + n_1, a_2 + n_2, \dots, a_k + n_k$)

MAP: $\hat{p}_i = \dfrac{a_i + n_i - 1}{\sum_{j=1}^k a_j + \sum_{j=1}^k n_j - k}$
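A sketch of this update in code (my addition; the Dirichlet(2, ..., 2) prior and die counts are illustrative assumptions; the counts reappear in the Laplace example later):

```python
import numpy as np

# Dirichlet-Multinomial MAP sketch: element-wise version of the formula above.
def dirichlet_map(alphas, counts):
    alphas = np.asarray(alphas, dtype=float)   # prior hyperparameters a_1..a_k
    counts = np.asarray(counts, dtype=float)   # observed counts n_1..n_k
    k = alphas.size
    return (alphas + counts - 1) / (alphas.sum() + counts.sum() - k)

# Illustrative example: Dirichlet(2, ..., 2) prior over a 6-sided die,
# then observe counts [3, 2, 0, 3, 1, 3].
print(dirichlet_map([2] * 6, [3, 2, 0, 3, 1, 3]))   # [4, 3, 1, 4, 2, 4] / 18
```

Note that a Dirichlet(2, ..., 2) prior yields exactly the add-one (Laplace) estimates that appear later in this lecture.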

Good times with Gamma

Gamma($\alpha, \beta$) is a conjugate for Poisson.
• Also conjugate for Exponential, but we won't delve into that.
• Mode of Gamma($\alpha, \beta$): $(\alpha - 1)/\beta$

Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$: saw $\alpha - 1$ total imaginary events during $\beta$ prior time periods.
Experiment: observe $n$ events during the next $t$ time periods.
Posterior: $\lambda \mid n$ events in $t$ periods $\sim \text{Gamma}(\alpha + n, \beta + t)$

MAP: $\hat{\lambda}_{\text{MAP}} = \dfrac{\alpha + n - 1}{\beta + t}$

MAP for Poisson

Let $\lambda$ be the average number of successes in a time period. Gamma($\alpha, \beta$) is conjugate for Poisson, with mode $(\alpha - 1)/\beta$.

1. What does it mean to have a prior of $\lambda \sim \text{Gamma}(11, 5)$?
   We observed 10 imaginary events in 5 time periods, i.e., we observe at a Poisson rate of 2.

Now perform the experiment and see 11 events in the next 2 time periods.

2. Given your prior, what is the posterior distribution?
   $\lambda \mid 11$ events in 2 periods $\sim \text{Gamma}(22, 7)$

3. What is $\hat{\lambda}_{\text{MAP}}$?
   $\hat{\lambda}_{\text{MAP}} = \dfrac{22 - 1}{7} = 3$, the updated Poisson rate.
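The same update as a few lines of code (a sketch of the arithmetic above):

```python
# Gamma-Poisson conjugate update for the example above.
alpha, beta_ = 11, 5        # prior: 10 imaginary events over 5 periods (rate 2)
n, t = 11, 2                # data: 11 events in the next 2 periods
alpha_post, beta_post = alpha + n, beta_ + t      # posterior: Gamma(22, 7)
lam_map = (alpha_post - 1) / beta_post            # mode of Gamma(22, 7)
print(lam_map)              # 3.0, the updated Poisson rate
```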

Interlude for jokes/announcements

Announcements

• Problem Set 6: no late days or on-time bonus.
• Quiz 2 errata: Question 1(d) full credit for everyone.
• Final course grade progress: to be released after class.

Interesting probability news

CS109 Current Events Spreadsheet

What Role Should Employers Play in Testing Workers?

https://www.nytimes.com/2020/05/22/business/employers-coronavirus-testing.html

One nascent strategy circulating among public health experts is running “pooled” coronavirus tests, in which a workplace could combine multiple saliva or nasal swabs into one larger sample representing dozens of employees.

Remember Quiz 1??

Choosing hyperparameters for conjugate prior (LIVE)

Where'd you get them priors?

• Let $\theta$ be the probability a coin turns up heads.
• Model $\theta$ with 2 different priors:
  ◦ Prior 1: Beta(3, 8): 2 imaginary heads, 7 imaginary tails (mode: 2/9)
  ◦ Prior 2: Beta(7, 4): 6 imaginary heads, 3 imaginary tails (mode: 6/9)

Now flip 100 coins and get 58 heads and 42 tails.

1. What are the two posterior distributions?
   Posterior 1: Beta(61, 50). Posterior 2: Beta(65, 46).
2. What are the modes of the two posterior distributions?
   Mode 1: 60/109. Mode 2: 64/109.

[Figure: the two prior densities and the two posterior densities of $\theta$.]

As long as we collect enough data, the posteriors will converge to the true value.
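A short sketch (my addition) reproducing both updates numerically:

```python
# Two different priors, same data (58 heads, 42 tails out of 100 flips).
n, m = 58, 42
for a, b in [(3, 8), (7, 4)]:
    a_post, b_post = a + n, b + m
    mode = (a_post - 1) / (a_post + b_post - 2)
    print(f"Beta({a},{b}) -> Beta({a_post},{b_post}), mode {mode:.3f}")
# Beta(3,8) -> Beta(61,50), mode 60/109 ~ 0.550
# Beta(7,4) -> Beta(65,46), mode 64/109 ~ 0.587
```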

Laplace smoothing

MAP with Laplace smoothing: a prior which represents $k$ imagined observations of each outcome.
• For categorical data (i.e., Multinomial, Bernoulli/Binomial).
• Also known as additive smoothing.

Laplace estimate: imagine $k = 1$ of each outcome (follows from Laplace's "law of succession").

Example: Laplace estimates for the coin probabilities from the aforementioned experiment (100 coins: 58 heads, 42 tails):
heads: 59/102, tails: 43/102

Laplace smoothing is easy to implement and remember.

Back to our happy Laplace

Consider our previous 6-sided die.
• Roll the die $n = 12$ times.
• Observe: 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.

Recall $\hat{\theta}_{\text{MLE}}$:
$$\hat{p}_1 = 3/12, \; \hat{p}_2 = 2/12, \; \hat{p}_3 = 0/12, \; \hat{p}_4 = 3/12, \; \hat{p}_5 = 1/12, \; \hat{p}_6 = 3/12$$

What are your Laplace estimates for each roll outcome? With $k = 6$ outcomes,
$$\hat{p}_i = \frac{n_i + 1}{n + k}$$
$$\hat{p}_1 = 4/18, \; \hat{p}_2 = 3/18, \; \hat{p}_3 = 1/18, \; \hat{p}_4 = 4/18, \; \hat{p}_5 = 2/18, \; \hat{p}_6 = 4/18$$

Laplace smoothing:
• Easy to implement/remember.
• Avoids estimating a parameter of 0.
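The die example as a short code sketch:

```python
import numpy as np

# Laplace (add-one) smoothing for the die example: add one imaginary
# observation of each outcome before normalizing.
counts  = np.array([3, 2, 0, 3, 1, 3])
mle     = counts / counts.sum()                        # [3,2,0,3,1,3] / 12
laplace = (counts + 1) / (counts.sum() + counts.size)  # [4,3,1,4,2,4] / 18
print(laplace)   # no outcome is estimated to have probability 0
```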

Bayesian Envelope Demo (LIVE)

Two envelopes

Two envelopes: one contains $\$x$, the other contains $\$2x$.
• Select an envelope and open it.
• Before opening the envelope, either choice seems equally good.

Is the following reasoning valid?
• Let $X$ = \$ in the envelope you selected.
• Let $Y$ = \$ in the other envelope.

$$E[Y \mid X] = \frac{1}{2} \cdot \frac{X}{2} + \frac{1}{2} \cdot 2X = \frac{5}{4} X$$

No. The reasoning:
• assumes all values of $x$ (where $0 < x < \infty$) are equally likely;
• but there are infinitely many values of $x$;
• so this is not a true probability distribution over $x$ (it does not integrate to 1).

Follow-up: what happened by opening the envelope?

Are all values equally likely?

[Figure: a flat "density" $p(x)$ sketched over $0 \le x \le 100$, captioned "Infinite powers of two…" — a uniform distribution over all $x > 0$ cannot integrate to 1.]

Two envelopes: the subjectivity of probability

Your belief about the contents of the envelopes: since the implied distribution over $x$ is not a true probability distribution, what is our distribution over $x$?

Frequentist: play the game infinitely many times and see how often different values come up. Problem: you can only play the game once.

Bayesian: have a prior belief about the distribution of $x$.
• The prior belief is a subjective probability (as are all probabilities).
• Can answer questions when there is no/limited data.
• As we get more data, the prior belief is "swamped" by the data.

[Figure: a subjective prior density $p(x)$ over $x$, sketched on $0 \le x \le 100$.]

The envelope, please

Bayesian: have a prior distribution $g(x)$ over $x$.
• Let $X$ = \$ in the envelope you selected. Open the envelope to determine $X$.
• Let $Y$ = \$ in the other envelope.
• Opening the envelope provides data to compute the posterior $g(x \mid X)$…
• …which allows you to compute $E[Y \mid X]$.

If $X > E[Y \mid X]$, keep your envelope; otherwise switch. No inconsistency!

Of course, you need to think about your prior distribution over $x$… Imagine if the envelope you opened contained \$20.01. Should you switch?

Bayesian probability: it doesn't matter how you determine your prior, but you must have one (whatever it is).
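To make the "you must have a prior" point concrete, here is a Monte Carlo sketch (my addition; the Exponential prior is entirely made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the Bayesian envelope logic. Assumed prior (made up
# for illustration): the smaller amount x ~ Exponential(mean $20).
x = rng.exponential(scale=20, size=1_000_000)   # smaller amount in each play
picked_smaller = rng.random(x.size) < 0.5       # which envelope we happened to open
seen  = np.where(picked_smaller, x, 2 * x)      # amount in the opened envelope
other = np.where(picked_smaller, 2 * x, x)      # amount in the unopened envelope

# Condition on having seen roughly $30: under this prior, is switching worth it?
mask = np.abs(seen - 30) < 0.5
print(seen[mask].mean(), other[mask].mean())    # ~30 vs. ~37: with this prior, switch
```

Under this assumed prior, switching is favorable for small observed amounts (here, roughly $X < 40\ln 4 \approx \$55$) and unfavorable for larger ones; a different prior gives a different cutoff, which is exactly the point.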

How much is a half cent?

Have a wonderful Wednesday!