22: MAP
Lisa Yan, CS109, May 27, 2020
web.stanford.edu/class/cs109/lectures/22_map.pdf

Quick slide reference

 3  Maximum a Posteriori Estimator                22a_map
11  Bernoulli MAP: Choosing a prior               22b_bernoulli_any_prior
15  Bernoulli MAP: Conjugate prior                22c_bernoulli_conjugate
24  Common conjugate distributions                LIVE
36  Choosing hyperparameters for conjugate prior  LIVE
43  Bayesian Envelope demo                        LIVE

Maximum a Posteriori Estimator (22a_map)

Maximum Likelihood Estimator (Review)

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

Maximum Likelihood Estimator (MLE): What is the parameter $\theta$ that maximizes the likelihood of our observed data $x_1, x_2, \dots, x_n$?

$$\hat{\theta}_{\text{MLE}} = \arg\max_\theta f(x_1, x_2, \dots, x_n \mid \theta)$$

$$L(\theta) = f(x_1, x_2, \dots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta) \quad \text{(likelihood of data)}$$

Observations:
• The MLE maximizes the probability of observing the data given a parameter $\theta$.
• If we are estimating $\theta$, shouldn't we maximize the probability of $\theta$ directly?

Today: Bayesian estimation, using the Bayesian definition of probability!

Maximum a Posteriori (MAP) Estimator

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

MLE: What is the parameter $\theta$ that maximizes the likelihood $f(x_1, \dots, x_n \mid \theta)$ of our observed data? (maximizes the likelihood of the data)

MAP: Given our observed data $x_1, x_2, \dots, x_n$, what is the most likely parameter $\theta$?

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, x_2, \dots, x_n) \quad \text{(maximizes the posterior distribution of } \theta\text{)}$$

Maximum a Posteriori (MAP) Estimator

Consider a sample of $n$ i.i.d. random variables $X_1, X_2, \dots, X_n$ (data).

def: The Maximum a Posteriori (MAP) Estimator of $\theta$ is the value of $\theta$ that maximizes the posterior distribution of $\theta$:

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, x_2, \dots, x_n)$$

Intuition with Bayes' Theorem:

$$\underbrace{P(\theta \mid \text{data})}_{\text{posterior}} = \frac{\overbrace{P(\text{data} \mid \theta)}^{\text{likelihood}} \; \overbrace{P(\theta)}^{\text{prior}}}{P(\text{data})}$$

• Likelihood $L(\theta)$: probability of the data given parameter $\theta$.
• Prior: belief about $\theta$ before seeing data.
• Posterior: belief about $\theta$ after seeing data.

Solving for $\hat{\theta}_{\text{MAP}}$

• Observe data: $x_1, x_2, \dots, x_n$, all i.i.d.
• Let the likelihood be the same as for MLE: $f(x_1, \dots, x_n \mid \theta) = \prod_{i=1}^n f(x_i \mid \theta)$.
• Let the prior distribution on $\theta$ be $g(\theta)$.

$$\begin{aligned}
\hat{\theta}_{\text{MAP}} &= \arg\max_\theta f(\theta \mid x_1, \dots, x_n) \\
&= \arg\max_\theta \frac{f(x_1, \dots, x_n \mid \theta)\, g(\theta)}{h(x_1, \dots, x_n)} && \text{(Bayes' Theorem)} \\
&= \arg\max_\theta \frac{g(\theta) \prod_{i=1}^n f(x_i \mid \theta)}{h(x_1, \dots, x_n)} && \text{(independence)} \\
&= \arg\max_\theta\, g(\theta) \prod_{i=1}^n f(x_i \mid \theta) && \text{($1/h(x_1, \dots, x_n)$ is a positive constant w.r.t. $\theta$)} \\
&= \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right] && \text{(log is monotone increasing)}
\end{aligned}$$

Two valid interpretations of $\hat{\theta}_{\text{MAP}}$ (from the same derivation):
1. $\hat{\theta}_{\text{MAP}}$ maximizes log prior + log-likelihood.
2. $\hat{\theta}_{\text{MAP}}$ is the mode of the posterior distribution of $\theta$.
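Before moving on, here is a minimal numeric sketch of interpretation 1 (my addition, not from the slides): evaluate $\log g(\theta) + \sum_i \log f(x_i \mid \theta)$ on a fine grid and take the argmax. The data and prior below (7 heads, 1 tail; a Normal(0.5, 1) prior) anticipate the example solved analytically on a later slide.

```python
import numpy as np

# A minimal sketch (not from the slides): compute theta_MAP numerically by
# evaluating log prior + log-likelihood on a fine grid and taking the argmax.
def map_estimate(log_prior, log_likelihood, grid):
    objective = log_prior(grid) + log_likelihood(grid)  # log g(theta) + sum_i log f(x_i | theta)
    return grid[np.argmax(objective)]

# Illustrative assumption: Bernoulli data (7 heads, 1 tail) with a
# Normal(0.5, 1) prior on theta -- the setup on a later slide.
thetas  = np.linspace(1e-6, 1 - 1e-6, 100_001)
log_g   = lambda t: -0.5 * (t - 0.5) ** 2           # log N(0.5, 1) density, up to a constant
log_lik = lambda t: 7 * np.log(t) + np.log(1 - t)   # log [theta^7 (1-theta)^1], up to a constant
print(map_estimate(log_g, log_lik, thetas))         # ~0.87
```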

Mode: a statistic of a random variable

The mode of a random variable $X$ is defined as $\arg\max_x p(x)$ ($X$ discrete, with PMF $p(x)$), or $\arg\max_x f(x)$ ($X$ continuous, with PDF $f(x)$).

• Intuitively: the value of $X$ that is "most likely."
• Note that some distributions may not have a unique mode (e.g., the Uniform distribution, or Bernoulli(0.5)).

So $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$ is the most likely $\theta$ given the data $x_1, x_2, \dots, x_n$.

Bernoulli MAP: Choosing a prior (22b_bernoulli_any_prior)

How does MAP work? (for Bernoulli)

1. Choose a model: Bernoulli($p$).
2. Observe data: $n$ heads, $m$ tails.
3. Choose a prior on $\theta$: some $g(\theta)$.
4. Find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$, i.e., maximize log prior + log-likelihood,
   $$\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$$
   • Differentiate, set to 0
   • Solve

A lot of our effort in MAP depends on the $g(\theta)$ we choose.

MAP for Bernoulli

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

Suppose we pick a prior $\theta \sim \mathcal{N}(0.5, 1^2)$:  $g(\theta) = \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2}$.

1. Determine log prior + log-likelihood:
$$\log g(\theta) + \log f(x_1, \dots, x_n \mid \theta) = \log \frac{1}{\sqrt{2\pi}} e^{-(\theta - 0.5)^2/2} + \log \binom{n+m}{n} \theta^n (1-\theta)^m$$
$$= -\log \sqrt{2\pi} - (\theta - 0.5)^2/2 + \log \binom{n+m}{n} + n \log \theta + m \log(1 - \theta)$$

2. Differentiate w.r.t. $\theta$, set to 0:
$$-(\theta - 0.5) + \frac{n}{\theta} - \frac{m}{1 - \theta} = 0$$

3. Solve the resulting equations: cubic equations... why 😭

We should choose an "easier" prior. This one is hard!
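For the curious, the cubic is solvable numerically. A sketch (my addition, not from the slides): multiplying the stationarity condition above through by $\theta(1-\theta)$ with $n = 7$, $m = 1$ gives $\theta^3 - 1.5\theta^2 - 7.5\theta + 7 = 0$, and we keep the root in $(0, 1)$:

```python
import numpy as np

# Sketch (my derivation): multiplying
#   -(theta - 0.5) + 7/theta - 1/(1 - theta) = 0
# by theta * (1 - theta) gives the cubic
#   theta^3 - 1.5 theta^2 - 7.5 theta + 7 = 0.
roots = np.roots([1, -1.5, -7.5, 7])
theta_map = next(r.real for r in roots if abs(r.imag) < 1e-9 and 0 < r.real < 1)
print(theta_map)   # ~0.87 -- solvable, but clearly harder than a conjugate prior
```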

A better approach: use conjugate distributions

Same pipeline: choose the model Bernoulli($p$); observe data ($n$ heads, $m$ tails); choose a prior $g(\theta)$ on $\theta$; find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$ by maximizing $\log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta)$ (differentiate, set to 0, solve).

⭐ This time, choose $g(\theta)$ to be a conjugate distribution. Up next: conjugate priors are great for MAP!

Bernoulli MAP: Conjugate prior (22c_bernoulli_conjugate)

Beta is a conjugate distribution for Bernoulli (Review)

Beta is a conjugate distribution for Bernoulli, meaning:
• The prior and posterior parametric forms are the same.
• Practically, conjugacy means an easy update: add the numbers of "successes" and "failures" seen to the Beta parameters.
• You can set the prior to reflect how fair/biased you think the experiment is a priori.

Prior: Beta($a = n_{\text{imaginary}} + 1$, $b = m_{\text{imaginary}} + 1$)
Experiment: observe $n$ successes and $m$ failures.
Posterior: Beta($a = n_{\text{imaginary}} + n + 1$, $b = m_{\text{imaginary}} + m + 1$)

The Beta parameters $a, b$ are called hyperparameters. Interpret Beta($a, b$) as $a + b - 2$ imaginary trials, of which $a - 1$ are successes.

Mode of Beta($a, b$): $\dfrac{a - 1}{a + b - 2}$ (we'll prove this in a few minutes)
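As a quick sanity check of that mode formula (my sketch, assuming scipy is available; not part of the lecture), compare it against a numeric argmax of the Beta density:

```python
import numpy as np
from scipy.stats import beta

# Sanity check: the mode formula (a-1)/(a+b-2) vs. a numeric argmax of the
# Beta(a, b) density, for a couple of hyperparameter choices.
grid = np.linspace(0.0005, 0.9995, 99_999)
for a, b in [(2, 2), (9, 3)]:
    numeric = grid[np.argmax(beta.pdf(grid, a, b))]
    formula = (a - 1) / (a + b - 2)
    print(f"Beta({a},{b}): numeric mode ~ {numeric:.4f}, formula {formula:.4f}")
```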

How does MAP work? (for Bernoulli, with a conjugate prior)

The pipeline is unchanged: choose the model Bernoulli($p$), observe data ($n$ heads, $m$ tails), choose a conjugate prior on $\theta$, and find $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$. But instead of differentiating log prior + log-likelihood, we can read off the mode of the posterior distribution of $\theta$ (the posterior is also conjugate).

Conjugate strategy: MAP for Bernoulli

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

1. Choose a prior: suppose we pick $\theta \sim \text{Beta}(a, b)$.
2. Determine the posterior: because Beta is a conjugate distribution for Bernoulli, the posterior distribution is $\theta \mid D \sim \text{Beta}(a + n, b + m)$, where $D$ denotes the data.
3. Compute MAP: $\hat{\theta}_{\text{MAP}} = \dfrac{a + n - 1}{a + n + b + m - 2}$ (the mode of Beta($a + n$, $b + m$)).

MAP in practice

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• What is the MAP estimator of the Bernoulli parameter $p$ if we assume a prior on $p$ of Beta(2, 2)?

1. Choose a prior: $\theta \sim \text{Beta}(2, 2)$.
2. Determine the posterior: the posterior distribution of $\theta$ given the observed data is Beta(9, 3).
3. Compute MAP: $\hat{\theta}_{\text{MAP}} = \dfrac{9 - 1}{9 + 3 - 2} = \dfrac{8}{10}$.

Before flipping the coin, we imagined 2 trials: 1 imaginary head and 1 imaginary tail. After the coin flips, we have seen 10 trials: 8 heads (imaginary and real) and 2 tails (imaginary and real).
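The same computation as a few lines of plain Python (a sketch of the arithmetic above):

```python
# Conjugate-prior MAP for the example above: Beta(2, 2) prior, 7 heads, 1 tail.
a, b = 2, 2            # prior hyperparameters: 1 imaginary head, 1 imaginary tail
n, m = 7, 1            # observed heads and tails
a_post, b_post = a + n, b + m                     # posterior: Beta(9, 3)
theta_map = (a_post - 1) / (a_post + b_post - 2)  # mode of Beta(9, 3)
print(theta_map)       # 0.8, i.e., 8/10
```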

Proving the mode of Beta

The two interpretations of $\hat{\theta}_{\text{MAP}}$ are equivalent: maximizing log prior + log-likelihood (differentiate, set to 0, solve) gives the mode of the posterior distribution of $\theta$. We'll use this equivalence, with a Beta($a, b$) prior for the Bernoulli model, to prove the mode of Beta.

From first principles: MAP for Bernoulli, conjugate prior

• Flip a coin 8 times. Observe $n = 7$ heads and $m = 1$ tail.
• Choose a prior on $\theta$. What is $\hat{\theta}_{\text{MAP}}$?

Suppose we pick a prior $\theta \sim \text{Beta}(a, b)$:  $g(\theta) = \frac{1}{K}\, \theta^{a-1} (1 - \theta)^{b-1}$, where $K$ is the normalizing constant.

1. Determine log prior + log-likelihood:
$$\log g(\theta) + \log f(x_1, \dots, x_n \mid \theta) = \log \frac{1}{K}\, \theta^{a-1}(1-\theta)^{b-1} + \log \binom{n+m}{n} \theta^n (1-\theta)^m$$
$$= \log \frac{1}{K} + (a-1)\log\theta + (b-1)\log(1-\theta) + \log\binom{n+m}{n} + n\log\theta + m\log(1-\theta)$$

2. Differentiate w.r.t. $\theta$, set to 0:
$$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0$$

3. Solve (next slide).

3. Solve for $\theta$ (from the previous slide):

$$\frac{a-1}{\theta} + \frac{n}{\theta} - \frac{b-1}{1-\theta} - \frac{m}{1-\theta} = 0 \;\implies\; \frac{a+n-1}{\theta} - \frac{b+m-1}{1-\theta} = 0$$

$$\hat{\theta}_{\text{MAP}} = \frac{a + n - 1}{a + n + b + m - 2}$$

This is exactly the mode of the posterior, Beta($a + n$, $b + m$)!

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.


How does MAP work? (Review)

Choose a model with parameter $\theta$; observe data; choose a prior on $\theta$; find

$$\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n) = \arg\max_\theta \left[ \log g(\theta) + \sum_{i=1}^n \log f(x_i \mid \theta) \right]$$

Two valid interpretations of $\hat{\theta}_{\text{MAP}}$: maximize log prior + log-likelihood, or take the mode of the posterior distribution of $\theta$.

If we choose a conjugate prior, we avoid calculus with MAP: just report the mode of the posterior.

Conjugate distributions (LIVE)

Quick MAP for Bernoulli and Binomial (Review)

Beta($a, b$) is a conjugate prior for the probability of success in the Bernoulli and Binomial distributions:
$$f(p) = \frac{1}{B(a, b)}\, p^{a-1} (1 - p)^{b-1}$$

Prior: Beta($a, b$): saw $a + b - 2$ imaginary trials, with $a - 1$ successes and $b - 1$ failures.
Experiment: observe $n + m$ new trials: $n$ successes, $m$ failures.
Posterior: Beta($a + n$, $b + m$)

MAP: $\hat{p} = \dfrac{a + n - 1}{a + b + n + m - 2}$

Conjugate distributions

MAP estimator: $\hat{\theta}_{\text{MAP}} = \arg\max_\theta f(\theta \mid x_1, \dots, x_n)$, the mode of the posterior distribution of $\theta$.

Distribution (parameter)      Conjugate distribution
Bernoulli ($p$)               Beta
Binomial ($p$)                Beta
Multinomial ($p_i$)           Dirichlet
Poisson ($\lambda$)           Gamma
Exponential ($\lambda$)       Gamma
Normal ($\mu$)                Normal
Normal ($\sigma^2$)           Inverse Gamma

You don't need to know Inverse Gamma... but it will know you 🙂

CS109: We'll only focus on MAP for Bernoulli/Binomial $p$, Multinomial $p_i$, and Poisson $\lambda$.

Multinomial is multiple times the fun

Dirichlet($a_1, a_2, \dots, a_k$) is a conjugate for Multinomial:
$$f(p_1, p_2, \dots, p_k) = \frac{1}{B(a_1, a_2, \dots, a_k)} \prod_{i=1}^k p_i^{a_i - 1}$$

It generalizes Beta in the same way Multinomial generalizes Bernoulli/Binomial.

Prior: Dirichlet($a_1, a_2, \dots, a_k$): saw $\sum_{i=1}^k a_i - k$ imaginary trials, with $a_i - 1$ of outcome $i$.
Experiment: observe $n_1 + n_2 + \dots + n_k$ new trials, with $n_i$ of outcome $i$.
Posterior: Dirichlet($a_1 + n_1, a_2 + n_2, \dots, a_k + n_k$)

MAP: $\hat{p}_i = \dfrac{a_i + n_i - 1}{\sum_{j=1}^k a_j + \sum_{j=1}^k n_j - k}$
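A sketch of this update in code (my addition; the Dirichlet(2, ..., 2) prior and die counts are illustrative assumptions; the counts reappear in the Laplace example later):

```python
import numpy as np

# Dirichlet-Multinomial MAP sketch: element-wise version of the formula above.
def dirichlet_map(alphas, counts):
    alphas = np.asarray(alphas, dtype=float)   # prior hyperparameters a_1..a_k
    counts = np.asarray(counts, dtype=float)   # observed counts n_1..n_k
    k = alphas.size
    return (alphas + counts - 1) / (alphas.sum() + counts.sum() - k)

# Illustrative example: Dirichlet(2, ..., 2) prior over a 6-sided die,
# then observe counts [3, 2, 0, 3, 1, 3].
print(dirichlet_map([2] * 6, [3, 2, 0, 3, 1, 3]))   # [4, 3, 1, 4, 2, 4] / 18
```

Note that a Dirichlet(2, ..., 2) prior yields exactly the add-one (Laplace) estimates that appear later in this lecture.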

Good times with Gamma

Gamma($\alpha, \beta$) is a conjugate for Poisson.
• Also conjugate for Exponential, but we won't delve into that.
• Mode of Gamma($\alpha, \beta$): $(\alpha - 1)/\beta$

Prior: $\lambda \sim \text{Gamma}(\alpha, \beta)$: saw $\alpha - 1$ total imaginary events during $\beta$ prior time periods.
Experiment: observe $n$ events during the next $t$ time periods.
Posterior: $\lambda \mid n$ events in $t$ periods $\sim \text{Gamma}(\alpha + n, \beta + t)$

MAP: $\hat{\lambda}_{\text{MAP}} = \dfrac{\alpha + n - 1}{\beta + t}$

MAP for Poisson

Let $\lambda$ be the average number of successes in a time period. Gamma($\alpha, \beta$) is conjugate for Poisson, with mode $(\alpha - 1)/\beta$.

1. What does it mean to have a prior of $\lambda \sim \text{Gamma}(11, 5)$?
   We observed 10 imaginary events in 5 time periods, i.e., we observe at a Poisson rate of 2.

Now perform the experiment and see 11 events in the next 2 time periods.

2. Given your prior, what is the posterior distribution?
   $\lambda \mid 11$ events in 2 periods $\sim \text{Gamma}(22, 7)$

3. What is $\hat{\lambda}_{\text{MAP}}$?
   $\hat{\lambda}_{\text{MAP}} = \dfrac{22 - 1}{7} = 3$, the updated Poisson rate.
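The same update as a few lines of code (a sketch of the arithmetic above):

```python
# Gamma-Poisson conjugate update for the example above.
alpha, beta_ = 11, 5        # prior: 10 imaginary events over 5 periods (rate 2)
n, t = 11, 2                # data: 11 events in the next 2 periods
alpha_post, beta_post = alpha + n, beta_ + t      # posterior: Gamma(22, 7)
lam_map = (alpha_post - 1) / beta_post            # mode of Gamma(22, 7)
print(lam_map)              # 3.0, the updated Poisson rate
```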

Interlude for jokes/announcements

Announcements

• Problem Set 6: no late days or on-time bonus.
• Quiz 2 errata: Question 1(d) full credit for everyone.
• Final course grade progress: to be released after class.

Interesting probability news

CS109 Current Events Spreadsheet

What Role Should Employers Play in Testing Workers?

https://www.nytimes.com/2020/05/22/business/employers-coronavirus-testing.html

One nascent strategy circulating among public health experts is running “pooled” coronavirus tests, in which a workplace could combine multiple saliva or nasal swabs into one larger sample representing dozens of employees.

Remember Quiz 1??

Choosing hyperparameters for conjugate prior (LIVE)

Where'd you get them priors?

• Let $\theta$ be the probability a coin turns up heads.
• Model $\theta$ with 2 different priors:
  ◦ Prior 1: Beta(3, 8): 2 imaginary heads, 7 imaginary tails (mode: 2/9)
  ◦ Prior 2: Beta(7, 4): 6 imaginary heads, 3 imaginary tails (mode: 6/9)

Now flip 100 coins and get 58 heads and 42 tails.

1. What are the two posterior distributions?
   Posterior 1: Beta(61, 50). Posterior 2: Beta(65, 46).
2. What are the modes of the two posterior distributions?
   Mode 1: 60/109. Mode 2: 64/109.

[Figure: the two prior densities and the two posterior densities of $\theta$.]

As long as we collect enough data, the posteriors will converge to the true value.
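A short sketch (my addition) reproducing both updates numerically:

```python
# Two different priors, same data (58 heads, 42 tails out of 100 flips).
n, m = 58, 42
for a, b in [(3, 8), (7, 4)]:
    a_post, b_post = a + n, b + m
    mode = (a_post - 1) / (a_post + b_post - 2)
    print(f"Beta({a},{b}) -> Beta({a_post},{b_post}), mode {mode:.3f}")
# Beta(3,8) -> Beta(61,50), mode 60/109 ~ 0.550
# Beta(7,4) -> Beta(65,46), mode 64/109 ~ 0.587
```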

Laplace smoothing

MAP with Laplace smoothing: a prior which represents $k$ imagined observations of each outcome.
• For categorical data (i.e., Multinomial, Bernoulli/Binomial).
• Also known as additive smoothing.

Laplace estimate: imagine $k = 1$ of each outcome (follows from Laplace's "law of succession").

Example: Laplace estimates for the coin probabilities from the aforementioned experiment (100 coins: 58 heads, 42 tails):
heads: 59/102, tails: 43/102

Laplace smoothing is easy to implement and remember.

Back to our happy Laplace

Consider our previous 6-sided die.
• Roll the die $n = 12$ times.
• Observe: 3 ones, 2 twos, 0 threes, 3 fours, 1 five, 3 sixes.

Recall $\hat{\theta}_{\text{MLE}}$:
$$\hat{p}_1 = 3/12, \; \hat{p}_2 = 2/12, \; \hat{p}_3 = 0/12, \; \hat{p}_4 = 3/12, \; \hat{p}_5 = 1/12, \; \hat{p}_6 = 3/12$$

What are your Laplace estimates for each roll outcome? With $k = 6$ outcomes,
$$\hat{p}_i = \frac{n_i + 1}{n + k}$$
$$\hat{p}_1 = 4/18, \; \hat{p}_2 = 3/18, \; \hat{p}_3 = 1/18, \; \hat{p}_4 = 4/18, \; \hat{p}_5 = 2/18, \; \hat{p}_6 = 4/18$$

Laplace smoothing:
• Easy to implement/remember.
• Avoids estimating a parameter of 0.
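The die example as a short code sketch:

```python
import numpy as np

# Laplace (add-one) smoothing for the die example: add one imaginary
# observation of each outcome before normalizing.
counts  = np.array([3, 2, 0, 3, 1, 3])
mle     = counts / counts.sum()                        # [3,2,0,3,1,3] / 12
laplace = (counts + 1) / (counts.sum() + counts.size)  # [4,3,1,4,2,4] / 18
print(laplace)   # no outcome is estimated to have probability 0
```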

Bayesian Envelope Demo (LIVE)

Two envelopes

Two envelopes: one contains $\$x$, the other contains $\$2x$.
• Select an envelope and open it.
• Before opening the envelope, either choice seems equally good.

Is the following reasoning valid?
• Let $X$ = \$ in the envelope you selected.
• Let $Y$ = \$ in the other envelope.

$$E[Y \mid X] = \frac{1}{2} \cdot \frac{X}{2} + \frac{1}{2} \cdot 2X = \frac{5}{4} X$$

No. The reasoning:
• assumes all values of $x$ (where $0 < x < \infty$) are equally likely;
• but there are infinitely many values of $x$;
• so this is not a true probability distribution over $x$ (it does not integrate to 1).

Follow-up: what happened by opening the envelope?

Are all values equally likely?

[Figure: a flat "density" $p(x)$ sketched over $0 \le x \le 100$, captioned "Infinite powers of two…" — a uniform distribution over all $x > 0$ cannot integrate to 1.]

Two envelopes: the subjectivity of probability

Your belief about the contents of the envelopes: since the implied distribution over $x$ is not a true probability distribution, what is our distribution over $x$?

Frequentist: play the game infinitely many times and see how often different values come up. Problem: you can only play the game once.

Bayesian: have a prior belief about the distribution of $x$.
• The prior belief is a subjective probability (as are all probabilities).
• Can answer questions when there is no/limited data.
• As we get more data, the prior belief is "swamped" by the data.

[Figure: a subjective prior density $p(x)$ over $x$, sketched on $0 \le x \le 100$.]

The envelope, please

Bayesian: have a prior distribution $g(x)$ over $x$.
• Let $X$ = \$ in the envelope you selected. Open the envelope to determine $X$.
• Let $Y$ = \$ in the other envelope.
• Opening the envelope provides data to compute the posterior $g(x \mid X)$…
• …which allows you to compute $E[Y \mid X]$.

If $X > E[Y \mid X]$, keep your envelope; otherwise switch. No inconsistency!

Of course, you need to think about your prior distribution over $x$… Imagine if the envelope you opened contained \$20.01. Should you switch?

Bayesian probability: it doesn't matter how you determine your prior, but you must have one (whatever it is).
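To make the "you must have a prior" point concrete, here is a Monte Carlo sketch (my addition; the Exponential prior is entirely made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Monte Carlo sketch of the Bayesian envelope logic. Assumed prior (made up
# for illustration): the smaller amount x ~ Exponential(mean $20).
x = rng.exponential(scale=20, size=1_000_000)   # smaller amount in each play
picked_smaller = rng.random(x.size) < 0.5       # which envelope we happened to open
seen  = np.where(picked_smaller, x, 2 * x)      # amount in the opened envelope
other = np.where(picked_smaller, 2 * x, x)      # amount in the unopened envelope

# Condition on having seen roughly $30: under this prior, is switching worth it?
mask = np.abs(seen - 30) < 0.5
print(seen[mask].mean(), other[mask].mean())    # ~30 vs. ~37: with this prior, switch
```

Under this assumed prior, switching is favorable for small observed amounts (here, roughly $X < 40\ln 4 \approx \$55$) and unfavorable for larger ones; a different prior gives a different cutoff, which is exactly the point.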

How much is a half cent?

Have a wonderful Wednesday!