Applied Bayesian Inference, KSU, April 29, 2012
§❷ An Introduction to Bayesian Inference
Robert J. Tempelman
Bayes Theorem
• Recall a basic axiom of probability: $f(\boldsymbol{\theta}, \mathbf{y}) = f(\mathbf{y}\,|\,\boldsymbol{\theta})\, f(\boldsymbol{\theta})$
• Also: $f(\boldsymbol{\theta}, \mathbf{y}) = f(\boldsymbol{\theta}\,|\,\mathbf{y})\, f(\mathbf{y})$
• Combine both expressions to get:
$$f(\boldsymbol{\theta}\,|\,\mathbf{y}) = \frac{f(\mathbf{y}\,|\,\boldsymbol{\theta})\, f(\boldsymbol{\theta})}{f(\mathbf{y})}$$
or
$$f(\boldsymbol{\theta}\,|\,\mathbf{y}) \propto f(\mathbf{y}\,|\,\boldsymbol{\theta})\, f(\boldsymbol{\theta})$$
Posterior $\propto$ Likelihood $\times$ Prior
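One step the slide leaves implicit (added here for completeness): the normalizing denominator $f(\mathbf{y})$ is the marginal likelihood, obtained by integrating $\boldsymbol{\theta}$ out of the joint density; since it does not involve $\boldsymbol{\theta}$, it can be absorbed into the proportionality constant:
$$f(\mathbf{y}) = \int f(\mathbf{y}\,|\,\boldsymbol{\theta})\, f(\boldsymbol{\theta})\, d\boldsymbol{\theta}$$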
Prior densities/distributions
• What can we specify for $f(\boldsymbol{\theta})$?
– Anything that reflects our prior beliefs.
– Common choice: a "conjugate" prior.
• $f(\boldsymbol{\theta})$ is chosen such that the posterior $f(\boldsymbol{\theta}\,|\,\mathbf{y})$ is recognizable and of the same form.
– "Flat" prior: $f(\boldsymbol{\theta}) \propto \text{constant}$. Then
$$f(\boldsymbol{\theta}\,|\,\mathbf{y}) \propto f(\mathbf{y}\,|\,\boldsymbol{\theta})\, f(\boldsymbol{\theta}) \propto f(\mathbf{y}\,|\,\boldsymbol{\theta})$$
– Flat priors can be dangerous… they can lead to an improper posterior $f(\boldsymbol{\theta}\,|\,\mathbf{y})$; i.e.,
$$\int f(\boldsymbol{\theta}\,|\,\mathbf{y})\, d\boldsymbol{\theta} \rightarrow \infty$$
Prior information / Objective?
• Introducing prior information may somewhat "bias" sample information; nevertheless, ignoring existing prior information is inconsistent with:
– 1) human rational behavior,
– 2) the nature of the scientific method.
– Memory property: past inference (a posterior) can be used as an updated prior in future inference.
• Nevertheless, many applied Bayesian data analysts try to be as "objective" as possible by using diffuse (e.g., flat) priors.
Example of conjugate prior
• Recall the binomial distribution:
$$\Pr(Y = y\,|\,n, p) = \frac{n!}{y!\,(n-y)!}\; p^{y}\, (1-p)^{n-y}$$
• Suppose we express prior belief on $p$ using a beta distribution:
$$f(p\,|\,\alpha, \beta) \propto p^{\alpha-1}\, (1-p)^{\beta-1}$$
– Denoted as Beta($\alpha, \beta$)
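As an illustrative sketch (my addition, not part of the original slides), the following SAS step tabulates the three beta prior densities plotted on the next slide using the built-in PDF function, then overlays them with PROC SGPLOT:

data beta_densities;
   do p = 0 to 1 by 0.01;
      dens_9_1  = pdf('beta', p, 9, 1);    /* Beta(9,1)               */
      dens_1_1  = pdf('beta', p, 1, 1);    /* Beta(1,1): flat, proper */
      dens_2_18 = pdf('beta', p, 2, 18);   /* Beta(2,18)              */
      output;
   end;
run;

proc sgplot data=beta_densities;
   series x=p y=dens_9_1;
   series x=p y=dens_1_1;
   series x=p y=dens_2_18;
run;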
Examples of different beta densities
[Figure: beta densities over $p \in (0,1)$; legend: $\alpha=9, \beta=1$; $\alpha=1, \beta=1$; $\alpha=2, \beta=18$. The Beta(1,1) case is a diffuse (flat) bounded prior (but it is proper since it is bounded!)]
$$E(p\,|\,\alpha, \beta) = \frac{\alpha}{\alpha + \beta}$$
$$\text{var}(p\,|\,\alpha, \beta) = \frac{\alpha\beta}{(\alpha + \beta)^2 (\alpha + \beta + 1)}$$
Posterior density of p
• Posterior $\propto$ Likelihood $\times$ Prior:
$$f(p\,|\,n, y, \alpha, \beta) \propto \Pr(Y = y\,|\,n, p)\; f(p\,|\,\alpha, \beta)$$
$$\propto p^{y}(1-p)^{n-y}\; p^{\alpha-1}(1-p)^{\beta-1}$$
$$= p^{y+\alpha-1}\,(1-p)^{n-y+\beta-1}$$
• i.e., Beta($y+\alpha,\; n-y+\beta$)
• The beta is conjugate to the binomial.
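A minimal SAS sketch (my addition) of this conjugate update: given $y$ successes in $n$ trials and a Beta($\alpha, \beta$) prior, the posterior parameters and posterior mean follow directly. The three priors are the ones used on the next slide:

data post;
   y = 10; n = 15;
   do prior = 1 to 3;
      if prior = 1 then do; alpha = 1; beta = 1;  end;
      else if prior = 2 then do; alpha = 9; beta = 1;  end;
      else do; alpha = 2; beta = 18; end;
      post_alpha = y + alpha;        /* posterior is Beta(y+alpha, n-y+beta) */
      post_beta  = n - y + beta;
      post_mean  = post_alpha/(post_alpha + post_beta);
      output;
   end;
run;

proc print data=post;
   var alpha beta post_alpha post_beta post_mean;
run;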
Suppose we observe data
• y = 10, n = 15.
• Consider three alternative priors:
– Beta(1,1)
– Beta(9,1)
– Beta(2,18)
• Posterior densities: Beta($y+\alpha,\; n-y+\beta$)
[Figure: the three posterior densities over $p$; legend: $\alpha=19, \beta=6$; $\alpha=11, \beta=6$; $\alpha=12, \beta=23$]
Suppose we observed a larger dataset
• y = 100, n = 150.
• Consider the same alternative priors:
– Beta(1,1)
– Beta(9,1)
– Beta(2,18)
• Posterior densities
[Figure: the three posterior densities over $p$; legend: $\alpha=109, \beta=51$; $\alpha=101, \beta=51$; $\alpha=102, \beta=68$]
Posterior information
• Given:
$$\ln f(\boldsymbol{\theta}\,|\,\mathbf{y}) = \text{constant} + \ln f(\mathbf{y}\,|\,\boldsymbol{\theta}) + \ln f(\boldsymbol{\theta})$$
$$-\frac{\partial^2 \ln f(\boldsymbol{\theta}\,|\,\mathbf{y})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}'} = -\frac{\partial^2 \ln f(\mathbf{y}\,|\,\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}'} - \frac{\partial^2 \ln f(\boldsymbol{\theta})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}'}$$
• Posterior information = likelihood information + prior information.
• One option for a point estimate: the joint posterior mode of $\boldsymbol{\theta}$ using Newton-Raphson.
– Also called the MAP (maximum a posteriori) estimate of $\boldsymbol{\theta}$.
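For completeness (a standard detail the slide leaves implicit), the Newton-Raphson iteration for the posterior mode updates the current value using the gradient and Hessian of the log-posterior:

$$\boldsymbol{\theta}^{[t+1]} = \boldsymbol{\theta}^{[t]} - \left[ \frac{\partial^2 \ln f(\boldsymbol{\theta}\,|\,\mathbf{y})}{\partial \boldsymbol{\theta}\, \partial \boldsymbol{\theta}'} \right]^{-1}_{\boldsymbol{\theta} = \boldsymbol{\theta}^{[t]}} \left. \frac{\partial \ln f(\boldsymbol{\theta}\,|\,\mathbf{y})}{\partial \boldsymbol{\theta}} \right|_{\boldsymbol{\theta} = \boldsymbol{\theta}^{[t]}}$$

This is exactly the `theta = theta + firstder/(-secndder)` step coded in the SAS program two slides ahead.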
Recall the plant genetic linkage example
• Recall:
$$p(\mathbf{y}\,|\,\theta) = \frac{n!}{y_1!\, y_2!\, y_3!\, y_4!} \left( \frac{2+\theta}{4} \right)^{y_1} \left( \frac{1-\theta}{4} \right)^{y_2} \left( \frac{1-\theta}{4} \right)^{y_3} \left( \frac{\theta}{4} \right)^{y_4}$$
• Suppose
$$f(\theta\,|\,\alpha, \beta) \propto \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
• Then
$$f(\theta\,|\,\mathbf{y}, \alpha, \beta) \propto p(\mathbf{y}\,|\,\theta)\, f(\theta\,|\,\alpha, \beta) \propto \left( \frac{2+\theta}{4} \right)^{y_1} \left( \frac{1-\theta}{4} \right)^{y_2+y_3} \left( \frac{\theta}{4} \right)^{y_4} \theta^{\alpha-1} (1-\theta)^{\beta-1}$$
$$\propto (2+\theta)^{y_1}\, (1-\theta)^{y_2+y_3+\beta-1}\, \theta^{y_4+\alpha-1}$$
• Almost as if you increased the number of plants in genotypes 2 and 3 by $\beta - 1$ … and in genotype 4 by $\alpha - 1$.
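Writing out the log-posterior and its first two derivatives (a routine step, spelled out here because the SAS program on the next slide codes them directly):

$$\ln f(\theta\,|\,\mathbf{y}, \alpha, \beta) = \text{constant} + y_1 \ln(2+\theta) + (y_2+y_3+\beta-1)\ln(1-\theta) + (y_4+\alpha-1)\ln\theta$$
$$\frac{\partial \ln f}{\partial \theta} = \frac{y_1}{2+\theta} - \frac{y_2+y_3+\beta-1}{1-\theta} + \frac{y_4+\alpha-1}{\theta}$$
$$\frac{\partial^2 \ln f}{\partial \theta^2} = -\frac{y_1}{(2+\theta)^2} - \frac{y_2+y_3+\beta-1}{(1-\theta)^2} - \frac{y_4+\alpha-1}{\theta^2}$$

These correspond to `logpost`, `firstder`, and `secndder` in the program.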
Plant linkage example cont'd.
• Suppose $\theta \sim$ Beta($\alpha = 50$, $\beta = 500$):

data newton;
   y1 = 1997; y2 = 906; y3 = 904; y4 = 32;
   alpha = 50; beta = 500;
   theta = 0.01;   /* try starting value of 0.50 too */
   do iterate = 1 to 10;
      logpost  = y1*log(2+theta) + (y2+y3+beta-1)*log(1-theta)
               + (y4+alpha-1)*log(theta);
      firstder = y1/(2+theta) - (y2+y3+beta-1)/(1-theta)
               + (y4+alpha-1)/theta;
      secndder = -y1/(2+theta)**2 - (y2+y3+beta-1)/(1-theta)**2
               - (y4+alpha-1)/theta**2;
      theta = theta + firstder/(-secndder);   /* Newton-Raphson update */
      output;
   end;
   asyvar = 1/(-secndder);   /* asymptotic variance of theta_hat at convergence */
   poststd = sqrt(asyvar);
   call symputx("poststd", poststd);
   output;   /* final record, hence 11 observations printed */
run;

title "Posterior Standard Error = &poststd";
proc print;
   var iterate theta logpost;
run;

• Posterior standard error:
$$sd(\hat{\theta}\,|\,\mathbf{y}) \approx \left( -\left. \frac{\partial^2 \ln f(\theta\,|\,\mathbf{y})}{\partial \theta^2} \right|_{\theta = \hat{\theta}} \right)^{-1/2}$$
Output
Posterior Standard Error = 0.0057929339

Obs   iterate     theta     logpost
  1      1      0.018318     997.95
  2      2      0.030841    1035.74
  3      3      0.044771    1060.65
  4      4      0.053261    1071.06
  5      5      0.054986    1072.79
  6      6      0.055037    1072.84
  7      7      0.055037    1072.84
  8      8      0.055037    1072.84
  9      9      0.055037    1072.84
 10     10      0.055037    1072.84
 11     11      0.055037    1072.84
Additional elements of Bayesian inference
• Suppose that $\boldsymbol{\theta}$ can be partitioned into two components: a $p \times 1$ vector $\boldsymbol{\theta}_1$ and a $q \times 1$ vector $\boldsymbol{\theta}_2$.
• If you want to make probability statements about $\boldsymbol{\theta}$, use probability calculus:
$$\Pr(a < \theta < b\,|\,\mathbf{y}) = \int_{a}^{b} p(\theta\,|\,\mathbf{y})\, d\theta$$
• There is NO repeated-sampling concept.
– Condition on the one observed dataset.
– However, Bayes estimators typically do have very good frequentist properties!
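As a sketch (my addition), such probability statements are one line of SAS for the conjugate beta posterior from the earlier example, since the posterior CDF is available in closed form:

data postprob;
   y = 10; n = 15;
   alpha = 1; beta = 1;      /* flat Beta(1,1) prior       */
   a = 0.5; b = 0.8;         /* interval of interest for p */
   /* posterior is Beta(y+alpha, n-y+beta) */
   prob = cdf('beta', b, y+alpha, n-y+beta)
        - cdf('beta', a, y+alpha, n-y+beta);
   put a= b= prob=;          /* Pr(a < p < b | y), to the log */
run;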
Marginal vs. conditional inference
• Suppose you're primarily interested in $\boldsymbol{\theta}_1$:
$$p(\boldsymbol{\theta}_1\,|\,\mathbf{y}) = \int p(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2\,|\,\mathbf{y})\, d\boldsymbol{\theta}_2 = \int p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2, \mathbf{y})\, p(\boldsymbol{\theta}_2\,|\,\mathbf{y})\, d\boldsymbol{\theta}_2 = E_{\boldsymbol{\theta}_2|\mathbf{y}}\!\left[ p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2, \mathbf{y}) \right]$$
– i.e., average over the uncertainty in $\boldsymbol{\theta}_2$ (the nuisance parameters).
• Of course, if $\boldsymbol{\theta}_2$ were known, you would condition your inference on it accordingly:
$$p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2, \mathbf{y})$$
Two-stage model example
• Given $\mathbf{y}' = \begin{bmatrix} y_1 & y_2 & \cdots & y_n \end{bmatrix}$ with $y_i \sim \text{NIID}(\mu, \sigma^2)$, where $\sigma^2$ is known. We wish to infer $\mu$. From Bayes theorem:
$$f(\mu\,|\,\mathbf{y}, \mu_a, \sigma_a^2) \propto f(\mathbf{y}\,|\,\mu, \sigma^2)\, f(\mu\,|\,\mu_a, \sigma_a^2)$$
• Suppose $\mu \sim N(\mu_a, \sigma_a^2)$, i.e.
$$f(\mu\,|\,\mu_a, \sigma_a^2) = \frac{1}{\sqrt{2\pi \sigma_a^2}} \exp\!\left( -\frac{1}{2\sigma_a^2} (\mu - \mu_a)^2 \right)$$
Simplify likelihood
$$f(\mathbf{y}\,|\,\mu, \sigma^2) = \prod_{i=1}^{n} f(y_i\,|\,\mu, \sigma^2) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \mu)^2 \right)$$
$$\propto \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \bar{y} + \bar{y} - \mu)^2 \right), \qquad \bar{y} = \frac{\sum_{i=1}^{n} y_i}{n}$$
$$= \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left[ (y_i - \bar{y})^2 + 2(y_i - \bar{y})(\bar{y} - \mu) + (\bar{y} - \mu)^2 \right] \right)$$
$$= \exp\!\left( -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - \bar{y})^2 \right) \exp\!\left( -\frac{n}{2\sigma^2} (\bar{y} - \mu)^2 \right) \qquad \text{since } \sum_{i=1}^{n} (y_i - \bar{y}) = 0$$
$$\propto \exp\!\left( -\frac{n}{2\sigma^2} (\bar{y} - \mu)^2 \right)$$
Posterior density
$$f(\mu\,|\,\mathbf{y}, \sigma^2, \mu_a, \sigma_a^2) \propto f(\mathbf{y}\,|\,\mu, \sigma^2)\, f(\mu\,|\,\mu_a, \sigma_a^2) \propto \exp\!\left( -\frac{(\bar{y} - \mu)^2}{2\sigma^2/n} \right) \exp\!\left( -\frac{(\mu - \mu_a)^2}{2\sigma_a^2} \right)$$
• Consider the following limit:
$$\lim_{\sigma_a^2 \to \infty} f(\mu\,|\,\mathbf{y}, \sigma^2, \mu_a, \sigma_a^2) \propto \exp\!\left( -\frac{(\bar{y} - \mu)^2}{2\sigma^2/n} \right)$$
• Consistent with $f(\mu) \propto \text{constant}$ or $f(\mu) = 1$:
$$\mu\,|\,\mathbf{y}, \sigma^2 \sim N\!\left( \bar{y},\, \frac{\sigma^2}{n} \right)$$
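A tiny SAS sketch (my addition, with made-up numbers) of this flat-prior result: the posterior mean is just the sample mean and the posterior standard deviation is $\sigma/\sqrt{n}$:

data flatprior;
   ybar   = 4.2;   /* hypothetical sample mean */
   n      = 25;    /* hypothetical sample size */
   sigma2 = 9;     /* known residual variance  */
   post_mean = ybar;             /* posterior mean under flat prior */
   post_sd   = sqrt(sigma2/n);   /* = sigma/sqrt(n) = 0.6           */
   put post_mean= post_sd=;
run;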
Interpretation of Posterior Density with Flat Prior
• So
$$f(\mu\,|\,\mathbf{y}, \sigma^2) \propto f(\mathbf{y}\,|\,\mu, \sigma^2)\, f(\mu) \propto f(\mathbf{y}\,|\,\mu, \sigma^2)$$
• Then
$$\operatorname*{ArgMax}_{\mu}\, f(\mu\,|\,\mathbf{y}, \sigma^2) = \operatorname*{ArgMax}_{\mu}\, f(\mathbf{y}\,|\,\mu, \sigma^2)$$
• i.e.
$$\text{Posterior mode}(\mu\,|\,\mathbf{y}, \sigma^2) = \text{ML}(\mu\,|\,\mathbf{y}, \sigma^2)$$
Posterior density with informative prior
• Now
$$f(\mu\,|\,\mathbf{y}, \sigma^2, \mu_a, \sigma_a^2) \propto \exp\!\left( -\frac{(\bar{y} - \mu)^2}{2\sigma^2/n} \right) \exp\!\left( -\frac{(\mu - \mu_a)^2}{2\sigma_a^2} \right)$$
• After algebraic simplification:
$$\mu\,|\,\mathbf{y}, \sigma^2, \mu_a, \sigma_a^2 \sim N(\tilde{\mu}, \tilde{\sigma}^2), \qquad \tilde{\mu} = \frac{\dfrac{n}{\sigma^2}\,\bar{y} + \dfrac{1}{\sigma_a^2}\,\mu_a}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_a^2}}, \qquad \tilde{\sigma}^2 = \left( \frac{n}{\sigma^2} + \frac{1}{\sigma_a^2} \right)^{-1}$$
• Note that
$$\tilde{\mu} = \left( \frac{\dfrac{n}{\sigma^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_a^2}} \right) \bar{y} + \left( \frac{\dfrac{1}{\sigma_a^2}}{\dfrac{n}{\sigma^2} + \dfrac{1}{\sigma_a^2}} \right) \mu_a$$
and
$$\frac{1}{\tilde{\sigma}^2} = \frac{n}{\sigma^2} + \frac{1}{\sigma_a^2}$$
• Posterior precision = prior precision + sample (likelihood) precision; i.e., $\tilde{\mu}$ is a weighted average of the data mean and the prior mean.
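A short SAS sketch (my addition, with illustrative numbers) of this precision weighting:

data precweight;
   ybar = 4.2; n = 25; sigma2 = 9;   /* hypothetical data summary            */
   mu_a = 2.0; sigma2_a = 1.0;       /* hypothetical N(mu_a, sigma2_a) prior */
   prec_data  = n/sigma2;            /* likelihood precision                 */
   prec_prior = 1/sigma2_a;          /* prior precision                      */
   post_var  = 1/(prec_data + prec_prior);   /* posterior variance           */
   post_mean = (prec_data*ybar + prec_prior*mu_a)
             * post_var;                     /* weighted average             */
   put post_mean= post_var=;
run;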
Hierarchical models
• Given the partition $\boldsymbol{\theta} = (\boldsymbol{\theta}_1, \boldsymbol{\theta}_2)$:
• Two stage (second-stage parameters $\boldsymbol{\theta}_2$ treated as known):
$$p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2, \mathbf{y}) \propto p(\mathbf{y}\,|\,\boldsymbol{\theta}_1)\, p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2)$$
• Three stage:
$$p(\boldsymbol{\theta}_1, \boldsymbol{\theta}_2\,|\,\mathbf{y}) \propto p(\mathbf{y}\,|\,\boldsymbol{\theta}_1)\, p(\boldsymbol{\theta}_1\,|\,\boldsymbol{\theta}_2)\, p(\boldsymbol{\theta}_2)$$
– What's the difference? When do you consider one over the other?
Simple hierarchical model
• Random effects model:
$$Y_{ij} = \mu + a_i + e_{ij}$$
$\mu$: overall mean; $a_i \sim \text{NIID}(0, \tau^2)$; $e_{ij} \sim \text{NIID}(0, \sigma^2)$.
• Suppose we knew $\mu$, $\sigma^2$, and $\tau^2$:
$$E(\mu + a_i\,|\,\mathbf{y}) = (1 - B)\,\bar{y}_i + B\,\mu$$
$$\text{Var}(\mu + a_i\,|\,\mathbf{y}) = (1 - B)\,\frac{\sigma^2}{n}$$
$$B = \frac{\sigma^2/n}{\tau^2 + \sigma^2/n} \qquad \text{(shrinkage factor)}$$
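A quick numeric illustration (my addition): with $\sigma^2 = 4$, $\tau^2 = 1$, and $n = 5$ records per group,
$$B = \frac{4/5}{1 + 4/5} = \frac{4}{9} \approx 0.44,$$
so each group mean $\bar{y}_i$ is shrunk about 44% of the way toward the overall mean $\mu$; the larger $n$ is, the smaller $B$ becomes and the less shrinkage occurs.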
What if we don't know $\mu$, $\sigma^2$, or $\tau^2$?
• Option 1: Estimate them, e.g., by the method of moments (balanced case, $k$ groups with $n$ records each):
$$\hat{\mu} = \frac{\sum_{i=1}^{k} \bar{y}_i}{k}, \qquad \hat{\sigma}^2 = \frac{\sum_{i,j} (y_{ij} - \bar{y}_i)^2}{k(n-1)}, \qquad \hat{\tau}^2 = \frac{\sum_{i=1}^{k} (\bar{y}_i - \hat{\mu})^2}{k-1} - \frac{\hat{\sigma}^2}{n}$$
• Then "plug them in":
$$E(\mu + a_i\,|\,\mathbf{y}) \approx (1 - \hat{B})\,\bar{y}_i + \hat{B}\,\hat{\mu}, \qquad \text{Var}(\mu + a_i\,|\,\mathbf{y}) \approx (1 - \hat{B})\,\frac{\hat{\sigma}^2}{n}$$
• Not truly Bayesian.
– Empirical Bayes (EB) (next section).
– Most of us using PROC MIXED/GLIMMIX are EB!
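For concreteness (my addition; the dataset and variable names are hypothetical), a random-intercept fit in PROC MIXED whose predicted random effects (EBLUPs) are exactly such plug-in, empirical Bayes predictions:

proc mixed data=mydata;                        /* hypothetical dataset        */
   class group;
   model y = / solution;                       /* fixed part: overall mean mu */
   random intercept / subject=group solution;  /* a_i ~ N(0, tau2); SOLUTION
                                                  prints the EBLUPs           */
run;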
A truly Bayesian approach
• 1) $Y_{ij}\,|\,\theta_i \sim N(\theta_i, \sigma^2)$ for all $i, j$
• 2) $\theta_1, \theta_2, \ldots, \theta_k$ are iid $N(\mu, \tau^2)$
o Structural prior (exchangeable entities)
• 3) $\mu \sim p(\mu)$; $\tau^2 \sim p(\tau^2)$; $\sigma^2 \sim p(\sigma^2)$
o Subjective prior
$$p(\theta_1, \theta_2, \ldots, \theta_k, \mu, \tau^2, \sigma^2\,|\,\mathbf{y}) \propto \left[ \prod_{i=1}^{k} \prod_{j=1}^{n_i} p(y_{ij}\,|\,\theta_i) \right] \left[ \prod_{i=1}^{k} p(\theta_i\,|\,\mu, \tau^2) \right] p(\mu)\, p(\tau^2)\, p(\sigma^2)$$
$$p(\theta_i\,|\,\mathbf{y}) = \int \!\cdots\! \int p(\theta_1, \ldots, \theta_k, \mu, \tau^2, \sigma^2\,|\,\mathbf{y})\; d\theta_1 \cdots d\theta_{i-1}\, d\theta_{i+1} \cdots d\theta_k\, d\mu\, d\tau^2\, d\sigma^2$$
• Fully Bayesian inference (next section after that!)