Conjugate Bayesian analysis of the Gaussian distribution

Kevin P. Murphy
[email protected]
Last updated October 3, 2007

1 Introduction

The Gaussian or normal distribution is one of the most widely used in statistics. Estimating its parameters using Bayesian inference and conjugate priors is also widely used. The use of conjugate priors allows all the results to be derived in closed form. Unfortunately, different books use different conventions on how to parameterize the various distributions (e.g., put the prior on the precision or the variance, use an inverse gamma or inverse chi-squared, etc.), which can be very confusing for the student. In this report, we summarize all of the most commonly used forms. We provide detailed derivations for some of these results; the rest can be obtained by simple reparameterization. See the appendix for the definition of the distributions that are used. (Thanks to Hoyt Koepke for proof reading.)

2 Normal prior

Let us consider Bayesian estimation of the mean of a univariate Gaussian, whose variance is assumed to be known. (We discuss the unknown variance case later.)

2.1 Likelihood

Let $D = (x_1, \ldots, x_n)$ be the data. The likelihood is

$$p(D|\mu, \sigma^2) = \prod_{i=1}^n p(x_i|\mu, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\Big(-\frac{1}{2\sigma^2}\sum_{i=1}^n (x_i - \mu)^2\Big) \qquad (1)$$

Let us define the empirical mean and variance

$$\bar{x} = \frac{1}{n}\sum_{i=1}^n x_i \qquad (2)$$
$$s^2 = \frac{1}{n}\sum_{i=1}^n (x_i - \bar{x})^2 \qquad (3)$$

(Note that other authors (e.g., [GCSR04]) define $s^2 = \frac{1}{n-1}\sum_{i=1}^n (x_i - \bar{x})^2$.) We can rewrite the term in the exponent as follows:

$$\sum_i (x_i - \mu)^2 = \sum_i [(x_i - \bar{x}) - (\mu - \bar{x})]^2 \qquad (4)$$
$$= \sum_i (x_i - \bar{x})^2 + \sum_i (\bar{x} - \mu)^2 - 2\sum_i (x_i - \bar{x})(\mu - \bar{x}) \qquad (5)$$
$$= n s^2 + n(\bar{x} - \mu)^2 \qquad (6)$$

since

$$\sum_i (x_i - \bar{x})(\mu - \bar{x}) = (\mu - \bar{x})\Big(\big(\textstyle\sum_i x_i\big) - n\bar{x}\Big) = (\mu - \bar{x})(n\bar{x} - n\bar{x}) = 0 \qquad (7)$$

Hence

$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi\sigma^2)^{n/2}} \exp\Big(-\frac{1}{2\sigma^2}\big[n s^2 + n(\bar{x} - \mu)^2\big]\Big) \qquad (8)$$
$$\propto \Big(\frac{1}{\sigma^2}\Big)^{n/2} \exp\Big(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\Big) \exp\Big(-\frac{n s^2}{2\sigma^2}\Big) \qquad (9)$$

If $\sigma^2$ is a constant, we can write this as

$$p(D|\mu) \propto \exp\Big(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\Big) \propto N\Big(\bar{x}\Big|\mu, \frac{\sigma^2}{n}\Big) \qquad (10)$$

since we are free to drop constant factors in the definition of the likelihood. Thus $n$ observations with variance $\sigma^2$ and mean $\bar{x}$ are equivalent to one observation $x_1 = \bar{x}$ with variance $\sigma^2/n$.

2.2 Prior

Since the likelihood has the form

$$p(D|\mu) \propto \exp\Big(-\frac{n}{2\sigma^2}(\bar{x} - \mu)^2\Big) \propto N\Big(\bar{x}\Big|\mu, \frac{\sigma^2}{n}\Big) \qquad (11)$$

the natural conjugate prior has the form

$$p(\mu) \propto \exp\Big(-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\Big) \propto N(\mu|\mu_0, \sigma_0^2) \qquad (12)$$

(Do not confuse $\sigma_0^2$, which is the variance of the prior, with $\sigma^2$, which is the variance of the observation noise.) (A natural conjugate prior is one that has the same form as the likelihood.)

2.3 Posterior

Hence the posterior is given by

$$p(\mu|D) \propto p(D|\mu, \sigma)\, p(\mu|\mu_0, \sigma_0^2) \qquad (13)$$
$$= \exp\Big[-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2\Big] \exp\Big[-\frac{1}{2\sigma_0^2}(\mu - \mu_0)^2\Big] \qquad (14)$$
$$= \exp\Big[-\frac{1}{2\sigma^2}\sum_i (x_i^2 + \mu^2 - 2 x_i\mu) - \frac{1}{2\sigma_0^2}(\mu^2 + \mu_0^2 - 2\mu_0\mu)\Big] \qquad (15)$$

Since the product of two Gaussians is a Gaussian, we will rewrite this in the form

$$p(\mu|D) \propto \exp\Big[-\frac{\mu^2}{2}\Big(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\Big) + \mu\Big(\frac{\mu_0}{\sigma_0^2} + \frac{\sum_i x_i}{\sigma^2}\Big) - \Big(\frac{\mu_0^2}{2\sigma_0^2} + \frac{\sum_i x_i^2}{2\sigma^2}\Big)\Big] \qquad (16)$$
$$\stackrel{\mathrm{def}}{=} \exp\Big[-\frac{1}{2\sigma_n^2}\big(\mu^2 - 2\mu\mu_n + \mu_n^2\big)\Big] = \exp\Big[-\frac{1}{2\sigma_n^2}(\mu - \mu_n)^2\Big] \qquad (17)$$

Matching coefficients of $\mu^2$, we find $\sigma_n^2$ is given by

$$\frac{-\mu^2}{2\sigma_n^2} = \frac{-\mu^2}{2}\Big(\frac{1}{\sigma_0^2} + \frac{n}{\sigma^2}\Big) \qquad (18)$$
$$\frac{1}{\sigma_n^2} = \frac{1}{\sigma_0^2} + \frac{n}{\sigma^2} \qquad (19)$$
$$\sigma_n^2 = \frac{\sigma^2\sigma_0^2}{n\sigma_0^2 + \sigma^2} = \frac{1}{\frac{n}{\sigma^2} + \frac{1}{\sigma_0^2}} \qquad (20)$$

Figure 1: Sequentially updating a Gaussian mean starting with a prior centered on $\mu_0 = 0$. The true parameters are $\mu^* = 0.8$ (unknown) and $(\sigma^2)^* = 0.1$ (known); the curves show the posterior after $N = 0, 1, 2, 10$ observations. Notice how the data quickly overwhelms the prior, and how the posterior becomes narrower. Source: Figure 2.12 of [Bis06].

Matching coefficients of $\mu$ we get

$$\frac{2\mu\mu_n}{2\sigma_n^2} = \mu\Big(\frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2}\Big) \qquad (21)$$
$$\frac{\mu_n}{\sigma_n^2} = \frac{\sum_{i=1}^n x_i}{\sigma^2} + \frac{\mu_0}{\sigma_0^2} \qquad (22)$$
$$= \frac{\sigma_0^2\, n\bar{x} + \sigma^2\mu_0}{\sigma^2\sigma_0^2} \qquad (23)$$

Hence

$$\mu_n = \frac{\sigma^2}{n\sigma_0^2 + \sigma^2}\mu_0 + \frac{n\sigma_0^2}{n\sigma_0^2 + \sigma^2}\bar{x} = \sigma_n^2\Big(\frac{\mu_0}{\sigma_0^2} + \frac{n\bar{x}}{\sigma^2}\Big) \qquad (24)$$

This operation of matching first and second powers of $\mu$ is called completing the square.

Another way to understand these results is to work with the precision of a Gaussian, which is 1/variance (high precision means low variance, low precision means high variance). Let

$$\lambda = 1/\sigma^2 \qquad (25)$$
$$\lambda_0 = 1/\sigma_0^2 \qquad (26)$$
$$\lambda_n = 1/\sigma_n^2 \qquad (27)$$

Then we can rewrite the posterior as

$$p(\mu|D, \lambda) = N(\mu|\mu_n, \lambda_n) \qquad (28)$$
$$\lambda_n = \lambda_0 + n\lambda \qquad (29)$$
$$\mu_n = \frac{\bar{x}\, n\lambda + \lambda_0\mu_0}{\lambda_n} = w\,\mu_{ML} + (1 - w)\,\mu_0 \qquad (30)$$

Figure 2: Bayesian estimation of the mean of a Gaussian from one sample. (a) Weak prior $N(0, 10)$. (b) Strong prior $N(0, 1)$. In the latter case, we see the posterior mean is shrunk towards the prior mean, which is 0. Figure produced by gaussBayesDemo.

where $n\bar{x} = \sum_{i=1}^n x_i$ and $w = \frac{n\lambda}{\lambda_n}$. The precision of the posterior, $\lambda_n$, is the precision of the prior, $\lambda_0$, plus one contribution of data precision $\lambda$ for each observed data point. Also, we see the mean of the posterior is a convex combination of the prior and the MLE, with weights proportional to the relative precisions.

To gain further insight into these equations, consider the effect of sequentially updating our estimate of $\mu$ (see Figure 1). After observing one data point $x$ (so $n = 1$), we have the following posterior mean

$$\mu_1 = \frac{\sigma^2}{\sigma^2 + \sigma_0^2}\mu_0 + \frac{\sigma_0^2}{\sigma^2 + \sigma_0^2}x \qquad (31)$$
$$= \mu_0 + (x - \mu_0)\frac{\sigma_0^2}{\sigma^2 + \sigma_0^2} \qquad (32)$$
$$= x - (x - \mu_0)\frac{\sigma^2}{\sigma^2 + \sigma_0^2} \qquad (33)$$

The first equation is a convex combination of the prior and MLE. The second equation is the prior mean adjusted towards the data $x$. The third equation is the data $x$ adjusted towards the prior mean; this is called shrinkage. These are all equivalent ways of expressing the tradeoff between likelihood and prior. See Figure 2 for an example.
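The update above is easy to turn into code. The following is a minimal sketch (ours, in Python/NumPy, not part of the original report) of Equations 19, 20 and 24; the function name and the test values are illustrative only.

```python
import numpy as np

def posterior_mean_known_variance(x, sigma2, mu0, sigma2_0):
    """Posterior N(mu | mu_n, sigma2_n) for a Gaussian mean with known variance.

    Implements 1/sigma2_n = 1/sigma2_0 + n/sigma2 (Eq. 19-20) and
    mu_n = sigma2_n * (mu0/sigma2_0 + n*xbar/sigma2) (Eq. 24).
    """
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    sigma2_n = 1.0 / (1.0 / sigma2_0 + n / sigma2)
    mu_n = sigma2_n * (mu0 / sigma2_0 + n * xbar / sigma2)
    return mu_n, sigma2_n

# Example: prior centred at 0, data generated near 0.8 with variance 0.1 (cf. Figure 1).
rng = np.random.default_rng(0)
x = rng.normal(0.8, np.sqrt(0.1), size=10)
print(posterior_mean_known_variance(x, sigma2=0.1, mu0=0.0, sigma2_0=1.0))
```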

2.4 Posterior predictive

The posterior predictive is given by

$$p(x|D) = \int p(x|\mu)\, p(\mu|D)\, d\mu \qquad (34)$$
$$= \int N(x|\mu, \sigma^2)\, N(\mu|\mu_n, \sigma_n^2)\, d\mu \qquad (35)$$
$$= N(x|\mu_n, \sigma_n^2 + \sigma^2) \qquad (36)$$

This follows from general properties of the Gaussian distribution (see Equation 2.115 of [Bis06]). An alternative proof is to note that

$$x = (x - \mu) + \mu \qquad (37)$$
$$x - \mu \sim N(0, \sigma^2) \qquad (38)$$
$$\mu \sim N(\mu_n, \sigma_n^2) \qquad (39)$$

Since $E[X_1 + X_2] = E[X_1] + E[X_2]$ and $\mathrm{Var}[X_1 + X_2] = \mathrm{Var}[X_1] + \mathrm{Var}[X_2]$ if $X_1, X_2$ are independent, we have

$$X \sim N(\mu_n, \sigma_n^2 + \sigma^2) \qquad (40)$$

since we assume that the residual error is conditionally independent of the parameter. Thus the predictive variance is the uncertainty due to the observation noise, $\sigma^2$, plus the uncertainty due to the parameters, $\sigma_n^2$.

2.5 Marginal likelihood

Writing $m = \mu_0$ and $\tau^2 = \sigma_0^2$ for the hyper-parameters, we can derive the marginal likelihood as follows:

$$\ell = p(D|m, \sigma^2, \tau^2) = \int \Big[\prod_{i=1}^n N(x_i|\mu, \sigma^2)\Big] N(\mu|m, \tau^2)\, d\mu \qquad (41)$$
$$= \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\Big(-\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2}\Big) \exp\Big(\frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)}\Big) \qquad (42)$$

The proof is below, based on the appendix of [DMP+06]. We have

$$\ell = p(D|m, \sigma^2, \tau^2) = \int \Big[\prod_{i=1}^n N(x_i|\mu, \sigma^2)\Big] N(\mu|m, \tau^2)\, d\mu \qquad (43)$$
$$= \frac{1}{(\sigma\sqrt{2\pi})^n(\tau\sqrt{2\pi})}\int \exp\Big(-\frac{1}{2\sigma^2}\sum_i (x_i - \mu)^2 - \frac{1}{2\tau^2}(\mu - m)^2\Big)\, d\mu \qquad (44)$$

Let us define $S^2 = 1/\sigma^2$ and $T^2 = 1/\tau^2$. Then

$$\ell = \frac{1}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)}\int \exp\Big(-\frac{S^2}{2}\big(\textstyle\sum_i x_i^2 + n\mu^2 - 2\mu\sum_i x_i\big) - \frac{T^2}{2}\big(\mu^2 + m^2 - 2\mu m\big)\Big)\, d\mu \qquad (45)$$
$$= c\int \exp\Big(-\frac{1}{2}\big(S^2 n\mu^2 - 2S^2\mu\textstyle\sum_i x_i + T^2\mu^2 - 2T^2\mu m\big)\Big)\, d\mu \qquad (46)$$

where

$$c = \frac{\exp\big(-\frac{1}{2}(S^2\sum_i x_i^2 + T^2 m^2)\big)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)} \qquad (47)$$

So

$$\ell = c\int \exp\Big[-\frac{1}{2}(S^2 n + T^2)\Big(\mu^2 - 2\frac{S^2\sum_i x_i + T^2 m}{S^2 n + T^2}\mu\Big)\Big]\, d\mu \qquad (48)$$
$$= c\,\exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big)\int \exp\Big[-\frac{1}{2}(S^2 n + T^2)\Big(\mu - \frac{S^2 n\bar{x} + T^2 m}{S^2 n + T^2}\Big)^2\Big]\, d\mu \qquad (49)$$
$$= c\,\exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big)\frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \qquad (50)$$
$$= \frac{\exp\big(-\frac{1}{2}(S^2\sum_i x_i^2 + T^2 m^2)\big)}{(\sqrt{2\pi}/S)^n(\sqrt{2\pi}/T)}\exp\Big(\frac{(S^2 n\bar{x} + T^2 m)^2}{2(S^2 n + T^2)}\Big)\frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} \qquad (51)$$

Now

$$\frac{1}{\sqrt{2\pi}/T}\frac{\sqrt{2\pi}}{\sqrt{S^2 n + T^2}} = \frac{\sigma}{\sqrt{n\tau^2 + \sigma^2}} \qquad (52)$$

and

$$\frac{\big(\frac{n\bar{x}}{\sigma^2} + \frac{m}{\tau^2}\big)^2}{2\big(\frac{n}{\sigma^2} + \frac{1}{\tau^2}\big)} = \frac{(n\bar{x}\tau^2 + m\sigma^2)^2}{2\sigma^2\tau^2(n\tau^2 + \sigma^2)} \qquad (53)$$
$$= \frac{\frac{n^2\bar{x}^2\tau^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)} \qquad (54)$$

So

$$p(D) = \frac{\sigma}{(\sqrt{2\pi}\sigma)^n\sqrt{n\tau^2 + \sigma^2}} \exp\Big(-\frac{\sum_i x_i^2}{2\sigma^2} - \frac{m^2}{2\tau^2}\Big) \exp\Big(\frac{\frac{\tau^2 n^2\bar{x}^2}{\sigma^2} + \frac{\sigma^2 m^2}{\tau^2} + 2n\bar{x}m}{2(n\tau^2 + \sigma^2)}\Big) \qquad (55)$$

To check this, we should ensure that we get

$$p(x|D) = \frac{p(x, D)}{p(D)} = N(x|\mu_n, \sigma_n^2 + \sigma^2) \qquad (56)$$

(To be completed)
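One way to check Equation 55 numerically: the following sketch (our own Python code, not from the report) evaluates its logarithm and compares it against brute-force quadrature of the likelihood times the prior; the function name and test values are illustrative.

```python
import numpy as np
from scipy import integrate, stats

def log_marglik(x, sigma2, m, tau2):
    """Log of Equation 55: marginal likelihood with known variance and N(m, tau2) prior on mu."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    log_pref = 0.5 * np.log(sigma2) - n * np.log(np.sqrt(2 * np.pi * sigma2)) \
               - 0.5 * np.log(n * tau2 + sigma2)
    quad_term = (tau2 * n**2 * xbar**2 / sigma2 + sigma2 * m**2 / tau2 + 2 * n * xbar * m) \
                / (2 * (n * tau2 + sigma2))
    return log_pref - np.sum(x**2) / (2 * sigma2) - m**2 / (2 * tau2) + quad_term

rng = np.random.default_rng(1)
x = rng.normal(0.5, 1.0, size=5)
sigma2, m, tau2 = 1.0, 0.0, 4.0
integrand = lambda mu: np.exp(np.sum(stats.norm.logpdf(x, mu, np.sqrt(sigma2)))) \
                       * stats.norm.pdf(mu, m, np.sqrt(tau2))
val, _ = integrate.quad(integrand, -20, 20)
print(log_marglik(x, sigma2, m, tau2), np.log(val))   # the two should agree closely
```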

2.6 Conditional prior $p(\mu|\sigma^2)$

Note that the previous prior is not, strictly speaking, conjugate, since it has the form $p(\mu)$ whereas the posterior has the form $p(\mu|D, \sigma)$, i.e., $\sigma$ occurs in the posterior but not the prior. We can rewrite the prior in conditional form as follows

$$p(\mu|\sigma) = N(\mu|\mu_0, \sigma^2/\kappa_0) \qquad (57)$$

This means that if $\sigma^2$ is large, the variance on the prior of $\mu$ is also large. This is reasonable since $\sigma^2$ defines the measurement scale of $x$, so the prior belief about $\mu$ is equivalent to $\kappa_0$ observations of $\mu_0$ on this scale. (Hence a noninformative prior is $\kappa_0 = 0$.) Then the posterior is

$$p(\mu|D) = N(\mu|\mu_n, \sigma^2/\kappa_n) \qquad (58)$$

where $\kappa_n = \kappa_0 + n$. In this form, it is clear that $\kappa_0$ plays a role analogous to $n$. Hence $\kappa_0$ is the equivalent sample size of the prior.

2.7 Reference analysis

To get an uninformative prior, we just set the prior variance to infinity to simulate a uniform prior on $\mu$.

$$p(\mu) \propto 1 = N(\mu|\cdot, \infty) \qquad (59)$$
$$p(\mu|D) = N(\mu|\bar{x}, \sigma^2/n) \qquad (60)$$

3 Normal-Gamma prior

We will now suppose that both the mean $\mu$ and the precision $\lambda = \sigma^{-2}$ are unknown. We will mostly follow the notation in [DeG70, p169].

3.1 Likelihood

The likelihood can be written in this form

$$p(D|\mu, \lambda) = \frac{1}{(2\pi)^{n/2}}\lambda^{n/2}\exp\Big(-\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) \qquad (61)$$
$$= \frac{1}{(2\pi)^{n/2}}\lambda^{n/2}\exp\Big(-\frac{\lambda}{2}\Big[n(\mu - \bar{x})^2 + \sum_{i=1}^n (x_i - \bar{x})^2\Big]\Big) \qquad (62)$$

3.2 Prior

The conjugate prior is the normal-Gamma:

$$NG(\mu, \lambda|\mu_0, \kappa_0, \alpha_0, \beta_0) \stackrel{\mathrm{def}}{=} N(\mu|\mu_0, (\kappa_0\lambda)^{-1})\, Ga(\lambda|\alpha_0, \mathrm{rate} = \beta_0) \qquad (63)$$
$$= \frac{1}{Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0)}\lambda^{\frac{1}{2}}\exp\Big(-\frac{\kappa_0\lambda}{2}(\mu - \mu_0)^2\Big)\lambda^{\alpha_0 - 1}e^{-\lambda\beta_0} \qquad (64)$$
$$= \frac{1}{Z_{NG}}\lambda^{\alpha_0 - \frac{1}{2}}\exp\Big(-\frac{\lambda}{2}\big[\kappa_0(\mu - \mu_0)^2 + 2\beta_0\big]\Big) \qquad (65)$$
$$Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0) = \frac{\Gamma(\alpha_0)}{\beta_0^{\alpha_0}}\Big(\frac{2\pi}{\kappa_0}\Big)^{\frac{1}{2}} \qquad (66)$$

Figure 3: Some Normal-Gamma distributions: $NG(\mu_0{=}2.0, a{=}1.0, b{=}1.0)$, $NG(\mu_0{=}2.0, a{=}3.0, b{=}1.0)$, $NG(\mu_0{=}2.0, a{=}5.0, b{=}1.0)$, $NG(\mu_0{=}2.0, a{=}5.0, b{=}3.0)$. Produced by NGplot2.

See Figure 3 for some plots. We can compute the prior marginal on $\mu$ as follows:

$$p(\mu) \propto \int_0^\infty p(\mu, \lambda)\, d\lambda \qquad (67)$$
$$= \int_0^\infty \lambda^{\alpha_0 + \frac{1}{2} - 1}\exp\Big(-\lambda\Big(\beta_0 + \frac{\kappa_0(\mu - \mu_0)^2}{2}\Big)\Big)\, d\lambda \qquad (68)$$

We recognize this as an unnormalized $Ga(a = \alpha_0 + \frac{1}{2}, b = \beta_0 + \frac{\kappa_0(\mu - \mu_0)^2}{2})$ distribution, so we can just write down

$$p(\mu) \propto \Gamma(a)\, b^{-a} \qquad (69)$$
$$\propto b^{-a} \qquad (70)$$
$$= \Big(\beta_0 + \frac{\kappa_0}{2}(\mu - \mu_0)^2\Big)^{-\alpha_0 - \frac{1}{2}} \qquad (71)$$
$$\propto \Big(1 + \frac{1}{2\alpha_0}\frac{\alpha_0\kappa_0(\mu - \mu_0)^2}{\beta_0}\Big)^{-(2\alpha_0 + 1)/2} \qquad (72)$$

which we recognize as a $T_{2\alpha_0}(\mu|\mu_0, \beta_0/(\alpha_0\kappa_0))$ distribution.

3.3 Posterior

The posterior can be derived as follows.

$$p(\mu, \lambda|D) \propto NG(\mu, \lambda|\mu_0, \kappa_0, \alpha_0, \beta_0)\, p(D|\mu, \lambda) \qquad (73)$$
$$\propto \lambda^{\frac{1}{2}}e^{-\lambda\kappa_0(\mu - \mu_0)^2/2}\lambda^{\alpha_0 - 1}e^{-\lambda\beta_0}\,\lambda^{n/2}e^{-\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2} \qquad (74)$$
$$\propto \lambda^{\frac{1}{2}}\lambda^{\alpha_0 + n/2 - 1}e^{-\lambda\beta_0}e^{-\frac{\lambda}{2}[\kappa_0(\mu - \mu_0)^2 + \sum_i (x_i - \mu)^2]} \qquad (75)$$

From Equation 6 we have

$$\sum_{i=1}^n (x_i - \mu)^2 = n(\mu - \bar{x})^2 + \sum_{i=1}^n (x_i - \bar{x})^2 \qquad (76)$$

Also, it can be shown that

$$\kappa_0(\mu - \mu_0)^2 + n(\mu - \bar{x})^2 = (\kappa_0 + n)(\mu - \mu_n)^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n} \qquad (77)$$

where

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \qquad (78)$$

Hence

$$\kappa_0(\mu - \mu_0)^2 + \sum_i (x_i - \mu)^2 = \kappa_0(\mu - \mu_0)^2 + n(\mu - \bar{x})^2 + \sum_i (x_i - \bar{x})^2 \qquad (79)$$
$$= (\kappa_0 + n)(\mu - \mu_n)^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n} + \sum_i (x_i - \bar{x})^2 \qquad (80)$$

So

$$p(\mu, \lambda|D) \propto \lambda^{\frac{1}{2}}e^{-\frac{\lambda}{2}(\kappa_0 + n)(\mu - \mu_n)^2} \qquad (81)$$
$$\quad\times\ \lambda^{\alpha_0 + n/2 - 1}e^{-\lambda\beta_0}e^{-\frac{\lambda}{2}\sum_i (x_i - \bar{x})^2}e^{-\frac{\lambda}{2}\frac{\kappa_0 n(\bar{x} - \mu_0)^2}{\kappa_0 + n}} \qquad (82)$$
$$\propto N\big(\mu|\mu_n, ((\kappa_0 + n)\lambda)^{-1}\big)\, Ga\big(\lambda|\alpha_0 + n/2, \beta_n\big) \qquad (83)$$

where

$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \qquad (84)$$

In summary,

$$p(\mu, \lambda|D) = NG(\mu, \lambda|\mu_n, \kappa_n, \alpha_n, \beta_n) \qquad (85)$$
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_0 + n} \qquad (86)$$
$$\kappa_n = \kappa_0 + n \qquad (87)$$
$$\alpha_n = \alpha_0 + n/2 \qquad (88)$$
$$\beta_n = \beta_0 + \frac{1}{2}\sum_{i=1}^n (x_i - \bar{x})^2 + \frac{\kappa_0 n(\bar{x} - \mu_0)^2}{2(\kappa_0 + n)} \qquad (89)$$

We see that the posterior sum of squares, $\beta_n$, combines the prior sum of squares, $\beta_0$, the sample sum of squares, $\sum_i (x_i - \bar{x})^2$, and a term due to the discrepancy between the prior mean and sample mean. As can be seen from Figure 3, the range of probable values for $\mu$ and $\sigma^2$ can be quite large even for moderate $n$. Keep this picture in mind whenever someone claims to have fit a Gaussian to their data.
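The update equations above translate directly into code. Here is a minimal sketch (ours, in Python) of Equations 85-89; the names and test data are illustrative.

```python
import numpy as np

def normal_gamma_posterior(x, mu0, kappa0, alpha0, beta0):
    """Posterior NG(mu, lambda | mu_n, kappa_n, alpha_n, beta_n), Equations 85-89."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    kappa_n = kappa0 + n
    alpha_n = alpha0 + n / 2.0
    beta_n = beta0 + 0.5 * np.sum((x - xbar) ** 2) \
             + kappa0 * n * (xbar - mu0) ** 2 / (2.0 * (kappa0 + n))
    return mu_n, kappa_n, alpha_n, beta_n

rng = np.random.default_rng(2)
x = rng.normal(1.0, 2.0, size=20)
print(normal_gamma_posterior(x, mu0=0.0, kappa0=1.0, alpha0=1.0, beta0=1.0))
```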

3.3.1 Posterior marginals

The posterior marginals are (using Equation 72)

$$p(\lambda|D) = Ga(\lambda|\alpha_n, \beta_n) \qquad (90)$$
$$p(\mu|D) = T_{2\alpha_n}\big(\mu|\mu_n, \beta_n/(\alpha_n\kappa_n)\big) \qquad (91)$$

3.4 Marginal likelihood

To derive the marginal likelihood, we just re-derive the posterior, but this time we keep track of all the constant factors. Let $NG'(\mu, \lambda|\mu_0, \kappa_0, \alpha_0, \beta_0)$ denote an unnormalized Normal-Gamma distribution, and let $Z_0 = Z_{NG}(\mu_0, \kappa_0, \alpha_0, \beta_0)$ be the normalization constant of the prior; similarly let $Z_n$ be the normalization constant of the posterior. Let $N'(x_i|\mu, \lambda)$ denote an unnormalized Gaussian with normalization constant $1/\sqrt{2\pi}$. Then

$$p(\mu, \lambda|D) = \frac{1}{p(D)}\frac{1}{Z_0}NG'(\mu, \lambda|\mu_0, \kappa_0, \alpha_0, \beta_0)\Big(\frac{1}{2\pi}\Big)^{n/2}\prod_i N'(x_i|\mu, \lambda) \qquad (92)$$

The $NG'$ and $N'$ terms combine to make the posterior $NG'$:

$$p(\mu, \lambda|D) = \frac{1}{Z_n}NG'(\mu, \lambda|\mu_n, \kappa_n, \alpha_n, \beta_n) \qquad (93)$$

Hence

$$p(D) = \frac{Z_n}{Z_0}(2\pi)^{-n/2} \qquad (94)$$
$$= \frac{\Gamma(\alpha_n)}{\Gamma(\alpha_0)}\frac{\beta_0^{\alpha_0}}{\beta_n^{\alpha_n}}\Big(\frac{\kappa_0}{\kappa_n}\Big)^{\frac{1}{2}}(2\pi)^{-n/2} \qquad (95)$$

3.5 Posterior predictive

The posterior predictive for $m$ new observations is given by

$$p(D_{new}|D) = \frac{p(D_{new}, D)}{p(D)} \qquad (96)$$
$$= \frac{Z_{n+m}}{Z_0}(2\pi)^{-(n+m)/2}\,\frac{Z_0}{Z_n}(2\pi)^{n/2} \qquad (97)$$
$$= \frac{Z_{n+m}}{Z_n}(2\pi)^{-m/2} \qquad (98)$$
$$= \frac{\Gamma(\alpha_{n+m})}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\beta_{n+m}^{\alpha_{n+m}}}\Big(\frac{\kappa_n}{\kappa_{n+m}}\Big)^{\frac{1}{2}}(2\pi)^{-m/2} \qquad (99)$$

In the special case that $m = 1$, it can be shown (see below) that this is a T-distribution

$$p(x|D) = t_{2\alpha_n}\Big(x\Big|\mu_n, \frac{\beta_n(\kappa_n + 1)}{\alpha_n\kappa_n}\Big) \qquad (100)$$

To derive the $m = 1$ result, we proceed as follows. (This proof is by Xiang Xuan, and is based on [GH94, p10].) When $m = 1$, the posterior parameters are

$$\alpha_{n+1} = \alpha_n + 1/2 \qquad (101)$$
$$\kappa_{n+1} = \kappa_n + 1 \qquad (102)$$
$$\beta_{n+1} = \beta_n + \frac{1}{2}\sum_{i=1}^1 (x_i - \bar{x})^2 + \frac{\kappa_n(\bar{x} - \mu_n)^2}{2(\kappa_n + 1)} \qquad (103)$$

Use the fact that when $m = 1$, we have $x_1 = \bar{x}$ (since there is only one observation), hence $\frac{1}{2}\sum_{i=1}^1 (x_i - \bar{x})^2 = 0$. Let us use $x$ to denote $D_{new}$; then $\beta_{n+1}$ is

$$\beta_{n+1} = \beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)} \qquad (104)$$

Substituting, we have the following,

$$p(D_{new}|D) = \frac{\Gamma(\alpha_{n+1})}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\beta_{n+1}^{\alpha_{n+1}}}\Big(\frac{\kappa_n}{\kappa_{n+1}}\Big)^{\frac{1}{2}}(2\pi)^{-1/2} \qquad (105)$$
$$= \frac{\Gamma(\alpha_n + 1/2)}{\Gamma(\alpha_n)}\frac{\beta_n^{\alpha_n}}{\big(\beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)}\big)^{\alpha_n + 1/2}}\Big(\frac{\kappa_n}{\kappa_n + 1}\Big)^{\frac{1}{2}}(2\pi)^{-1/2} \qquad (106)$$
$$= \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma(2\alpha_n/2)}\Bigg(\frac{\beta_n}{\beta_n + \frac{\kappa_n(x - \mu_n)^2}{2(\kappa_n + 1)}}\Bigg)^{\alpha_n + 1/2}\beta_n^{-\frac{1}{2}}\Big(\frac{\kappa_n}{2(\kappa_n + 1)}\Big)^{\frac{1}{2}}\pi^{-1/2} \qquad (107)$$
$$= \frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma(2\alpha_n/2)}\Bigg(\frac{1}{1 + \frac{\kappa_n(x - \mu_n)^2}{2\beta_n(\kappa_n + 1)}}\Bigg)^{\alpha_n + 1/2}\Big(\frac{\kappa_n}{2\beta_n(\kappa_n + 1)}\Big)^{\frac{1}{2}}\pi^{-1/2} \qquad (108)$$
$$= \pi^{-1/2}\frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma(2\alpha_n/2)}\Big(\frac{\alpha_n\kappa_n}{2\alpha_n\beta_n(\kappa_n + 1)}\Big)^{\frac{1}{2}}\Big(1 + \frac{\alpha_n\kappa_n(x - \mu_n)^2}{2\alpha_n\beta_n(\kappa_n + 1)}\Big)^{-(2\alpha_n + 1)/2} \qquad (109)$$

Let $\Lambda = \frac{\alpha_n\kappa_n}{\beta_n(\kappa_n + 1)}$; then we have

$$p(D_{new}|D) = \pi^{-1/2}\frac{\Gamma((2\alpha_n + 1)/2)}{\Gamma(2\alpha_n/2)}\Big(\frac{\Lambda}{2\alpha_n}\Big)^{\frac{1}{2}}\Big(1 + \frac{\Lambda(x - \mu_n)^2}{2\alpha_n}\Big)^{-(2\alpha_n + 1)/2} \qquad (110)$$

We can see this is a T-distribution with center at $\mu_n$, precision $\Lambda = \frac{\alpha_n\kappa_n}{\beta_n(\kappa_n + 1)}$, and degrees of freedom $2\alpha_n$.
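Since Equation 100 is a shifted and scaled Student t, it can be evaluated with a standard library. A small sketch (ours): the mapping to scipy's (df, loc, scale) convention is spelled out in the comment, and the example hyper-parameter values are hypothetical.

```python
import numpy as np
from scipy import stats

def ng_posterior_predictive(x_new, mu_n, kappa_n, alpha_n, beta_n):
    """Evaluate Equation 100: t_{2 alpha_n}(x | mu_n, beta_n (kappa_n + 1) / (alpha_n kappa_n)).

    scipy's Student t takes (df, loc, scale), where scale is the square root of the
    sigma^2 parameter in the report's t_nu(x | mu, sigma^2) convention.
    """
    df = 2.0 * alpha_n
    scale2 = beta_n * (kappa_n + 1.0) / (alpha_n * kappa_n)
    return stats.t.pdf(x_new, df, loc=mu_n, scale=np.sqrt(scale2))

# Hypothetical posterior hyper-parameters, e.g. the output of the earlier NG sketch.
print(ng_posterior_predictive(0.5, mu_n=1.0, kappa_n=21.0, alpha_n=11.0, beta_n=40.0))
```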

3.6 Reference analysis

The reference prior for NG is

$$p(m, \lambda) \propto \lambda^{-1} = NG(m, \lambda|\mu = \cdot, \kappa = 0, \alpha = -\tfrac{1}{2}, \beta = 0) \qquad (111)$$

So the posterior is

$$p(m, \lambda|D) = NG\Big(\mu_n = \bar{x},\ \kappa_n = n,\ \alpha_n = (n - 1)/2,\ \beta_n = \tfrac{1}{2}\textstyle\sum_{i=1}^n (x_i - \bar{x})^2\Big) \qquad (112)$$

So the posterior marginal of the mean is

$$p(m|D) = t_{n-1}\Big(m\Big|\bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n - 1)}\Big) \qquad (113)$$

which corresponds to the frequentist sampling distribution of the MLE $\hat{\mu}$. Thus in this case, the confidence interval and credible interval coincide.

4 Gamma prior

If $\mu$ is known, and only $\lambda$ is unknown (e.g., when implementing Gibbs sampling), we can use the following results, which can be derived by simplifying the results for the Normal-NG model.

4.1 Likelihood

$$p(D|\lambda) \propto \lambda^{n/2}\exp\Big(-\frac{\lambda}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) \qquad (114)$$

4.2 Prior

$$p(\lambda) = Ga(\lambda|\alpha, \beta) \propto \lambda^{\alpha - 1}e^{-\lambda\beta} \qquad (115)$$

4.3 Posterior

$$p(\lambda|D) = Ga(\lambda|\alpha_n, \beta_n) \qquad (116)$$
$$\alpha_n = \alpha + n/2 \qquad (117)$$
$$\beta_n = \beta + \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2 \qquad (118)$$
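In a Gibbs sampler this conditional is used to resample $\lambda$ given $\mu$. A minimal sketch (ours, in Python); note that NumPy's gamma sampler is parameterized by scale = 1/rate.

```python
import numpy as np

def gamma_posterior_for_precision(x, mu, alpha, beta):
    """Ga(lambda | alpha_n, beta_n) of Equations 116-118 (mean mu known)."""
    x = np.asarray(x, dtype=float)
    alpha_n = alpha + x.size / 2.0
    beta_n = beta + 0.5 * np.sum((x - mu) ** 2)
    return alpha_n, beta_n

# One Gibbs-style draw of the precision given mu.
rng = np.random.default_rng(3)
x = rng.normal(0.0, 2.0, size=50)
a_n, b_n = gamma_posterior_for_precision(x, mu=0.0, alpha=1.0, beta=1.0)
lam = rng.gamma(shape=a_n, scale=1.0 / b_n)   # numpy uses scale = 1/rate
print(a_n, b_n, lam, 1.0 / lam)               # 1/lam approximates the variance (true value 4)
```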

4.4 Marginal likelihood

To be completed.

4.5 Posterior predictive

$$p(x|D) = t_{2\alpha_n}(x|\mu, \sigma^2 = \beta_n/\alpha_n) \qquad (119)$$

4.6 Reference analysis

$$p(\lambda) \propto \lambda^{-1} = Ga(\lambda|0, 0) \qquad (120)$$
$$p(\lambda|D) = Ga\Big(\lambda\Big|n/2, \frac{1}{2}\sum_{i=1}^n (x_i - \mu)^2\Big) \qquad (121)$$

5 Normal-inverse-chi-squared (NIX) prior

We will see that the natural conjugate prior for $\sigma^2$ is the inverse-chi-squared distribution.

5.1 Likelihood

The likelihood can be written in this form

$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}}(\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\Big[\sum_{i=1}^n (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big]\Big) \qquad (122)$$

5.2 Prior

The normal-inverse-chi-squared prior is

$$p(\mu, \sigma^2) = NI\chi^2(\mu_0, \kappa_0, \nu_0, \sigma_0^2) \qquad (123)$$
$$= N(\mu|\mu_0, \sigma^2/\kappa_0)\, \chi^{-2}(\sigma^2|\nu_0, \sigma_0^2) \qquad (124)$$
$$= \frac{1}{Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2)}\sigma^{-1}(\sigma^2)^{-(\nu_0/2 + 1)}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2\big]\Big) \qquad (125)$$
$$Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2) = \frac{\sqrt{2\pi}}{\sqrt{\kappa_0}}\,\Gamma(\nu_0/2)\Big(\frac{2}{\nu_0\sigma_0^2}\Big)^{\nu_0/2} \qquad (126)$$

Figure 4: The $NI\chi^2(\mu_0, \kappa_0, \nu_0, \sigma_0^2)$ distribution. $\mu_0$ is the prior mean and $\kappa_0$ is how strongly we believe this; $\sigma_0^2$ is the prior variance and $\nu_0$ is how strongly we believe this. (a) $\mu_0 = 0, \kappa_0 = 1, \nu_0 = 1, \sigma_0^2 = 1$. Notice that the contour plot (underneath the surface) is shaped like a squashed egg. (b) We increase the strength of our belief in the mean, so it gets narrower: $\mu_0 = 0, \kappa_0 = 5, \nu_0 = 1, \sigma_0^2 = 1$. (c) We increase the strength of our belief in the variance, so it gets narrower: $\mu_0 = 0, \kappa_0 = 1, \nu_0 = 5, \sigma_0^2 = 1$. (d) We strongly believe the mean and variance are 0.5: $\mu_0 = 0.5, \kappa_0 = 5, \nu_0 = 5, \sigma_0^2 = 0.5$. These plots were produced with NIXdemo2.

See Figure 4 for some plots. The hyperparameters $\mu_0$ and $\sigma^2/\kappa_0$ can be interpreted as the location and scale of $\mu$, and the hyperparameters $\nu_0$ and $\sigma_0^2$ as the degrees of freedom and scale of $\sigma^2$.

For future reference, it is useful to note that the quadratic term in the prior can be written as

$$Q_0(\mu) = S_0 + \kappa_0(\mu - \mu_0)^2 \qquad (127)$$
$$= \kappa_0\mu^2 - 2(\kappa_0\mu_0)\mu + (\kappa_0\mu_0^2 + S_0) \qquad (128)$$

where $S_0 = \nu_0\sigma_0^2$ is the prior sum of squares.

5.3 Posterior

(The following derivation is based on [Lee04, p67].) The posterior is

$$p(\mu, \sigma^2|D) \propto N(\mu|\mu_0, \sigma^2/\kappa_0)\,\chi^{-2}(\sigma^2|\nu_0, \sigma_0^2)\, p(D|\mu, \sigma^2) \qquad (129)$$
$$\propto \Big[\sigma^{-1}(\sigma^2)^{-(\nu_0/2 + 1)}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_0\sigma_0^2 + \kappa_0(\mu_0 - \mu)^2\big]\Big)\Big] \qquad (130)$$
$$\quad\times \Big[(\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\big[n s^2 + n(\bar{x} - \mu)^2\big]\Big)\Big] \qquad (131)$$
$$\propto \sigma^{-3}(\sigma^2)^{-\nu_n/2}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2\big]\Big) = NI\chi^2(\mu_n, \kappa_n, \nu_n, \sigma_n^2) \qquad (132)$$

Matching powers of $\sigma^2$, we find

$$\nu_n = \nu_0 + n \qquad (133)$$

To derive the other terms, we will complete the square. Let $S_0 = \nu_0\sigma_0^2$ and $S_n = \nu_n\sigma_n^2$ for brevity. Grouping the terms inside the exponential, we have

$$S_0 + \kappa_0(\mu_0 - \mu)^2 + n s^2 + n(\bar{x} - \mu)^2 = (S_0 + \kappa_0\mu_0^2 + n s^2 + n\bar{x}^2) + \mu^2(\kappa_0 + n) - 2\mu(\kappa_0\mu_0 + n\bar{x}) \qquad (134)$$

Comparing to Equation 128, we have

$$\kappa_n = \kappa_0 + n \qquad (135)$$
$$\kappa_n\mu_n = \kappa_0\mu_0 + n\bar{x} \qquad (136)$$
$$S_n + \kappa_n\mu_n^2 = S_0 + \kappa_0\mu_0^2 + n s^2 + n\bar{x}^2 \qquad (137)$$
$$S_n = S_0 + n s^2 + \kappa_0\mu_0^2 + n\bar{x}^2 - \kappa_n\mu_n^2 \qquad (138)$$

One can rearrange this to get

$$S_n = S_0 + n s^2 + (\kappa_0^{-1} + n^{-1})^{-1}(\mu_0 - \bar{x})^2 \qquad (139)$$
$$= S_0 + n s^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2 \qquad (140)$$

We see that the posterior sum of squares, $S_n = \nu_n\sigma_n^2$, combines the prior sum of squares, $S_0 = \nu_0\sigma_0^2$, the sample sum of squares, $n s^2$, and a term due to the uncertainty in the mean.

In summary,

$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_n} \qquad (141)$$
$$\kappa_n = \kappa_0 + n \qquad (142)$$
$$\nu_n = \nu_0 + n \qquad (143)$$
$$\sigma_n^2 = \frac{1}{\nu_n}\Big(\nu_0\sigma_0^2 + \sum_i (x_i - \bar{x})^2 + \frac{n\kappa_0}{\kappa_0 + n}(\mu_0 - \bar{x})^2\Big) \qquad (144)$$

The posterior mean is given by

$$E[\mu|D] = \mu_n \qquad (145)$$
$$E[\sigma^2|D] = \frac{\nu_n}{\nu_n - 2}\sigma_n^2 \qquad (146)$$

The posterior mode is given by (Equation 14 of [BL01]):

$$\mathrm{mode}[\mu|D] = \mu_n \qquad (147)$$
$$\mathrm{mode}[\sigma^2|D] = \frac{\nu_n\sigma_n^2}{\nu_n - 1} \qquad (148)$$

The modes of the marginal posterior are

$$\mathrm{mode}[\mu|D] = \mu_n \qquad (149)$$
$$\mathrm{mode}[\sigma^2|D] = \frac{\nu_n\sigma_n^2}{\nu_n + 2} \qquad (150)$$
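A small sketch (ours, in Python) of the $NI\chi^2$ update in Equations 141-144, with the posterior mean of $\sigma^2$ from Equation 146 as a sanity check; all names and test values are illustrative.

```python
import numpy as np

def nix_posterior(x, mu0, kappa0, nu0, sigma2_0):
    """NIchi2 posterior hyper-parameters of Equations 141-144."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    sigma2_n = (nu0 * sigma2_0 + np.sum((x - xbar) ** 2)
                + n * kappa0 * (mu0 - xbar) ** 2 / (kappa0 + n)) / nu_n
    return mu_n, kappa_n, nu_n, sigma2_n

rng = np.random.default_rng(4)
x = rng.normal(2.0, 1.5, size=30)
mu_n, kappa_n, nu_n, sigma2_n = nix_posterior(x, mu0=0.0, kappa0=1.0, nu0=1.0, sigma2_0=1.0)
print(mu_n, sigma2_n)                  # posterior location and scale
print(nu_n * sigma2_n / (nu_n - 2))    # posterior mean of sigma^2, Equation 146
```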

5.3.1 Marginal posterior of $\sigma^2$

First we integrate out $\mu$, which is just a Gaussian integral.

$$p(\sigma^2|D) = \int p(\sigma^2, \mu|D)\, d\mu \qquad (151)$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_n\sigma_n^2\big]\Big)\int \exp\Big(-\frac{\kappa_n}{2\sigma^2}(\mu_n - \mu)^2\Big)\, d\mu \qquad (152)$$
$$\propto \sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_n\sigma_n^2\big]\Big)\frac{\sqrt{2\pi}\,\sigma}{\sqrt{\kappa_n}} \qquad (153)$$
$$\propto (\sigma^2)^{-(\nu_n/2 + 1)}\exp\Big(-\frac{1}{2\sigma^2}\big[\nu_n\sigma_n^2\big]\Big) \qquad (154)$$
$$= \chi^{-2}(\sigma^2|\nu_n, \sigma_n^2) \qquad (155)$$

5.3.2 Marginal posterior of $\mu$

Let us rewrite the posterior as

$$p(\mu, \sigma^2|D) = C\,\phi^{-(\alpha + 1)}\exp\Big(-\frac{1}{2\phi}\big[\nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2\big]\Big) \qquad (156)$$

where $\phi = \sigma^2$ and $\alpha = (\nu_n + 1)/2$. This follows since

$$\sigma^{-1}(\sigma^2)^{-(\nu_n/2 + 1)} = \phi^{-\frac{1}{2}}\phi^{-\frac{\nu_n}{2} - 1} = \phi^{-\frac{\nu_n + 1}{2} - 1} = \phi^{-(\alpha + 1)} \qquad (157)$$

Now make the substitutions

$$A = \nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2 \qquad (158)$$
$$x = \frac{A}{2\phi} \qquad (159)$$
$$\frac{d\phi}{dx} = -\frac{A}{2x^2} \qquad (160)$$

so

$$p(\mu|D) = C\int \phi^{-(\alpha + 1)}e^{-A/(2\phi)}\, d\phi \qquad (161)$$
$$= -\frac{A}{2}C\int \Big(\frac{A}{2x}\Big)^{-(\alpha + 1)}e^{-x}x^{-2}\, dx \qquad (162)$$
$$\propto A^{-\alpha}\int x^{\alpha - 1}e^{-x}\, dx \qquad (163)$$
$$\propto A^{-\alpha} \qquad (164)$$
$$= \big(\nu_n\sigma_n^2 + \kappa_n(\mu_n - \mu)^2\big)^{-(\nu_n + 1)/2} \qquad (165)$$
$$\propto \Big[1 + \frac{\kappa_n}{\nu_n\sigma_n^2}(\mu - \mu_n)^2\Big]^{-(\nu_n + 1)/2} \qquad (166)$$
$$\propto t_{\nu_n}(\mu|\mu_n, \sigma_n^2/\kappa_n) \qquad (167)$$

5.4 Marginal likelihood

Repeating the derivation of the posterior, but keeping track of the normalization constants, gives the following.

$$p(D) = \int\!\!\int p(D|\mu, \sigma^2)\, p(\mu, \sigma^2)\, d\mu\, d\sigma^2 \qquad (168)$$
$$= \frac{Z_p(\mu_n, \kappa_n, \nu_n, \sigma_n^2)}{Z_p(\mu_0, \kappa_0, \nu_0, \sigma_0^2)}\frac{1}{Z_N^l} \qquad (169)$$
$$= \frac{\sqrt{\kappa_0}}{\sqrt{\kappa_n}}\frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\Big(\frac{\nu_0\sigma_0^2}{2}\Big)^{\nu_0/2}\Big(\frac{2}{\nu_n\sigma_n^2}\Big)^{\nu_n/2}\frac{1}{(2\pi)^{n/2}} \qquad (170)$$
$$= \frac{\Gamma(\nu_n/2)}{\Gamma(\nu_0/2)}\sqrt{\frac{\kappa_0}{\kappa_n}}\frac{(\nu_0\sigma_0^2)^{\nu_0/2}}{(\nu_n\sigma_n^2)^{\nu_n/2}}\frac{1}{\pi^{n/2}} \qquad (171)$$

5.5 Posterior predictive

$$p(x|D) = \int\!\!\int p(x|\mu, \sigma^2)\, p(\mu, \sigma^2|D)\, d\mu\, d\sigma^2 \qquad (172)$$
$$= \frac{p(x, D)}{p(D)} \qquad (173)$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\sqrt{\frac{\kappa_n}{\kappa_n + 1}}\frac{(\nu_n\sigma_n^2)^{\nu_n/2}}{\big(\nu_n\sigma_n^2 + \frac{\kappa_n}{\kappa_n + 1}(x - \mu_n)^2\big)^{(\nu_n + 1)/2}}\frac{1}{\pi^{1/2}} \qquad (174)$$
$$= \frac{\Gamma((\nu_n + 1)/2)}{\Gamma(\nu_n/2)}\Big(\frac{\kappa_n}{(\kappa_n + 1)\pi\nu_n\sigma_n^2}\Big)^{\frac{1}{2}}\Big(1 + \frac{\kappa_n(x - \mu_n)^2}{(\kappa_n + 1)\nu_n\sigma_n^2}\Big)^{-(\nu_n + 1)/2} \qquad (175)$$
$$= t_{\nu_n}\Big(\mu_n, \frac{(1 + \kappa_n)\sigma_n^2}{\kappa_n}\Big) \qquad (176)$$
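Equation 176 is again a shifted and scaled Student t, so it can be evaluated directly. A minimal sketch (ours), assuming scipy's (df, loc, scale) convention with scale equal to the square root of the report's $\sigma^2$ parameter; the example values are hypothetical.

```python
import numpy as np
from scipy import stats

def nix_posterior_predictive(x_new, mu_n, kappa_n, nu_n, sigma2_n):
    """Equation 176: p(x|D) = t_{nu_n}(x | mu_n, (1 + kappa_n) * sigma2_n / kappa_n)."""
    scale2 = (1.0 + kappa_n) * sigma2_n / kappa_n
    return stats.t.pdf(x_new, df=nu_n, loc=mu_n, scale=np.sqrt(scale2))

# Hypothetical posterior hyper-parameters, e.g. the output of the nix_posterior sketch above.
print(nix_posterior_predictive(1.8, mu_n=2.0, kappa_n=31.0, nu_n=31.0, sigma2_n=2.2))
```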

5.6 Reference analysis

The reference prior is $p(\mu, \sigma^2) \propto (\sigma^2)^{-1}$, which can be modeled by $\kappa_0 = 0$, $\nu_0 = -1$, $\sigma_0 = 0$, since then we get

$$p(\mu, \sigma^2) \propto \sigma^{-1}(\sigma^2)^{-(-\frac{1}{2} + 1)}e^{-0} = \sigma^{-1}(\sigma^2)^{-1/2} = \sigma^{-2} \qquad (177)$$

(See also [DeG70, p197] and [GCSR04, p88].) With the reference prior, the posterior is

$$\mu_n = \bar{x} \qquad (178)$$
$$\nu_n = n - 1 \qquad (179)$$
$$\kappa_n = n \qquad (180)$$
$$\sigma_n^2 = \frac{\sum_i (x_i - \bar{x})^2}{n - 1} \qquad (181)$$
$$p(\mu, \sigma^2|D) \propto \sigma^{-n-2}\exp\Big(-\frac{1}{2\sigma^2}\Big[\sum_i (x_i - \bar{x})^2 + n(\bar{x} - \mu)^2\Big]\Big) \qquad (182)$$

The posterior marginals are

$$p(\sigma^2|D) = \chi^{-2}\Big(\sigma^2\Big|n - 1, \frac{\sum_i (x_i - \bar{x})^2}{n - 1}\Big) \qquad (183)$$
$$p(\mu|D) = t_{n-1}\Big(\mu\Big|\bar{x}, \frac{\sum_i (x_i - \bar{x})^2}{n(n - 1)}\Big) \qquad (184)$$

which are very closely related to the sampling distribution of the MLE. The posterior predictive is

$$p(x|D) = t_{n-1}\Big(\bar{x}, \frac{(1 + n)\sum_i (x_i - \bar{x})^2}{n(n - 1)}\Big) \qquad (185)$$

Note that [Min00] argues that Jeffreys' principle says the uninformative prior should be of the form

$$\lim_{k\to 0} N(\mu|\mu_0, \sigma^2/k)\,\chi^{-2}(\sigma^2|k, \sigma_0^2) \propto (2\pi\sigma^2)^{-\frac{1}{2}}(\sigma^2)^{-1} \propto \sigma^{-3} \qquad (186)$$

This can be achieved by setting $\nu_0 = 0$ instead of $\nu_0 = -1$.

6 Normal-inverse-Gamma (NIG) prior

Another popular parameterization is the following:

$$p(\mu, \sigma^2) = NIG(m, V, a, b) \qquad (187)$$
$$= N(\mu|m, \sigma^2 V)\, IG(\sigma^2|a, b) \qquad (188)$$

6.1 Likelihood

The likelihood can be written in this form

$$p(D|\mu, \sigma^2) = \frac{1}{(2\pi)^{n/2}}(\sigma^2)^{-n/2}\exp\Big(-\frac{1}{2\sigma^2}\big[n s^2 + n(\bar{x} - \mu)^2\big]\Big) \qquad (189)$$

6.2 Prior

$$p(\mu, \sigma^2) = NIG(m_0, V_0, a_0, b_0) \qquad (190)$$
$$= N(\mu|m_0, \sigma^2 V_0)\, IG(\sigma^2|a_0, b_0) \qquad (191)$$

This is equivalent to the $NI\chi^2$ prior, where we make the following substitutions.

$$m_0 = \mu_0 \qquad (192)$$
$$V_0 = \frac{1}{\kappa_0} \qquad (193)$$
$$a_0 = \frac{\nu_0}{2} \qquad (194)$$
$$b_0 = \frac{\nu_0\sigma_0^2}{2} \qquad (195)$$

6.3 Posterior

We can show that the posterior is also NIG:

$$p(\mu, \sigma^2|D) = NIG(m_n, V_n, a_n, b_n) \qquad (196)$$
$$V_n^{-1} = V_0^{-1} + n \qquad (197)$$
$$\frac{m_n}{V_n} = V_0^{-1}m_0 + n\bar{x} \qquad (198)$$
$$a_n = a_0 + n/2 \qquad (199)$$
$$b_n = b_0 + \frac{1}{2}\Big[m_0^2 V_0^{-1} + \sum_i x_i^2 - m_n^2 V_n^{-1}\Big] \qquad (200)$$

The NIG posterior follows directly from the $NI\chi^2$ results using the specified substitutions. (The $b_n$ term requires some tedious algebra...)
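A minimal sketch (ours, in Python) of the NIG update in Equations 196-200; the test data and hyper-parameter values are illustrative.

```python
import numpy as np

def nig_posterior(x, m0, V0, a0, b0):
    """NIG posterior hyper-parameters of Equations 196-200."""
    x = np.asarray(x, dtype=float)
    n, xbar = x.size, x.mean()
    Vn = 1.0 / (1.0 / V0 + n)
    mn = Vn * (m0 / V0 + n * xbar)
    an = a0 + n / 2.0
    bn = b0 + 0.5 * (m0 ** 2 / V0 + np.sum(x ** 2) - mn ** 2 / Vn)
    return mn, Vn, an, bn

rng = np.random.default_rng(5)
x = rng.normal(-1.0, 0.5, size=25)
print(nig_posterior(x, m0=0.0, V0=10.0, a0=1.0, b0=1.0))
```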

6.3.1 Posterior marginals

To be derived.

6.4 Marginal likelihood

For the marginal likelihood, substituting into Equation 171 we have

$$p(D) = \frac{\Gamma(a_n)}{\Gamma(a_0)}\sqrt{\frac{V_n}{V_0}}\frac{(2b_0)^{a_0}}{(2b_n)^{a_n}}\frac{1}{\pi^{n/2}} \qquad (201)$$
$$= \frac{|V_n|^{\frac{1}{2}}}{|V_0|^{\frac{1}{2}}}\frac{b_0^{a_0}}{b_n^{a_n}}\frac{\Gamma(a_n)}{\Gamma(a_0)}\frac{1}{\pi^{n/2}}\, 2^{a_0 - a_n} \qquad (202)$$
$$= \frac{|V_n|^{\frac{1}{2}}}{|V_0|^{\frac{1}{2}}}\frac{b_0^{a_0}}{b_n^{a_n}}\frac{\Gamma(a_n)}{\Gamma(a_0)}\frac{1}{\pi^{n/2}\, 2^{n/2}} \qquad (203)$$

6.5 Posterior predictive

For the predictive density, substituting into Equation 176 we have

$$\frac{\kappa_n}{(1 + \kappa_n)\sigma_n^2} = \frac{1}{(\frac{1}{\kappa_n} + 1)\sigma_n^2} \qquad (204)$$
$$= \frac{2a_n}{2b_n(1 + V_n)} \qquad (205)$$

So

$$p(y|D) = t_{2a_n}\Big(m_n, \frac{b_n(1 + V_n)}{a_n}\Big) \qquad (206)$$

These results follow from [DHMS02, p240] by setting $x = 1$, $\beta = \mu$, $B^T B = n$, $B^T X = n\bar{x}$, $X^T X = \sum_i x_i^2$. Note that we use a different parameterization of the Student t. Also, our equations for $p(D)$ differ by a $2^n$ term.

7 Multivariate Normal prior

If $\Sigma$ is known, then a conjugate analysis of the mean is very simple, since the conjugate prior for the mean is Gaussian, the likelihood is Gaussian, and hence the posterior is Gaussian. The results are analogous to the scalar case. In particular, we use the general result from [Bis06, p92] with the following substitutions:

$$x = \mu,\quad y = \bar{x},\quad \Lambda^{-1} = \Sigma_0,\quad A = I,\quad b = 0,\quad L^{-1} = \Sigma/N \qquad (207)$$

7.1 Prior

$$p(\mu) = N(\mu|\mu_0, \Sigma_0) \qquad (208)$$

7.2 Likelihood

$$p(D|\mu, \Sigma) \propto N\Big(\bar{x}\Big|\mu, \frac{1}{N}\Sigma\Big) \qquad (209)$$

7.3 Posterior

$$p(\mu|D, \Sigma) = N(\mu|\mu_N, \Sigma_N) \qquad (210)$$
$$\Sigma_N = \big(\Sigma_0^{-1} + N\Sigma^{-1}\big)^{-1} \qquad (211)$$
$$\mu_N = \Sigma_N\big(N\Sigma^{-1}\bar{x} + \Sigma_0^{-1}\mu_0\big) \qquad (212)$$
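A small sketch (ours, in Python/NumPy) of Equations 210-212; for brevity it uses explicit matrix inverses rather than a numerically safer solve, and the test data are made up.

```python
import numpy as np

def mvn_posterior_known_cov(X, Sigma, mu0, Sigma0):
    """Posterior N(mu | mu_N, Sigma_N) of Equations 210-212 (Sigma known)."""
    X = np.atleast_2d(X)
    N, xbar = X.shape[0], X.mean(axis=0)
    Sigma_inv, Sigma0_inv = np.linalg.inv(Sigma), np.linalg.inv(Sigma0)
    Sigma_N = np.linalg.inv(Sigma0_inv + N * Sigma_inv)
    mu_N = Sigma_N @ (N * Sigma_inv @ xbar + Sigma0_inv @ mu0)
    return mu_N, Sigma_N

rng = np.random.default_rng(6)
Sigma = np.array([[2.0, 0.5], [0.5, 1.0]])
X = rng.multivariate_normal([1.0, -1.0], Sigma, size=50)
mu_N, Sigma_N = mvn_posterior_known_cov(X, Sigma, mu0=np.zeros(2), Sigma0=np.eye(2))
print(mu_N)
print(Sigma_N)
```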

7.4 Posterior predictive

$$p(x|D) = N(x|\mu_N, \Sigma + \Sigma_N) \qquad (213)$$

7.5 Reference analysis

$$p(\mu) \propto 1 = N(\mu|\cdot, \infty I) \qquad (214)$$
$$p(\mu|D) = N(\bar{x}, \Sigma/n) \qquad (215)$$

8 Normal-Wishart prior

The multivariate analog of the normal-gamma prior is the normal-Wishart prior. Here we just state the results without proof; see [DeG70, p178] for details. We assume $X$ is $d$-dimensional.

8.1 Likelihood

$$p(D|\mu, \Lambda) = (2\pi)^{-nd/2}|\Lambda|^{n/2}\exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Lambda(x_i - \mu)\Big) \qquad (216)$$

8.2 Prior

$$p(\mu, \Lambda) = NWi(\mu, \Lambda|\mu_0, \kappa, \nu, T) = N(\mu|\mu_0, (\kappa\Lambda)^{-1})\, Wi_\nu(\Lambda|T) \qquad (217)$$
$$= \frac{1}{Z}|\Lambda|^{\frac{1}{2}}\exp\Big(-\frac{\kappa}{2}(\mu - \mu_0)^T\Lambda(\mu - \mu_0)\Big)|\Lambda|^{(\nu - d - 1)/2}\exp\Big(-\frac{1}{2}\mathrm{tr}(T\Lambda)\Big) \qquad (218)$$
$$Z = \Big(\frac{2\pi}{\kappa}\Big)^{d/2}|T|^{-\nu/2}\, 2^{\nu d/2}\,\Gamma_d(\nu/2) \qquad (219)$$

Here $T$ is the prior covariance. To see the connection to the scalar case, make the substitutions

$$\alpha_0 = \frac{\nu_0}{2},\quad \beta_0 = \frac{T_0}{2} \qquad (220)$$

8.3 Posterior

$$p(\mu, \Lambda|D) = N(\mu|\mu_n, (\kappa_n\Lambda)^{-1})\, Wi_{\nu_n}(\Lambda|T_n) \qquad (221)$$
$$\mu_n = \frac{\kappa\mu_0 + n\bar{x}}{\kappa + n} \qquad (222)$$
$$T_n = T + S + \frac{\kappa n}{\kappa + n}(\mu_0 - \bar{x})(\mu_0 - \bar{x})^T \qquad (223)$$
$$S = \sum_{i=1}^n (x_i - \bar{x})(x_i - \bar{x})^T \qquad (224)$$
$$\nu_n = \nu + n \qquad (225)$$
$$\kappa_n = \kappa + n \qquad (226)$$

Posterior marginals

$$p(\Lambda|D) = Wi_{\nu_n}(T_n) \qquad (227)$$
$$p(\mu|D) = t_{\nu_n - d + 1}\Big(\mu\Big|\mu_n, \frac{T_n}{\kappa_n(\nu_n - d + 1)}\Big) \qquad (228)$$
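A minimal sketch (ours, in Python) of the posterior update in Equations 222-226, following the report's convention in which $T_n$ accumulates the scatter matrix; names and test values are illustrative, and the final print is only a rough point estimate of the covariance under that convention.

```python
import numpy as np

def normal_wishart_posterior(X, mu0, kappa0, nu0, T0):
    """Normal-Wishart posterior hyper-parameters of Equations 222-226."""
    X = np.atleast_2d(X)
    n, xbar = X.shape[0], X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)                      # scatter matrix, Equation 224
    mu_n = (kappa0 * mu0 + n * xbar) / (kappa0 + n)
    T_n = T0 + S + kappa0 * n / (kappa0 + n) * np.outer(mu0 - xbar, mu0 - xbar)
    return mu_n, kappa0 + n, nu0 + n, T_n

rng = np.random.default_rng(7)
X = rng.multivariate_normal([0.0, 0.0], [[4.0, 3.0], [3.0, 4.0]], size=100)
mu_n, kappa_n, nu_n, T_n = normal_wishart_posterior(X, np.zeros(2), 1.0, 2.0, np.eye(2))
print(mu_n)
print(T_n / nu_n)      # rough posterior point estimate of the covariance
```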

The MAP estimates are given by

$$(\hat{\mu}, \hat{\Lambda}) = \arg\max_{\mu, \Lambda}\, p(D|\mu, \Lambda)\, NWi(\mu, \Lambda) \qquad (229)$$
$$\hat{\mu} = \frac{\sum_{i=1}^n x_i + \kappa_0\mu_0}{N + \kappa_0} \qquad (230)$$
$$\hat{\Sigma} = \frac{\sum_{i=1}^n (x_i - \hat{\mu})(x_i - \hat{\mu})^T + \kappa_0(\hat{\mu} - \mu_0)(\hat{\mu} - \mu_0)^T + T_0^{-1}}{N + \nu_0 - d} \qquad (231)$$

This reduces to the MLE if $\kappa_0 = 0$, $\nu_0 = d$ and $|T_0| = 0$.

8.4 Posterior predictive

$$p(x|D) = t_{\nu_n - d + 1}\Big(\mu_n, \frac{T_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)}\Big) \qquad (232)$$

If $d = 1$, this reduces to Equation 100.

8.5 Marginal likelihood

This can be computed as a ratio of normalization constants.

$$p(D) = \frac{Z_n}{Z_0}\frac{1}{(2\pi)^{nd/2}} \qquad (233)$$
$$= \frac{1}{\pi^{nd/2}}\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\frac{|T_0|^{\nu_0/2}}{|T_n|^{\nu_n/2}}\Big(\frac{\kappa_0}{\kappa_n}\Big)^{d/2} \qquad (234)$$

This reduces to Equation 95 if $d = 1$.

8.6 Reference analysis

We set

$$\mu_0 = 0,\quad \kappa_0 = 0,\quad \nu_0 = -1,\quad |T_0| = 0 \qquad (235)$$

to give

$$p(\mu, \Lambda) \propto |\Lambda|^{-(d+1)/2} \qquad (236)$$

Then the posterior parameters become

$$\mu_n = \bar{x},\quad T_n = S,\quad \kappa_n = n,\quad \nu_n = n - 1 \qquad (237)$$

the posterior marginals become

$$p(\mu|D) = t_{n-d}\Big(\mu\Big|\bar{x}, \frac{S}{n(n - d)}\Big) \qquad (238)$$
$$p(\Lambda|D) = Wi_{n-d}(\Lambda|S) \qquad (239)$$

and the posterior predictive becomes

$$p(x|D) = t_{n-d}\Big(\bar{x}, \frac{S(n + 1)}{n(n - d)}\Big) \qquad (240)$$

9 Normal-Inverse-Wishart prior

The multivariate analog of the normal inverse chi-squared (NIX) distribution is the normal inverse Wishart (NIW) (see also [GCSR04, p85]).

9.1 Likelihood

The likelihood is

$$p(D|\mu, \Sigma) \propto |\Sigma|^{-\frac{n}{2}}\exp\Big(-\frac{1}{2}\sum_{i=1}^n (x_i - \mu)^T\Sigma^{-1}(x_i - \mu)\Big) \qquad (241)$$
$$= |\Sigma|^{-\frac{n}{2}}\exp\Big(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S)\Big) \qquad (242)$$

where $S$ is the matrix of sum of squares (scatter matrix)

$$S = \sum_{i=1}^N (x_i - \bar{x})(x_i - \bar{x})^T \qquad (244)$$

9.2 Prior

The natural conjugate prior is normal-inverse-wishart:

$$\Sigma \sim IW_{\nu_0}(\Lambda_0^{-1}) \qquad (245)$$
$$\mu|\Sigma \sim N(\mu_0, \Sigma/\kappa_0) \qquad (246)$$
$$p(\mu, \Sigma) \stackrel{\mathrm{def}}{=} NIW(\mu_0, \kappa_0, \Lambda_0, \nu_0) \qquad (247)$$
$$= \frac{1}{Z}|\Sigma|^{-((\nu_0 + d)/2 + 1)}\exp\Big(-\frac{1}{2}\mathrm{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T\Sigma^{-1}(\mu - \mu_0)\Big) \qquad (248)$$
$$Z = \frac{2^{\nu_0 d/2}\,\Gamma_d(\nu_0/2)\,(2\pi/\kappa_0)^{d/2}}{|\Lambda_0|^{\nu_0/2}} \qquad (249)$$

9.3 Posterior

The posterior is

$$p(\mu, \Sigma|D, \mu_0, \kappa_0, \Lambda_0, \nu_0) = NIW(\mu, \Sigma|\mu_n, \kappa_n, \Lambda_n, \nu_n) \qquad (250)$$
$$\mu_n = \frac{\kappa_0\mu_0 + n\bar{x}}{\kappa_n} = \frac{\kappa_0}{\kappa_0 + n}\mu_0 + \frac{n}{\kappa_0 + n}\bar{x} \qquad (251)$$
$$\kappa_n = \kappa_0 + n \qquad (252)$$
$$\nu_n = \nu_0 + n \qquad (253)$$
$$\Lambda_n = \Lambda_0 + S + \frac{\kappa_0 n}{\kappa_0 + n}(\bar{x} - \mu_0)(\bar{x} - \mu_0)^T \qquad (254)$$

The marginals are

$$\Sigma|D \sim IW(\Lambda_n^{-1}, \nu_n) \qquad (255)$$
$$\mu|D \sim t_{\nu_n - d + 1}\Big(\mu_n, \frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)}\Big) \qquad (256)$$

To see the connection with the scalar case, note that $\Lambda_n$ plays the role of $\nu_n\sigma_n^2$ (the posterior sum of squares), so for $d = 1$

$$\frac{\Lambda_n}{\kappa_n(\nu_n - d + 1)} = \frac{\nu_n\sigma_n^2}{\kappa_n\nu_n} = \frac{\sigma_n^2}{\kappa_n} \qquad (257)$$
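A small sketch (ours, in Python) of the NIW update in Equations 250-254, with the implied posterior mean of $\Sigma$ under the inverse-Wishart marginal as a sanity check; names and data are illustrative.

```python
import numpy as np

def niw_posterior(X, mu0, kappa0, Lambda0, nu0):
    """NIW posterior hyper-parameters of Equations 250-254."""
    X = np.atleast_2d(X)
    n, xbar = X.shape[0], X.mean(axis=0)
    S = (X - xbar).T @ (X - xbar)
    kappa_n = kappa0 + n
    nu_n = nu0 + n
    mu_n = (kappa0 * mu0 + n * xbar) / kappa_n
    Lambda_n = Lambda0 + S + kappa0 * n / kappa_n * np.outer(xbar - mu0, xbar - mu0)
    return mu_n, kappa_n, Lambda_n, nu_n

rng = np.random.default_rng(8)
X = rng.multivariate_normal([1.0, 2.0], [[1.0, 0.3], [0.3, 2.0]], size=200)
mu_n, kappa_n, Lambda_n, nu_n = niw_posterior(X, np.zeros(2), 1.0, np.eye(2), 4.0)
print(mu_n)
print(Lambda_n / (nu_n - X.shape[1] - 1))   # posterior mean of Sigma under the IW marginal
```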

9.4 Posterior predictive

$$p(x|D) = t_{\nu_n - d + 1}\Big(\mu_n, \frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)}\Big) \qquad (258)$$

To see the connection with the scalar case, note that

$$\frac{\Lambda_n(\kappa_n + 1)}{\kappa_n(\nu_n - d + 1)} = \frac{\nu_n\sigma_n^2(\kappa_n + 1)}{\kappa_n\nu_n} = \frac{\sigma_n^2(\kappa_n + 1)}{\kappa_n} \qquad (259)$$

9.5 Marginal likelihood

The posterior is given by

$$p(\mu, \Sigma|D) = \frac{1}{p(D)}\frac{1}{Z_0}NIW'(\mu, \Sigma|\alpha_0)\frac{1}{(2\pi)^{nd/2}}N'(D|\mu, \Sigma) \qquad (260)$$
$$= \frac{1}{Z_n}NIW'(\mu, \Sigma|\alpha_n) \qquad (261)$$

where

$$NIW'(\mu, \Sigma|\alpha_0) = |\Sigma|^{-((\nu_0 + d)/2 + 1)}\exp\Big(-\frac{1}{2}\mathrm{tr}(\Lambda_0\Sigma^{-1}) - \frac{\kappa_0}{2}(\mu - \mu_0)^T\Sigma^{-1}(\mu - \mu_0)\Big) \qquad (262)$$
$$N'(D|\mu, \Sigma) = |\Sigma|^{-\frac{n}{2}}\exp\Big(-\frac{1}{2}\mathrm{tr}(\Sigma^{-1}S)\Big) \qquad (263)$$

is the unnormalized prior and likelihood (here $\alpha_0$ collects the prior hyper-parameters). Hence

$$p(D) = \frac{Z_n}{Z_0}\frac{1}{(2\pi)^{nd/2}} = \frac{2^{\nu_n d/2}\Gamma_d(\nu_n/2)(2\pi/\kappa_n)^{d/2}|\Lambda_n|^{-\nu_n/2}}{2^{\nu_0 d/2}\Gamma_d(\nu_0/2)(2\pi/\kappa_0)^{d/2}|\Lambda_0|^{-\nu_0/2}}\frac{1}{(2\pi)^{nd/2}} \qquad (264)$$
$$= \frac{1}{(2\pi)^{nd/2}}\, 2^{nd/2}\Big(\frac{\kappa_0}{\kappa_n}\Big)^{d/2}\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}} \qquad (265)$$
$$= \frac{1}{\pi^{nd/2}}\frac{\Gamma_d(\nu_n/2)}{\Gamma_d(\nu_0/2)}\frac{|\Lambda_0|^{\nu_0/2}}{|\Lambda_n|^{\nu_n/2}}\Big(\frac{\kappa_0}{\kappa_n}\Big)^{d/2} \qquad (266)$$

This reduces to Equation 171 if $d = 1$.
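Equation 266 can be evaluated in log space with scipy's multivariate gamma function. A minimal sketch (ours); the numeric hyper-parameters in the example are made up for illustration (e.g. values like those produced by the niw_posterior sketch above).

```python
import numpy as np
from scipy.special import multigammaln

def niw_log_marginal_likelihood(n, d, kappa0, nu0, Lambda0, kappa_n, nu_n, Lambda_n):
    """Log of Equation 266, using the multivariate gamma function Gamma_d."""
    return (-0.5 * n * d * np.log(np.pi)
            + multigammaln(nu_n / 2.0, d) - multigammaln(nu0 / 2.0, d)
            + 0.5 * nu0 * np.linalg.slogdet(Lambda0)[1]
            - 0.5 * nu_n * np.linalg.slogdet(Lambda_n)[1]
            + 0.5 * d * (np.log(kappa0) - np.log(kappa_n)))

print(niw_log_marginal_likelihood(n=200, d=2, kappa0=1.0, nu0=4.0, Lambda0=np.eye(2),
                                  kappa_n=201.0, nu_n=204.0,
                                  Lambda_n=np.array([[210.0, 60.0], [60.0, 420.0]])))
```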

9.6 Reference analysis

A noninformative (Jeffreys) prior is $p(\mu, \Sigma) \propto |\Sigma|^{-(d+1)/2}$, which is the limit of $\kappa_0 \to 0$, $\nu_0 \to -1$, $|\Lambda_0| \to 0$ [GCSR04, p88]. Then the posterior becomes

$$\mu_n = \bar{x} \qquad (267)$$
$$\kappa_n = n \qquad (268)$$
$$\nu_n = n - 1 \qquad (269)$$
$$\Lambda_n = S = \sum_i (x_i - \bar{x})(x_i - \bar{x})^T \qquad (270)$$
$$p(\Sigma|D) = IW_{n-1}(\Sigma|S) \qquad (271)$$
$$p(\mu|D) = t_{n-d}\Big(\mu\Big|\bar{x}, \frac{S}{n(n - d)}\Big) \qquad (272)$$
$$p(x|D) = t_{n-d}\Big(x\Big|\bar{x}, \frac{S(n + 1)}{n(n - d)}\Big) \qquad (273)$$

Note that [Min00] argues that Jeffreys' principle says the uninformative prior should be of the form

$$\lim_{k\to 0} N(\mu|\mu_0, \Sigma/k)\, IW_k(\Sigma|k\Sigma) \propto |2\pi\Sigma|^{-\frac{1}{2}}|\Sigma|^{-(d+1)/2} \propto |\Sigma|^{-(\frac{d}{2} + 1)} \qquad (274)$$

This can be achieved by setting $\nu_0 = 0$ instead of $\nu_0 = -1$.

Figure 5: Some $Ga(a, b)$ distributions ($b = 1$ on the left, $b = 3$ on the right, with $a \in \{0.5, 1.0, 1.5, 2.0, 5.0\}$). If $a \le 1$, the peak is at 0. As we increase $b$, we squeeze everything leftwards and upwards. Figures generated by gammaDistPlot2.

10 Appendix: some standard distributions

10.1 Gamma distribution

The gamma distribution is a flexible distribution for positive real valued rv's, $x > 0$. It is defined in terms of two parameters. There are two common parameterizations. This is the one used by Bishop [Bis06] (and many other authors):

$$Ga(x|\mathrm{shape} = a, \mathrm{rate} = b) = \frac{b^a}{\Gamma(a)}x^{a-1}e^{-xb},\quad x, a, b > 0 \qquad (275)$$

The second parameterization (and the one used by Matlab's gampdf) is

$$Ga(x|\mathrm{shape} = \alpha, \mathrm{scale} = \beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}x^{\alpha - 1}e^{-x/\beta} = Ga_{rate}(x|\alpha, 1/\beta) \qquad (276)$$

Note that the shape parameter controls the shape; the scale parameter merely defines the measurement scale (the horizontal axis). The rate parameter is just the inverse of the scale. See Figure 5 for some examples. This distribution has the following properties (using the rate parameterization):

$$\mathrm{mean} = \frac{a}{b} \qquad (277)$$
$$\mathrm{mode} = \frac{a - 1}{b}\ \mathrm{for}\ a \ge 1 \qquad (278)$$
$$\mathrm{var} = \frac{a}{b^2} \qquad (279)$$
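A quick check (ours, in Python) that the rate parameterization of Equation 275 matches scipy's shape/scale convention (Equation 276) when scale = 1/rate:

```python
import numpy as np
from scipy import stats
from scipy.special import gamma as gamma_fn

# Ga(x | shape=a, rate=b) from Equation 275, versus scipy's gamma with shape a, scale 1/b.
a, b, x = 2.0, 3.0, 1.25
pdf_manual = b ** a * x ** (a - 1) * np.exp(-b * x) / gamma_fn(a)
pdf_scipy = stats.gamma.pdf(x, a, scale=1.0 / b)
print(pdf_manual, pdf_scipy)     # identical up to floating point
print(a / b, a / b ** 2)         # mean and variance, Equations 277 and 279
```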

10.2 Inverse Gamma distribution

Let $X \sim Ga(\mathrm{shape} = a, \mathrm{rate} = b)$ and $Y = 1/X$. Then it is easy to show that $Y \sim IG(\mathrm{shape} = a, \mathrm{scale} = b)$, where the inverse Gamma distribution is given by

$$IG(x|\mathrm{shape} = a, \mathrm{scale} = b) = \frac{b^a}{\Gamma(a)}x^{-(a+1)}e^{-b/x},\quad x, a, b > 0 \qquad (280)$$

Figure 6: Some inverse gamma $IG(a, b)$ distributions ($a \in \{0.1, 1.0, 2.0\}$, $b \in \{1.0, 2.0\}$). These plots were produced by invchi2plot.

The distribution has these properties

$$\mathrm{mean} = \frac{b}{a - 1},\ a > 1 \qquad (281)$$
$$\mathrm{mode} = \frac{b}{a + 1} \qquad (282)$$
$$\mathrm{var} = \frac{b^2}{(a - 1)^2(a - 2)},\ a > 2 \qquad (283)$$

See Figure 6 for some plots. We see that increasing $b$ just stretches the horizontal axis, but increasing $a$ moves the peak up and closer to the left.

There is also another parameterization, using the rate (inverse scale):

$$IG(x|\mathrm{shape} = \alpha, \mathrm{rate} = \beta) = \frac{1}{\beta^\alpha\Gamma(\alpha)}x^{-(\alpha + 1)}e^{-1/(x\beta)},\quad x, \alpha, \beta > 0 \qquad (284)$$

10.3 Scaled Inverse-Chi-squared

The scaled inverse-chi-squared distribution is a reparameterization of the inverse Gamma [GCSR04, p575].

$$\chi^{-2}(x|\nu, \sigma^2) = \frac{1}{\Gamma(\nu/2)}\Big(\frac{\nu\sigma^2}{2}\Big)^{\nu/2}x^{-\frac{\nu}{2} - 1}\exp\Big[-\frac{\nu\sigma^2}{2x}\Big],\quad x > 0 \qquad (285)$$
$$= IG\Big(x\Big|\mathrm{shape} = \frac{\nu}{2},\ \mathrm{scale} = \frac{\nu\sigma^2}{2}\Big) \qquad (286)$$

where the parameter $\nu > 0$ is called the degrees of freedom, and $\sigma^2 > 0$ is the scale. See Figure 7 for some plots. We see that increasing $\nu$ lifts the curve up and moves it slightly to the right. Later, when we consider Bayesian parameter estimation, we will use this distribution as a conjugate prior for a scale parameter (such as the variance of a Gaussian); increasing $\nu$ corresponds to increasing the effective strength of the prior.

Figure 7: Some scaled inverse $\chi^2(\nu, \sigma^2)$ distributions ($\nu \in \{1, 5\}$, $\sigma^2 \in \{0.5, 1.0, 2.0\}$). These plots were produced by invchi2plot.

The distribution has these properties

$$\mathrm{mean} = \frac{\nu\sigma^2}{\nu - 2}\ \mathrm{for}\ \nu > 2 \qquad (287)$$
$$\mathrm{mode} = \frac{\nu\sigma^2}{\nu + 2} \qquad (288)$$
$$\mathrm{var} = \frac{2\nu^2\sigma^4}{(\nu - 2)^2(\nu - 4)}\ \mathrm{for}\ \nu > 4 \qquad (289)$$

The inverse chi-squared distribution, written $\chi^{-2}_\nu(x)$, is the special case where $\nu\sigma^2 = 1$ (i.e., $\sigma^2 = 1/\nu$). This corresponds to $IG(a = \nu/2, b = \mathrm{scale} = 1/2)$.

10.4 Wishart distribution

Let $X$ be a $p$-dimensional symmetric positive definite matrix. The Wishart is the multidimensional generalization of the Gamma. Since it is a distribution over matrices, it is hard to plot as a density function. However, we can easily sample from it, and then use the eigenvectors of the resulting matrix to define an ellipse. See Figure 8.

There are several possible parameterizations. Some authors (e.g., [Bis06, p693], [DeG70, p.57], [GCSR04, p574], wikipedia), as well as WinBUGS and Matlab (wishrnd), define the Wishart in terms of degrees of freedom $\nu \ge p$ and the scale matrix $S$ as follows:

$$Wi_\nu(X|S) = \frac{1}{Z}|X|^{(\nu - p - 1)/2}\exp\big[-\tfrac{1}{2}\mathrm{tr}(S^{-1}X)\big] \qquad (290)$$
$$Z = 2^{\nu p/2}\,\Gamma_p(\nu/2)\,|S|^{\nu/2} \qquad (291)$$

where $\Gamma_p(a)$ is the generalized gamma function

$$\Gamma_p(\alpha) = \pi^{p(p-1)/4}\prod_{i=1}^p \Gamma\Big(\frac{2\alpha + 1 - i}{2}\Big) \qquad (292)$$

(So $\Gamma_1(\alpha) = \Gamma(\alpha)$.) The mean and mode are given by (see also [Pre05])

$$\mathrm{mean} = \nu S \qquad (293)$$
$$\mathrm{mode} = (\nu - p - 1)S,\quad \nu > p + 1 \qquad (294)$$

Figure 8: Some samples from the Wishart distribution: Wishart(dof=2.0, S=[4 3; 3 4]) (left) and Wishart(dof=10.0, S=[4 3; 3 4]) (right). We see that if $\nu = 2$ (the smallest valid value in 2 dimensions), we often sample nearly singular matrices. As $\nu$ increases, we put more mass on the $S$ matrix. If $S = I_2$, the samples would look (on average) like circles. Generated by wishplot.

In 1D, this becomes $Ga(\mathrm{shape} = \nu/2, \mathrm{rate} = S/2)$. Note that if $X \sim Wi_\nu(S)$ and $Y = X^{-1}$, then $Y \sim IW_\nu(S^{-1})$ and $E[Y] = \frac{S^{-1}}{\nu - d - 1}$.

In [BS94, p.138], and the wishpdf in Tom Minka's lightspeed toolbox, they use the following parameterization

$$Wi(X|a, B) = \frac{|B|^a}{\Gamma_p(a)}|X|^{a - (p+1)/2}\exp\big[-\mathrm{tr}(BX)\big] \qquad (295)$$

We require that $B$ is a $p \times p$ symmetric positive definite matrix, and $2a > p - 1$. If $p = 1$, so $B$ is a scalar, this reduces to the $Ga(\mathrm{shape} = a, \mathrm{rate} = b)$ density.

To get some intuition for this distribution, recall that $\mathrm{tr}(AB)$ is the sum of the inner products of the rows of $A$ with the corresponding columns of $B$. In Matlab notation we have

trace(A*B) = sum([a(1,:)*b(:,1), ..., a(n,:)*b(:,n)])

If $X \sim Wi(S)$, then we are performing a kind of template matching between the columns of $X$ and $S^{-1}$ (recall that both $X$ and $S$ are symmetric). This is a natural way to define the distance between two matrices.

10.5 Inverse Wishart

This is the multidimensional generalization of the inverse Gamma. Consider a $d \times d$ positive definite (covariance) matrix $X$, a dof parameter $\nu > d - 1$ and a psd matrix $S$. Some authors (e.g. [GCSR04, p574]) use this parameterization:

$$IW_\nu(X|S^{-1}) = \frac{1}{Z}|X|^{-(\nu + d + 1)/2}\exp\Big(-\frac{1}{2}\mathrm{Tr}(S X^{-1})\Big) \qquad (296)$$
$$Z = \frac{|S|^{\nu/2}}{2^{\nu d/2}\Gamma_d(\nu/2)} \qquad (297)$$

where

$$\Gamma_d(\nu/2) = \pi^{d(d-1)/4}\prod_{i=1}^d \Gamma\Big(\frac{\nu + 1 - i}{2}\Big) \qquad (298)$$

The distribution has mean

$$E[X] = \frac{S}{\nu - d - 1} \qquad (299)$$

In Matlab, use iwishrnd. In the 1d case, we have

$$\chi^{-2}(\Sigma|\nu_0, \sigma_0^2) = IW_{\nu_0}(\Sigma|(\nu_0\sigma_0^2)^{-1}) \qquad (300)$$

Other authors (e.g., [Pre05, p117]) use a slightly different formulation (with $2d < \nu$)

$$IW_\nu(X|Q) = \Big[2^{(\nu - d - 1)d/2}\pi^{d(d-1)/4}\prod_{j=1}^d \Gamma\big((\nu - d - j)/2\big)\Big]^{-1} \qquad (301)$$
$$\quad\times\ |Q|^{(\nu - d - 1)/2}|X|^{-\nu/2}\exp\Big(-\frac{1}{2}\mathrm{Tr}(X^{-1}Q)\Big) \qquad (302)$$

which has mean

$$E[X] = \frac{Q}{\nu - 2d - 2} \qquad (303)$$

10.6 Student t distribution

The generalized t-distribution is given as

$$t_\nu(x|\mu, \sigma^2) = c\Big[1 + \frac{1}{\nu}\Big(\frac{x - \mu}{\sigma}\Big)^2\Big]^{-(\frac{\nu + 1}{2})} \qquad (304)$$
$$c = \frac{\Gamma(\nu/2 + 1/2)}{\Gamma(\nu/2)}\frac{1}{\sigma\sqrt{\nu\pi}} \qquad (305)$$

where $c$ is the normalization constant. $\mu$ is the mean, $\nu > 0$ is the degrees of freedom, and $\sigma^2 > 0$ is the scale. (Note that the $\nu$ parameter is often written as a subscript.) In Matlab, use tpdf.

The distribution has the following properties:

$$\mathrm{mean} = \mu,\ \nu > 1 \qquad (306)$$
$$\mathrm{mode} = \mu \qquad (307)$$
$$\mathrm{var} = \frac{\nu\sigma^2}{\nu - 2},\ \nu > 2 \qquad (308)$$

Note: if $x \sim t_\nu(\mu, \sigma^2)$, then

$$\frac{x - \mu}{\sigma} \sim t_\nu \qquad (309)$$

which corresponds to a standard t-distribution with $\mu = 0$, $\sigma^2 = 1$:

$$t_\nu(x) = \frac{\Gamma((\nu + 1)/2)}{\sqrt{\nu\pi}\,\Gamma(\nu/2)}\big(1 + x^2/\nu\big)^{-(\nu + 1)/2} \qquad (310)$$

In Figure 9, we plot the density for different parameter values. As $\nu \to \infty$, the T approaches a Gaussian. T-distributions are like Gaussian distributions with heavy tails. Hence they are more robust to outliers (see Figure 10).

If $\nu = 1$, this is called a Cauchy distribution. This is an interesting distribution since if $X \sim$ Cauchy, then $E[X]$ does not exist, since the corresponding integral diverges. Essentially this is because the tails are so heavy that samples from the distribution can get very far from the center $\mu$.

It can be shown that the t-distribution is like an infinite sum of Gaussians, where each Gaussian has a different precision:

$$p(x|\mu, a, b) = \int N(x|\mu, \tau^{-1})\, Ga(\tau|a, \mathrm{rate} = b)\, d\tau \qquad (311)$$
$$= t_{2a}(x|\mu, b/a) \qquad (312)$$

(See exercise 2.46 of [Bis06].)
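Equations 311-312 are easy to verify numerically. The following sketch (ours, in Python) integrates the Gaussian-times-Gamma mixture by quadrature and compares it to the closed-form Student t; the parameter values are arbitrary.

```python
import numpy as np
from scipy import integrate, stats

def t_as_gaussian_mixture(x, mu, a, b):
    """Numerically evaluate Equation 311: integrate N(x | mu, 1/tau) Ga(tau | a, rate=b) over tau."""
    integrand = lambda tau: stats.norm.pdf(x, mu, 1.0 / np.sqrt(tau)) \
                            * stats.gamma.pdf(tau, a, scale=1.0 / b)
    val, _ = integrate.quad(integrand, 0, np.inf)
    return val

x, mu, a, b = 1.3, 0.0, 2.5, 1.5
lhs = t_as_gaussian_mixture(x, mu, a, b)
rhs = stats.t.pdf(x, df=2 * a, loc=mu, scale=np.sqrt(b / a))   # Equation 312
print(lhs, rhs)   # the two should agree closely
```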

Figure 9: Student t-distributions $T(\mu, \sigma^2, \nu)$ for $\mu = 0$, showing $t(\nu = 0.1)$, $t(\nu = 1.0)$, $t(\nu = 5.0)$ and $N(0, 1)$. The effect of $\sigma$ is just to scale the horizontal axis. As $\nu \to \infty$, the distribution approaches a Gaussian. See studentTplot.

Figure 10: Fitting a Gaussian and a Student distribution to some data (left) and to some data with outliers (right). The Student distribution (red) is much less affected by outliers than the Gaussian (green). Source: [Bis06] Figure 2.16.

Figure 11: Left: T distribution in 2d with dof = 2 and $\Sigma = 0.1 I_2$. Right: Gaussian density with $\Sigma = 0.1 I_2$ and $\mu = (0, 0)$; we see it goes to zero faster. Produced by multivarTplot.

10.7 Multivariate t distributions

The multivariate T distribution in $d$ dimensions is given by

$$t_\nu(x|\mu, \Sigma) = \frac{\Gamma(\nu/2 + d/2)}{\Gamma(\nu/2)}\frac{|\Sigma|^{-1/2}}{\nu^{d/2}\pi^{d/2}}\Big[1 + \frac{1}{\nu}(x - \mu)^T\Sigma^{-1}(x - \mu)\Big]^{-(\frac{\nu + d}{2})} \qquad (313)$$

where $\Sigma$ is called the scale matrix (since it is not exactly the covariance matrix). This has fatter tails than a Gaussian: see Figure 11. In Matlab, use mvtpdf.

The distribution has the following properties

$$E[x] = \mu\ \mathrm{if}\ \nu > 1 \qquad (315)$$
$$\mathrm{mode}[x] = \mu \qquad (316)$$
$$\mathrm{Cov}[x] = \frac{\nu}{\nu - 2}\Sigma\ \mathrm{for}\ \nu > 2 \qquad (317)$$

(The following results are from [Koo03, p328].) Suppose $Y \sim T(\mu, \Sigma, \nu)$ and we partition the variables into 2 blocks. Then the marginals are

$$Y_i \sim T(\mu_i, \Sigma_{ii}, \nu) \qquad (318)$$

and the conditionals are

$$Y_1|y_2 \sim T(\mu_{1|2}, \Sigma_{1|2}, \nu + d_1) \qquad (319)$$
$$\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(y_2 - \mu_2) \qquad (320)$$
$$\Sigma_{1|2} = h_{1|2}\big(\Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{12}^T\big) \qquad (321)$$
$$h_{1|2} = \frac{1}{\nu + d_2}\big[\nu + (y_2 - \mu_2)^T\Sigma_{22}^{-1}(y_2 - \mu_2)\big] \qquad (322)$$

We can also show that linear combinations of Ts are Ts:

$$Y \sim T(\mu, \Sigma, \nu) \Rightarrow AY \sim T(A\mu, A\Sigma A^T, \nu) \qquad (323)$$

We can sample from $y \sim T(\mu, \Sigma, \nu)$ by sampling $x \sim T(0, 1, \nu)$ and then transforming $y = \mu + R^T x$, where $R = \mathrm{chol}(\Sigma)$, so $R^T R = \Sigma$.
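A minimal sampler (ours, in Python) following the recipe above. Two implementation details are our own assumptions: the standard multivariate t draw is constructed as a normal vector divided by the square root of a scaled chi-squared variate, and NumPy's Cholesky factor is lower triangular ($L L^T = \Sigma$), so samples are transformed by $L$ rather than $R^T$.

```python
import numpy as np

def sample_multivariate_t(mu, Sigma, nu, size, rng):
    """Draw samples from t_nu(mu, Sigma) by scaling standard-t draws with chol(Sigma)."""
    mu = np.asarray(mu, dtype=float)
    d = mu.size
    L = np.linalg.cholesky(Sigma)            # L @ L.T = Sigma
    z = rng.standard_normal((size, d))
    g = rng.chisquare(nu, size=(size, 1))
    x = z / np.sqrt(g / nu)                  # standard multivariate t draws
    return mu + x @ L.T

rng = np.random.default_rng(9)
Y = sample_multivariate_t([0.0, 0.0], [[1.0, 0.5], [0.5, 2.0]], nu=5.0, size=20000, rng=rng)
print(Y.mean(axis=0))                        # close to mu
print(np.cov(Y.T) * (5.0 - 2.0) / 5.0)       # close to Sigma, since Cov = nu/(nu-2) * Sigma
```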

References

[Bis06] C. Bishop. Pattern recognition and machine learning. Springer, 2006.

[BL01] P. Baldi and A. Long. A Bayesian framework for the analysis of microarray expression data: regularized t-test and statistical inferences of gene changes. Bioinformatics, 17(6):509-519, 2001.

[BS94] J. Bernardo and A. Smith. Bayesian Theory. John Wiley, 1994.

[DeG70] M. DeGroot. Optimal Statistical Decisions. McGraw-Hill, 1970.

[DHMS02] D. Denison, C. Holmes, B. Mallick, and A. Smith. Bayesian methods for nonlinear classification and regression. Wiley, 2002.

[DMP+06] F. Demichelis, P. Magni, P. Piergiorgi, M. Rubin, and R. Bellazzi. A hierarchical Naive Bayes model for handling sample heterogeneity in classification problems: an application to tissue microarrays. BMC Bioinformatics, 7:514, 2006.

[GCSR04] A. Gelman, J. Carlin, H. Stern, and D. Rubin. Bayesian data analysis. Chapman and Hall, 2004. 2nd edition.

[GH94] D. Geiger and D. Heckerman. Learning Gaussian networks. Technical Report MSR-TR-94-10, Microsoft Research, 1994.

[Koo03] Gary Koop. Bayesian econometrics. Wiley, 2003.

[Lee04] Peter Lee. Bayesian statistics: an introduction. Arnold Publishing, 2004. Third edition.

[Min00] T. Minka. Inferring a Gaussian distribution. Technical report, MIT, 2000.

[Pre05] S. J. Press. Applied multivariate analysis, using Bayesian and frequentist methods of inference. Dover, 2005. Second edition.