Introduction to MCMC Methods, the Gibbs Sampler, and Data Augmentation


1

Introduction to MCMC methods, the Gibbs Sampler, and Data Augmentation

2

Simulation Methods


Problem:

Bayes theorem allows us to write down an unnormalized density that is proportional to the posterior for virtually any model. The challenge is to construct a simulator for that posterior, often with dim(θ) > 100.

Solutions:

1. direct iid simulator (use asymptotics)

2. Importance sampling

3. MCMC

3

MCMC Methods

“solution”

exploit special structure of the problem to:

formulate a Markov Chain on the parameter space with π (the posterior) as its long-run or "equilibrium" distribution

simulate from the MC, starting from some point

use a sub-sequence of the draws as the simulator

4

MCMC Methods

Start from $\theta^0$.

Construct a sequence of random variables $\theta^1, \theta^2, \ldots, \theta^r, \ldots$

Markovian property: $p(\theta^r \mid \theta^{r-1}, \theta^{r-2}, \ldots, \theta^0) = p(\theta^r \mid \theta^{r-1})$, i.e. $\theta^r \mid \theta^{r-1} \sim F$

Under some conditions on F, the distribution of $\theta^r \mid \theta^0$ "converges" to $\pi$.

5

Ergodicity

We wish to estimate $E_\pi[g(\theta)] = \int g(\theta)\,\pi(\theta)\,d\theta$. The ergodic property says that averages over the chain converge to the corresponding quantities under $\pi$:

i) $\hat{p}_R(A) = \frac{1}{R}\sum_{r=1}^{R} I(\theta^r \in A) \;\rightarrow\; \Pr(\theta \in A) = \int_A \pi(\theta)\,d\theta$

ii) $\hat{g}_R = \frac{1}{R}\sum_{r=1}^{R} g(\theta^r) \;\rightarrow\; E_\pi[g(\theta)]$

This means that we can estimate any aspect of the joint distribution using sequences of draws from MC.

Denote the sequence of draws as:

$\theta^1, \ldots, \theta^r, \ldots$

6

Practical Considerations

Effect of initial conditions:

"burn-in" -- run for B iterations, discard them, and use only the last R-B draws (see the short sketch at the end of this slide)

Non-iid simulator -- is this a problem?

no: the LLN works for dependent sequences

yes: the simulation error is larger than for an iid sequence of the same length

We still need a method for constructing the chain!
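A minimal sketch (in R, with a hypothetical matrix of draws) of the burn-in convention above: discard the first B iterations and form ergodic averages from the remaining R-B draws.

# Hypothetical example: `draws` is an R x p matrix of MCMC draws (one row per iteration).
posterior_means <- function(draws, B) {
  draws <- as.matrix(draws)
  colMeans(draws[-(1:B), , drop = FALSE])   # ergodic averages of the last R - B draws
}
# e.g. posterior_means(out$betadraw, B = 100) for the regression example later in these slides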

7

Asymptotics

Any simulation-based method relies on asymptotics for justification.

We have made fun of asymptotics for inference problems. Classical Econometrics – “approximate answer” to the wrong question.

Here we are not using asymptotics to approximate an answer at a fixed sample size.

The simulation sample size is large and under our control!

8

Simulating from Bivariate Normal

$\theta = (\theta_1, \theta_2)' \sim N(0, \Sigma), \qquad \Sigma = \begin{pmatrix} 1 & \rho \\ \rho & 1 \end{pmatrix}$

$\theta_1 \sim N(0,1)$ and $\theta_2 \mid \theta_1 \sim N(\rho\theta_1,\; 1-\rho^2)$

In R, we would use the Cholesky root to simulate:

$\theta = Lz; \quad z \sim N(0, I_2), \quad L = \begin{pmatrix} 1 & 0 \\ \rho & \sqrt{1-\rho^2} \end{pmatrix}$

so that $\theta_1 = z_1$ and $\theta_2 = \rho z_1 + \sqrt{1-\rho^2}\, z_2$.
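As a concrete illustration, here is a minimal R sketch of this Cholesky-root simulation (rho = 0.9 and the number of draws are assumed example values):

rho <- 0.9
L <- matrix(c(1, rho, 0, sqrt(1 - rho^2)), nrow = 2)   # lower-triangular Cholesky root of Sigma
z <- matrix(rnorm(2 * 1000), nrow = 2)                 # 1000 iid N(0, I_2) draws, one per column
theta <- L %*% z                                       # each column is a draw from N(0, Sigma)
cor(t(theta))                                          # sample correlation should be close to rho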

9

Gibbs Sampler

A joint distribution can always be factored into a marginal × a conditional. There is also a sense in which the conditional distributions fully summarize the joint.

$\theta_2 \mid \theta_1 \sim N(\rho\theta_1,\; 1-\rho^2) \qquad \theta_1 \mid \theta_2 \sim N(\rho\theta_2,\; 1-\rho^2)$

A simulator: Start at point $\theta^0 = (\theta_1^0, \theta_2^0)$.

Draw in two steps:

$\theta_2^1 \mid \theta_1^0 \sim N(\rho\theta_1^0,\; 1-\rho^2)$

$\theta_1^1 \mid \theta_2^1 \sim N(\rho\theta_2^1,\; 1-\rho^2)$

Note: this is a Markov Chain. Current point entirely summarizes past.

10

Gibbs Sampler

A simulator:

Start at point $\theta^0 = (\theta_1^0, \theta_2^0)$.

Draw in two steps:

$\theta_2^1 \mid \theta_1^0 \sim N(\rho\theta_1^0,\; 1-\rho^2)$

$\theta_1^1 \mid \theta_2^1 \sim N(\rho\theta_2^1,\; 1-\rho^2)$

repeat!
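A minimal R sketch of this two-step Gibbs sampler (rho, R, and the starting point are assumed example values; the rbiNormGibbs function on the following slides animates the same scheme):

rho <- 0.9; R <- 1000
csd <- sqrt(1 - rho^2)                 # conditional standard deviation
theta <- matrix(0, nrow = R, ncol = 2)
th1 <- 0; th2 <- 0                     # starting point theta^0
for (r in 1:R) {
  th2 <- rnorm(1, mean = rho * th1, sd = csd)   # draw theta2 | theta1
  th1 <- rnorm(1, mean = rho * th2, sd = csd)   # draw theta1 | theta2
  theta[r, ] <- c(th1, th2)
}
colMeans(theta); cor(theta)[1, 2]      # ergodic averages: near 0 and near rho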

11

Hammersley-Clifford Theorem

Existence of the GS for a bivariate distribution implies that the complete set of conditionals summarizes all the information in the joint.

H-C Construction:

$p(\theta_1, \theta_2) = \dfrac{p(\theta_2 \mid \theta_1)}{\displaystyle\int \frac{p(\theta_2 \mid \theta_1)}{p(\theta_1 \mid \theta_2)}\, d\theta_2}$

Why?

12

Hammersley-Clifford Theorem

$\displaystyle\int \frac{p(\theta_2 \mid \theta_1)}{p(\theta_1 \mid \theta_2)}\, d\theta_2 = \int \frac{p(\theta_1, \theta_2)}{p(\theta_1)} \cdot \frac{p(\theta_2)}{p(\theta_1, \theta_2)}\, d\theta_2 = \int \frac{p(\theta_2)}{p(\theta_1)}\, d\theta_2 = \frac{1}{p(\theta_1)}$

Therefore

$\dfrac{p(\theta_2 \mid \theta_1)}{\displaystyle\int \frac{p(\theta_2 \mid \theta_1)}{p(\theta_1 \mid \theta_2)}\, d\theta_2} = p(\theta_2 \mid \theta_1)\, p(\theta_1) = p(\theta_1, \theta_2)$

13

rbiNormGibbs

[Figure: rbiNormGibbs demo, four snapshots of "Gibbs Sampler with Intermediate Moves: Rho = 0.9", plotting theta1 against theta2 on (-3, 3) axes and showing the one-coordinate-at-a-time moves of the chain from the starting point B.]

14

Intuition for dependence

This is a Markov Chain!

Average step "size" is governed by the conditional standard deviation, $\sqrt{1-\rho^2}$.

15

rbiNormGibbs

[Figure: autocorrelation functions of the rbiNormGibbs draws out to lag 20: ACF of Series 1 (theta1), ACF of Series 2 (theta2), and the cross-correlations Series 1 & Series 2.]

non-iid draws!

Who cares?

Loss of Efficiency

16

Ergodicity

[Figure: ACF of theta1 and convergence of the sample correlation over the first 1000 draws, for iid draws vs. Gibbs Sampler draws.]

$\hat{\rho}_r = \dfrac{\sum_{i=1}^{r}(\theta_1^i - \bar{\theta}_1)(\theta_2^i - \bar{\theta}_2)}{\sqrt{\sum_{i=1}^{r}(\theta_1^i - \bar{\theta}_1)^2\,\sum_{i=1}^{r}(\theta_2^i - \bar{\theta}_2)^2}}$

17

Relative Numerical Efficiency

Draws from the Gibbs Sampler come from a stationary yet autocorrelated process. We can compute the sampling error of averages of these draws.

Assume we wish to estimate $E_\pi[g(\theta)]$.

We would use: $\hat{g} = \frac{1}{R}\sum_{r=1}^{R} g(\theta^r)$

$\mathrm{var}(\hat{g}) = \frac{1}{R^2}\,\iota'\begin{pmatrix} \mathrm{var}(g_1) & \mathrm{cov}(g_1,g_2) & \cdots & \mathrm{cov}(g_1,g_R) \\ \mathrm{cov}(g_2,g_1) & \mathrm{var}(g_2) & & \vdots \\ \vdots & & \ddots & \\ \mathrm{cov}(g_R,g_1) & \cdots & & \mathrm{var}(g_R) \end{pmatrix}\iota, \qquad g_r = g(\theta^r)$

18

Relative Numerical Efficiency

$\mathrm{var}(\hat{g}) = \dfrac{\mathrm{var}(g)}{R}\left[1 + 2\sum_{j=1}^{R-1}\frac{R-j}{R}\,\rho_j\right] = \dfrac{\mathrm{var}(g)}{R}\, f$

f is the ratio of this variance to the variance if the draws were iid.

$\hat{f} = 1 + 2\sum_{j=1}^{m}\hat{\rho}_j$

Here we truncate the lag at m. Choice of m?

numEff in bayesm
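A minimal sketch (assuming a vector of draws and an arbitrary truncation lag m) of computing the f factor from the sample ACF; bayesm's numEff packages this kind of calculation with its own choice of m:

rne_factor <- function(x, m = 20) {
  rho_hat <- acf(x, lag.max = m, plot = FALSE)$acf[-1]   # sample autocorrelations at lags 1..m
  1 + 2 * sum(rho_hat)                                   # variance ratio relative to iid draws
}
# e.g. rne_factor(theta[, 1]) for the bivariate Gibbs draws sketched earlier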

19

General Gibbs sampler

$\theta' = (\theta_1, \theta_2, \ldots, \theta_p)$

Sample from:
$\theta_{1,1} \sim f_1(\theta_1 \mid \theta_{0,2}, \ldots, \theta_{0,p})$
$\theta_{1,2} \sim f_2(\theta_2 \mid \theta_{1,1}, \theta_{0,3}, \ldots, \theta_{0,p})$
$\;\;\vdots$
$\theta_{1,p} \sim f_p(\theta_p \mid \theta_{1,1}, \ldots, \theta_{1,p-1})$

to obtain the first iterate,

where $f_i(\theta_i \mid \theta_{-i}) = \pi(\theta)\big/\!\int \pi(\theta)\,d\theta_i$ is the full conditional and

$\theta_{-i} = (\theta_1, \ldots, \theta_{i-1}, \theta_{i+1}, \ldots, \theta_p)$

"Blocking": each $\theta_i$ may itself be a vector (block) of parameters drawn together from its full conditional.

20

Different prior for Bayes Regression

Suppose the prior for β does not depend on σ²: p(β, σ²) = p(β) p(σ²). That is, prior beliefs about β do not depend on σ². Why should views about β depend on the scale of the error terms? That is only true for data-based prior information, NOT for subject-matter information!

$p(\beta) \propto \exp\!\left\{-\tfrac{1}{2}(\beta - \bar{\beta})'A(\beta - \bar{\beta})\right\}$

$p(\sigma^2) \propto (\sigma^2)^{-(\nu_0/2 + 1)} \exp\!\left\{-\dfrac{\nu_0 s_0^2}{2\sigma^2}\right\}$

21

Different posterior

The conditional posterior for σ² now depends on β:

$[\beta \mid y, X, \sigma^2] \sim N\!\left(\tilde{\beta},\; (\sigma^{-2}X'X + A)^{-1}\right)$

with $\tilde{\beta} = (\sigma^{-2}X'X + A)^{-1}(\sigma^{-2}X'X\hat{\beta} + A\bar{\beta})$, $\quad \hat{\beta} = (X'X)^{-1}X'y$

$[\sigma^2 \mid y, X, \beta] \sim \dfrac{\nu_1 s_1^2}{\chi^2_{\nu_1}}$ with $\nu_1 = \nu_0 + n$, $\quad s_1^2 = \dfrac{\nu_0 s_0^2 + (y - X\beta)'(y - X\beta)}{\nu_0 + n}$

Depends on β!

22

Different simulation strategy

Scheme: $[y \mid X, \beta, \sigma^2]\,[\beta]\,[\sigma^2]$

1) Draw $[\beta \mid y, X, \sigma^2]$

2) Draw $[\sigma^2 \mid y, X, \beta]$ (conditional on β!)

3) Repeat

23

runiregGibbs

runiregGibbs=function(Data,Prior,Mcmc){
#
# Purpose:
#   perform Gibbs iterations for Univ Regression Model using
#   prior with beta, sigma-sq indep
#
# Arguments:
#   Data  -- list of data
#      y, X
#   Prior -- list of prior hyperparameters
#      betabar, A    prior mean, prior precision
#      nu, ssq       prior on sigmasq
#   Mcmc  -- list of MCMC parms
#      sigmasq = initial value for sigmasq
#      R = number of draws
#      keep -- thinning parameter
#
# Output:
#   list of beta, sigmasq draws
#

24

runiregGibbs (continued)

# Model:
#   y = Xbeta + e    e ~ N(0,sigmasq)
#   y is n x 1
#   X is n x k
#   beta is k x 1 vector of coefficients
#
# Priors:  beta ~ N(betabar,A^-1)
#          sigmasq ~ (nu*ssq)/chisq_nu
#
# check arguments
#
. . .
sigmasqdraw=double(floor(Mcmc$R/keep))
betadraw=matrix(double(floor(Mcmc$R*nvar/keep)),ncol=nvar)
XpX=crossprod(X)
Xpy=crossprod(X,y)
sigmasq=as.vector(sigmasq)

itime=proc.time()[3]
cat("MCMC Iteration (est time to end - min) ",fill=TRUE)
flush()

25

runiregGibbs (continued)

for (rep in 1:Mcmc$R){
#
#   first draw beta | sigmasq
#
    IR=backsolve(chol(XpX/sigmasq+A),diag(nvar))
    btilde=crossprod(t(IR))%*%(Xpy/sigmasq+A%*%betabar)
    beta = btilde + IR%*%rnorm(nvar)
#
#   now draw sigmasq | beta
#
    res=y-X%*%beta
    s=t(res)%*%res
    sigmasq=(nu*ssq + s)/rchisq(1,nu+nobs)
    sigmasq=as.vector(sigmasq)

26

runiregGibbs (continued)

#
#   print time to completion and draw # every 100th draw
#
    if(rep%%100 == 0)
      {ctime=proc.time()[3]
       timetoend=((ctime-itime)/rep)*(R-rep)
       cat(" ",rep," (",round(timetoend/60,1),")",fill=TRUE)
       flush()}

    if(rep%%keep == 0)
      {mkeep=rep/keep; betadraw[mkeep,]=beta; sigmasqdraw[mkeep]=sigmasq}
}
ctime = proc.time()[3]
cat(' Total Time Elapsed: ',round((ctime-itime)/60,2),'\n')

list(betadraw=betadraw,sigmasqdraw=sigmasqdraw)
}

27

R session

set.seed(66)
n=100
X=cbind(rep(1,n),runif(n),runif(n),runif(n))
beta=c(1,2,3,4)
sigsq=1.0
y=X%*%beta+rnorm(n,sd=sqrt(sigsq))

A=diag(c(.05,.05,.05,.05))
betabar=c(0,0,0,0)
nu=3
ssq=1.0

R=1000

Data=list(y=y,X=X)
Prior=list(A=A,betabar=betabar,nu=nu,ssq=ssq)
Mcmc=list(R=R,keep=1)

out=runiregGibbs(Data=Data,Prior=Prior,Mcmc=Mcmc)

28

R session (continued)

Starting Gibbs Sampler for Univariate Regression Model with 100 observations

Prior Parms:
betabar
[1] 0 0 0 0
A
     [,1] [,2] [,3] [,4]
[1,] 0.05 0.00 0.00 0.00
[2,] 0.00 0.05 0.00 0.00
[3,] 0.00 0.00 0.05 0.00
[4,] 0.00 0.00 0.00 0.05
nu = 3  ssq= 1

MCMC parms:
R= 1000  keep= 1

29

R session (continued)

MCMC Iteration (est time to end - min)
  100 ( 0 )
  200 ( 0 )
  300 ( 0 )
  400 ( 0 )
  500 ( 0 )
  600 ( 0 )
  700 ( 0 )
  800 ( 0 )
  900 ( 0 )
 1000 ( 0 )
 Total Time Elapsed: 0.01

30

[Figure: trace plot "Draws of Beta" (out$betadraw) over 1000 iterations, values ranging roughly 0 to 6.]

31

[Figure: trace plot "Draws of Sigma Squared" (out$sigmasqdraw) over 1000 iterations, values ranging roughly 0.8 to 1.6.]

32

Data Augmentation

GS is well-suited for linear models. Extends to conditionally conjugate models, e.g. SUR.

Data Augmentation extends class of models which can be analyzed via GS.

origins: missing data

traditional approach:

$y = (y_{obs},\, y_{missing})$

$p(\theta \mid y_{obs}) \propto p(y_{obs} \mid \theta)\, p(\theta) = \left[\int p(y_{obs}, y_{miss} \mid \theta)\, dy_{miss}\right] p(\theta)$

33

Data Augmentation

Solution: regard ymiss as what it is: an unobservable! Tanner and Wong (87)

GS:

$p(\theta, y_{miss} \mid y_{obs}) \propto p(y_{obs}, y_{miss} \mid \theta)\, p(\theta)$

GS:

$[\theta \mid y_{obs}, y_{miss}] \propto p(y_{obs}, y_{miss} \mid \theta)\, p(\theta)$  -- the complete data posterior!

$[y_{miss} \mid y_{obs}, \theta]$  -- "impute" $y_{miss}$ under the "ignorable" missing data assumption

34

Data Augmentation-Probit Ex

Consider the Binary Probit model:

$y_i = \begin{cases} 1 & \text{if } z_i > 0 \\ 0 & \text{otherwise} \end{cases}$

$z_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0,1)$

z is a latent, unobserved variable.

$p(y \mid x, \beta) = \int p(y, z \mid x, \beta)\, dz = \int p(y \mid z)\, p(z \mid x, \beta)\, dz$

$\Pr(y = 1) = \int_0^{\infty} p(z \mid x, \beta)\, dz = \Pr(\varepsilon > -x'\beta) = \Phi(x'\beta)$

$\Pr(y = 0) = \Phi(-x'\beta)$

Integrate out z to obtain the likelihood.

35

Data augmentation

All unobservables are objects of inference, including parameters and latent variables. Augment β with z.

For Probit, we desire the joint posterior of latents and β.

$p(z, \beta \mid y) \propto p(y \mid z, \beta)\, p(z \mid \beta)\, p(\beta) = p(y \mid z)\, p(z \mid \beta)\, p(\beta)$

Given z, y is conditionally independent of β (y is determined by the sign of z).

Gibbs Sampler:

$[\beta \mid z, y] = [\beta \mid z]$

$[z \mid \beta, y]$

36

Probit conditional distributions

[z|β, y]

This is a truncated normal distribution:

if y = 1, truncation is from below at 0 (z > 0; z = x'β + ε, so ε > -x'β)

if y = 0, truncation is from above at 0

How do we make these draws? We use the inverse CDF method.

37

Inverse cdf

If X ~ F and U ~ Uniform[0,1], then F⁻¹(U) has the same distribution as X.

[Figure: cdf F(x) rising from 0 to 1.]

Let G be the cdf of X truncated to [a,b]

$G(x) = \dfrac{F(x) - F(a)}{F(b) - F(a)}$

38

Inverse cdf

What is G⁻¹? Solve G(x) = y:

$\dfrac{F(x) - F(a)}{F(b) - F(a)} = y$

$F(x) = y\,(F(b) - F(a)) + F(a)$

$x = F^{-1}\!\left(y\,(F(b) - F(a)) + F(a)\right)$

Draw u ~ U(0,1):

$x = F^{-1}\!\left(u\,(F(b) - F(a)) + F(a)\right)$

39

rtrun

rtrun=function(mu,sigma,a,b){
#
# function to draw from univariate truncated norm
# a is vector of lower bounds for truncation
# b is vector of upper bounds for truncation
#
FA=pnorm(((a-mu)/sigma))
FB=pnorm(((b-mu)/sigma))
mu+sigma*qnorm(runif(length(mu))*(FB-FA)+FA)
}
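A quick usage example (values are illustrative): five draws from a standard normal truncated below at 0, the case needed for z_i when y_i = 1 and x_i'beta = 0.

set.seed(1)
rtrun(mu = rep(0, 5), sigma = rep(1, 5), a = rep(0, 5), b = rep(100, 5))
# all five draws are positive, as required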

40

Probit conditional distributions

$[\beta \mid z, X] \propto [z \mid X, \beta]\,[\beta]$

$[\beta \mid z, X] \sim \text{Normal}\!\left(\tilde{\beta},\; (X'X + A)^{-1}\right)$

$\tilde{\beta} = (X'X + A)^{-1}(X'X\hat{\beta} + A\bar{\beta}), \qquad \hat{\beta} = (X'X)^{-1}X'z$

prior: $\beta \sim N(\bar{\beta}, A^{-1})$

Standard Bayes regression with unit error variance!

41

rbprobitGibbs

rbprobitGibbs=function(Data,Prior,Mcmc){
#
# purpose:
#   draw from posterior for binary probit using Gibbs Sampler
#
# Arguments:
#   Data - list of X,y
#     X is nobs x nvar, y is nobs vector of 0,1
#   Prior - list of A, betabar
#     A is nvar x nvar prior preci matrix
#     betabar is nvar x 1 prior mean
#   Mcmc
#     R is number of draws
#     keep is thinning parameter
#
# Output:
#   list of betadraws
#
# Model: y = 1 if w=Xbeta + e > 0   e ~ N(0,1)
#
# Prior: beta ~ N(betabar,A^-1)

42

rbprobitGibbs (continued)

# define functions needed
#
breg1=function(root,X,y,Abetabar) {
#
# Purpose: draw from posterior for linear regression, sigmasq=1.0
#
# Arguments:
#   root is chol((X'X+A)^-1)
#   Abetabar = A*betabar
#
# Output: draw from posterior
#
# Model: y = Xbeta + e   e ~ N(0,I)
#
# Prior: beta ~ N(betabar,A^-1)
#
cov=crossprod(root,root)
betatilde=cov%*%(crossprod(X,y)+Abetabar)
betatilde+t(root)%*%rnorm(length(betatilde))
}
. . . (error checking part of code) . . .

43

rbprobitGibbs (continued)

betadraw=matrix(double(floor(R/keep)*nvar),ncol=nvar)

beta=c(rep(0,nvar))

sigma=c(rep(1,nrow(X)))

root=chol(chol2inv(chol((crossprod(X,X)+A))))

Abetabar=crossprod(A,betabar)

a=ifelse(y == 0,-100, 0)

b=ifelse(y == 0, 0, 100)#

# start main iteration loop

#

itime=proc.time()[3]

cat("MCMC Iteration (est time to end - min) ",fill=TRUE)

flush()

if y = 0, truncate to (-100,0)

if y = 1, truncate to (0, 100)

44

rbprobitGibbs (continued)

for (rep in 1:R)

{

mu=X%*%beta

z=rtrun(mu,sigma,a,b)

beta=breg1(root,X,z,Abetabar)

}

45

Binary probit example

## rbprobitGibbs example

##

set.seed(66)

simbprobit=

function(X,beta) {

## function to simulate from binary probit including x variable

y=ifelse((X%*%beta+rnorm(nrow(X)))<0,0,1)

list(X=X,y=y,beta=beta)

}

46

Binary probit example

nobs=100
X=cbind(rep(1,nobs),runif(nobs),runif(nobs),runif(nobs))
beta=c(-2,-1,1,2)
nvar=ncol(X)
simout=simbprobit(X,beta)

Data=list(X=simout$X,y=simout$y)
Mcmc=list(R=2000,keep=1)

out=rbprobitGibbs(Data=Data,Mcmc=Mcmc)

cat(" Betadraws ",fill=TRUE)
mat=apply(out$betadraw,2,quantile,probs=c(.01,.05,.5,.95,.99))
mat=rbind(beta,mat); rownames(mat)[1]="beta"; print(mat)

47

[Figure: trace plot "Probit Beta Draws" (out$betadraw) over 2000 iterations, values ranging roughly -4 to 6.]

48

Summary statistics

Betadraws [,1] [,2] [,3] [,4]

beta -2.000000 -1.00000000 1.00000000 2.000000

1% -4.113488 -2.69028853 -0.08326063 1.392206

5% -3.588499 -2.19816304 0.20862118 1.867192

50% -2.504669 -1.04634198 1.17242924 2.946999

95% -1.556600 -0.06133085 2.08300392 4.166941

99% -1.233392 0.34910141 2.43453863 4.680425

49

Binary probit example

$\Pr(y = 1 \mid x, \beta) = \Phi(x'\beta)$

Example from BSM:

[Figure: posterior distributions of Pr(y = 1 | x) for two covariate vectors, panels labeled "Probability | x=(0,.1,0)" and "Probability | x=(0,4,0)".]
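A minimal sketch of how posteriors like those in the figure can be computed from the Gibbs output (x0 is a hypothetical covariate vector, not the one used in BSM):

x0 <- c(1, 0.5, 0.5, 0.5)                 # hypothetical covariate vector (first element = intercept)
p_draws <- pnorm(out$betadraw %*% x0)     # one draw of Pr(y=1|x0) per retained beta draw
quantile(p_draws, probs = c(.05, .5, .95))
hist(p_draws, main = "Posterior of Pr(y=1|x0)")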

50

Mixtures of normals

$y_i \mid ind_i \sim N(\mu_{ind_i},\, \Sigma_{ind_i})$

$ind_i \sim \text{Multinomial}(pvec)$

A general flexible model or a non-parametric method of density approximation?

ind_i is an augmented (latent) variable that points to which normal distribution is associated with observation i; ind classifies each observation into one of the length(pvec) components:

$y_i \mid ind_i = k \sim N(\mu_k, \Sigma_k)$

51

Model hierarchy

Hierarchy: $pvec \rightarrow ind_i \rightarrow y_i$, with $\{\mu_k, \Sigma_k\}$ entering the distribution of $y_i$.

Model: $[pvec]\,[ind \mid pvec]\,[\{\Sigma_k\}]\,[\{\mu_k\} \mid \{\Sigma_k\}]\,[y \mid ind, \{\mu_k, \Sigma_k\}]$

Conditionals: $[pvec \mid ind, \text{priors}]$, $[ind \mid pvec, \{\mu_k, \Sigma_k\}, y]$, $[\{\mu_k, \Sigma_k\} \mid ind, y, \text{priors}]$

Priors:

$pvec \sim \text{Dirichlet}(\alpha)$

$\Sigma_k \sim IW(\nu, V)$

$\mu_k \mid \Sigma_k \sim N(\bar{\mu},\; a^{-1}\Sigma_k), \qquad k = 1, \ldots, K$

52

Gibbs Sampler for Mixture of Normals

Conditionals:

$[pvec \mid ind] \sim \text{Dirichlet}(\alpha + n)$, where $n_k = \sum_{i=1}^{n} I(ind_i = k)$

$[ind_i \mid pvec, \{\mu_k, \Sigma_k\}, y_i] \sim \text{Multinomial}(\pi_i), \qquad \pi_i' = (\pi_{i,1}, \ldots, \pi_{i,K})$

$\pi_{i,k} = \dfrac{p_k\, \varphi(y_i \mid \mu_k, \Sigma_k)}{\sum_{k'} p_{k'}\, \varphi(y_i \mid \mu_{k'}, \Sigma_{k'})}$

φ( ) is the multivariate normal density
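A minimal univariate sketch (all values illustrative) of the indicator draw: compute pi_{i,k} proportional to p_k times the normal density and draw ind_i from a multinomial.

draw_ind <- function(y, pvec, mu, sigma) {
  K <- length(pvec)
  sapply(y, function(yi) {
    w <- pvec * dnorm(yi, mean = mu, sd = sigma)   # unnormalized pi_{i,k}
    sample(1:K, size = 1, prob = w / sum(w))       # multinomial draw of the component label
  })
}
# e.g. draw_ind(y = c(0.5, 3.2), pvec = c(.5, .5), mu = c(1, 2), sigma = c(1, 1))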

53

Gibbs Sampler for Mixtures of Normals

$[\{\mu_k, \Sigma_k\} \mid ind, y, \text{priors}]$

Given ind (the classification), this is just a multivariate regression model (MRM) for each component k:

$Y_k = \iota\,\mu_k' + U_k$, with rows $u_i' $, $u_i \sim N(0, \Sigma_k)$

$\Sigma_k \mid Y_k, \nu, V \sim IW(\nu + n_k,\; V + S_k)$

$\mu_k \mid Y_k, \Sigma_k, \bar{\mu}, a \sim N\!\left(\tilde{\mu}_k,\; \tfrac{1}{n_k + a}\Sigma_k\right)$

$\tilde{\mu}_k = \dfrac{n_k \bar{Y}_k + a\bar{\mu}}{n_k + a}, \qquad \bar{Y}_k = \frac{1}{n_k}\sum_{i:\, ind_i = k} y_i, \qquad S_k = (Y_k - \iota\tilde{\mu}_k')'(Y_k - \iota\tilde{\mu}_k')$

54

Identification for Mixtures of Normals

Likelihood for mixture of K normals can have up to K! modes of equal height!

So-called “label” switching problem: I can permute the labels of each component without changing likelihood.

Implies the Gibbs Sampler may not navigate all modes! Who cares?

Joint density or any function of this is identified!

55

Label-Switching Example

Consider a mixture of two univariate normals that are not very "separated" and with a relatively small amount of data. The density of y is unimodal with a mode at 1.5:

$y \sim .5\,N(1,1) + .5\,N(2,1)$

[Figure: trace plot of the component mean draws (mudraw, roughly 0.5 to 2.5) over 100 Gibbs iterations, each point labeled by its component (1 or 2); the labels repeatedly swap between components over the run -- these are the label-switches.]

56

Label-Switching Example

Density of y is identified. Using Gibbs Sampler, we get R draws from posterior of joint density

[Figure: R posterior draws of the mixture density p(y) plotted over y in (-1, 4); the draws overlay one another around a single-peaked density with height up to about 0.6.]

$p(y) = \pi\,\varphi(y \mid \mu_1, \sigma_1^2) + (1 - \pi)\,\varphi(y \mid \mu_2, \sigma_2^2)$

57

Identification for Mixtures of Normals

We use an unconstrained Gibbs Sampler (rnmixGibbs). Others advocate restrictions or post-processing of draws to identify the components.

Pros: superior mixing; focuses attention on identified quantities

Cons: can't make inferences about component parms; must summarize the posterior of the joint density!

58

Identification for Mixtures of Normals

In practice, what is the implication of label-switching?

we can't use:

$\frac{1}{R}\sum_r \mu_k^r \rightarrow E[\mu_k] \qquad \frac{1}{R}\sum_r \Sigma_k^r \rightarrow E[\Sigma_k]$

but we can use:

$\frac{1}{R}\sum_r \sum_k \pi_k^r\, \varphi(y \mid \mu_k^r, \Sigma_k^r) \rightarrow E[p(y)]$
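A minimal univariate sketch (hypothetical R x K draw matrices) of the identified quantity: average the mixture density itself over the posterior draws rather than averaging component parameters.

post_mean_density <- function(ygrid, pdraw, mudraw, sigdraw) {
  # pdraw, mudraw, sigdraw: R x K matrices of draws of pvec, mu_k, sigma_k
  R <- nrow(pdraw); K <- ncol(pdraw)
  dens <- numeric(length(ygrid))
  for (r in 1:R) {
    py <- numeric(length(ygrid))
    for (k in 1:K) py <- py + pdraw[r, k] * dnorm(ygrid, mudraw[r, k], sigdraw[r, k])
    dens <- dens + py / R               # running average of p(ygrid | theta^r)
  }
  dens
}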

59

Multivariate Mix of Norms Ex

Three-dimensional mixture of three normal components with distinct means $\mu_1, \mu_2, \mu_3$ and common covariance

$\Sigma_k = \Sigma = \begin{pmatrix} 1 & .5 & .5 \\ .5 & 1 & .5 \\ .5 & .5 & 1 \end{pmatrix}, \qquad pvec = (1/2,\; 1/3,\; 1/6)$

[Figure: trace of "Normal Component" (values 1-9) over 400 draws.]
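A minimal sketch of simulating data from a mixture like this one (the component means are illustrative assumptions; Sigma and pvec follow the slide):

set.seed(66)
n <- 500
pvec <- c(1/2, 1/3, 1/6)
mu <- list(c(1, 2, 3), c(2, 4, 6), c(3, 6, 9))      # hypothetical, well-separated component means
Sigma <- matrix(0.5, 3, 3); diag(Sigma) <- 1         # common covariance matrix from the slide
Lroot <- t(chol(Sigma))                              # lower Cholesky root
ind <- sample(1:3, size = n, replace = TRUE, prob = pvec)   # component indicators
y <- t(sapply(ind, function(k) mu[[k]] + Lroot %*% rnorm(3)))
# y (n x 3) could then be analyzed with bayesm's rnmixGibbs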

60

Multivariate Mix of Norms Ex

$\hat{p}(y) = \frac{1}{R}\sum_{r=1}^{R} p\!\left(y \mid \pi^r, \{\mu_k^r, \Sigma_k^r\}\right) = \frac{1}{R}\sum_{r=1}^{R}\sum_{k=1}^{K} \pi_k^r\, \varphi(y \mid \mu_k^r, \Sigma_k^r)$

[Figure: estimated marginal density (values up to about 0.30 over the range 0 to 15) at draw 100.]

61

Bivariate Distributions and Marginals

[Figure: a univariate marginal density estimate, the true bivariate marginal, and the posterior mean of the bivariate marginal.]
