
Statistics & Probability Letters 17 (1993) 231-236

North-Holland

18 June 1993

A slowly mixing Markov chain with implications for Gibbs sampling

Peter Matthews University of Maryland Baltimore County, Baltimore, MD, USA

Received December 1991

Revised October 1992

Abstract: We give a Markov chain that converges to its stationary distribution very slowly. It has the form of a Gibbs sampler running on a posterior distribution of a parameter θ given data X. Consequences for Gibbs sampling are discussed.

Keywords: Gibbs sampling; posterior distribution; mixing rate; coupling.

1. Introduction

Gibbs sampling is a Monte Carlo technique that has seen explosive growth recently in Bayesian statistics. See, for example, Gelfand and Smith (1990). Gibbs sampling has deservedly been presented optimistically as a means of attacking heretofore intractable posterior calculations. Though it is understood that Gibbs sampling can fail, this possibility has not received the careful attention it deserves. One difficulty is the number of steps necessary to give near convergence to the stationary (posterior) distribution. Gelfand, Hills, Racine-Poon and Smith (1990) suggest monitoring the sample paths of the Gibbs sampler until convergence is apparently achieved. They make it clear that this methodology is non-rigorous. The purpose of this note is to show there are situations where this methodology can fail. We give an example in which, with high probability, a Gibbs sampler will appear to converge when in fact true convergence takes much longer. Gelman and Rubin (1992) give other examples of non-convergent Gibbs samplers appearing to converge. The example presented here is mathematically simple, to facilitate theoretical computations. However, it should be realistic enough to suggest that care is needed when using a Gibbs sampler.

The remainder of this note is organized as follows. Section 2 states the model. Section 3 shows that the sampler's output remains close to uniform for a long time, which gives a lower bound on the time needed for convergence to the posterior. Section 4 gives an example for particular, somewhat realistic, values of the model parameters. Section 5 discusses the issues raised in this note.

2. The general model and Gibbs sampling

We consider sampling from the posterior distribution of a d-dimensional parameter θ given some data X. Let C denote the open d-dimensional hypercube (0, 1)^d. We take the prior distribution Π(θ) to be uniform on C.

Correspondence to: Peter Matthews, Department of Mathematics and Statistics, University of Maryland Baltimore County, Baltimore, MD 21228, USA.

Research supported by NSF and NSA under NSF grant DMS 9001295.


For some known parameters δ ∈ (0, 1) and σ > 0 we take

    f(X \mid \theta) = (1-\delta)\,(2\pi\sigma^2)^{-d/2} \exp\Bigl(-\frac{1}{2\sigma^2}\sum_{i=1}^{d}(X_i-\theta_i)^2\Bigr) + \delta, \qquad X \in C. \tag{2.1}

Thus the distribution of X given θ is a (1-δ, δ) mixture of N(θ, σ²I) and the uniform distribution on C. The posterior Π(θ | X) = f(X | θ)Π(θ)/m(X) is proportional to f(X | θ) on C. The marginal m(X) need not play any role in Gibbs sampling; we consider only sampling from a density proportional to f(X | θ).
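For concreteness, here is a minimal sketch of (2.1) as code. This is our own illustration in Python, not code from the paper; the function name f_given_theta and the default parameter values are ours:

    import numpy as np

    def f_given_theta(x, theta, sigma=0.01, delta=0.01):
        # Likelihood (2.1): a (1 - delta, delta) mixture of N(theta, sigma^2 I)
        # and the uniform density on C = (0, 1)^d, evaluated at a point x in C.
        x, theta = np.asarray(x), np.asarray(theta)
        d = theta.size
        normal = (2 * np.pi * sigma**2) ** (-d / 2) * np.exp(
            -np.sum((x - theta) ** 2) / (2 * sigma**2))
        return (1 - delta) * normal + delta

As a function of θ with X held fixed, this same expression is the unnormalized posterior on C.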

We consider Gibbs sampling implemented as follows. An initial value θ^0 = (θ^0_1, ..., θ^0_d) ∈ C is chosen. We will choose θ^0 with distribution Π(θ), the uniform distribution on C. A Markov chain on C is run for t steps. Each step of the chain consists of updating each of the d components of θ in turn. Suppose the current state is (θ^t_1, ..., θ^t_{k-1}, θ^{t-1}_k, ..., θ^{t-1}_d). That is, we are making the transition from θ^{t-1} to θ^t; components 1, ..., k-1 have been updated, and component k is to be updated next. The Gibbs sampler chooses a value θ^t_k from the distribution f(θ_k | θ_1 = θ^t_1, ..., θ_{k-1} = θ^t_{k-1}, θ_{k+1} = θ^{t-1}_{k+1}, ..., θ_d = θ^{t-1}_d, X). These t steps of d updates may then be repeated independently m times from independently chosen initial positions.
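Each full conditional here is itself a two-part mixture: a normal density truncated to (0, 1), plus a constant contributed by the uniform branch of (2.1) (this decomposition is written out in the proof of Theorem 1 below), so the update can be performed exactly. The following hypothetical Python sketch, with names of our choosing, implements one step of d component updates:

    import random
    from math import erf, exp, pi, sqrt

    def std_normal_cdf(z):
        # Standard normal CDF via the error function.
        return 0.5 * (1.0 + erf(z / sqrt(2.0)))

    def gibbs_step(theta, x, sigma=0.01, delta=0.01):
        # One step of the sampler: update components 1, ..., d in turn, each
        # drawn exactly from its full conditional under (2.1).
        d = len(theta)
        for k in range(d):
            ss = sum((x[j] - theta[j]) ** 2 for j in range(d) if j != k)
            # Mass on (0, 1) of the normal branch of the conditional.
            c = (1 - delta) * (sigma * sqrt(2 * pi)) ** (-(d - 1)) * exp(-ss / (2 * sigma**2))
            p = c * (std_normal_cdf((1 - x[k]) / sigma) - std_normal_cdf(-x[k] / sigma))
            if random.random() < delta / (delta + p):
                theta[k] = random.random()   # uniform branch
            else:
                while True:                  # N(x_k, sigma^2) truncated to (0, 1)
                    y = random.gauss(x[k], sigma)
                    if 0.0 < y < 1.0:
                        theta[k] = y
                        break
        return theta

Starting from theta = [random.random() for _ in range(d)] and iterating gibbs_step gives the chain described above.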

In our example the distribution L(θ^t) of θ^t converges at an exponential rate to the posterior Π(θ | X) in total variation. One can easily show that the one-step transition function of the Markov chain is bounded from below (though the bound depends on X). Thus the Doeblin condition (Doob, 1953, p. 256) is met and geometric ergodicity follows. That is,

    \|\mathcal{L}(\theta^t) - \Pi(\theta \mid X)\| = \sup_A |P(\theta^t \in A) - \Pi(A \mid X)| \le M r^t

for some M > 0 and r < 1. M and r may depend on X. Further, with a deterministic initial position one can show uniform (in the initial position) geometric ergodicity. This is an asymptotic result. The exponent r may be arbitrarily close to 1, so for any t of interest this variation distance may be arbitrarily close to 1. Doss and Sethuraman (1991) and Tierney (1991) give more general conditions under which the distribution L(θ^t) of θ^t converges at an exponential rate to the posterior Π(θ | X) in total variation.

3. A lower bound on the mixing rate

In this section we use coupling to give an upper bound on the variation distance between L(θ^t) and the prior Π, not the posterior Π(θ | X). Actually we show more; we show that the entire sample path θ^0, ..., θ^t has a law that is close in total variation to the distribution of t + 1 independent random vectors, each uniformly distributed on C. We do this by showing that, with high probability, the process θ^0, ..., θ^t can be coupled with the output of a Gibbs sampler driven by Π(θ) rather than Π(θ | X). Once this is shown, if Π(θ) and Π(θ | X) are distant in total variation, then L(θ^t) and Π(θ | X) must be as well.

Theorem 1. Let U^{t+1}(C) denote the distribution of t + 1 independent random vectors each uniformly distributed on C. For X ∈ C, with

    a = \log\frac{1-\delta}{\delta} - (d-1)\log(\sigma\sqrt{2\pi}) \quad\text{and}\quad b = \frac{2\pi\sigma^2}{\Gamma(\tfrac{1}{2}(d+1))^{2/(d-1)}},

we have

    \|\mathcal{L}(\theta^0, \ldots, \theta^t) - U^{t+1}(C)\| \le 2td\Bigl[\frac{b}{2}\Bigl(\sqrt{a^2 + \frac{2(d-1)}{b}} + a\Bigr)\Bigr]^{(d-1)/2}. \tag{3.1}


Proof. Consider a second Markov chain ψ^t. Let ψ^0 be uniformly distributed on C. If ψ^{i+1} is obtained from ψ^i via d updates in Gibbs sampling driven by the prior Π(θ), then it is straightforward to show that L(ψ^0, ..., ψ^t) is U^{t+1}(C). To prove the theorem we need only construct (θ^0, ..., θ^t) and (ψ^0, ..., ψ^t) on the same probability space such that

    P\bigl((\theta^0, \ldots, \theta^t) \ne (\psi^0, \ldots, \psi^t)\bigr) \le 2td\Bigl[\frac{b}{2}\Bigl(\sqrt{a^2 + \frac{2(d-1)}{b}} + a\Bigr)\Bigr]^{(d-1)/2}.

We suppose (ψ^0, ..., ψ^t) are defined and construct (θ^0, ..., θ^t) from (ψ^0, ..., ψ^t) and independent random variables as needed. Set θ^0 = ψ^0, since both initial distributions are Π. Inductively suppose we have (θ^i_1, ..., θ^i_{k-1}, θ^{i-1}_k, ..., θ^{i-1}_d) and we need to construct θ^i_k. If (θ^i_1, ..., θ^i_{k-1}, θ^{i-1}_k, ..., θ^{i-1}_d) ≠ (ψ^i_1, ..., ψ^i_{k-1}, ψ^{i-1}_k, ..., ψ^{i-1}_d), then generate θ^i_k using random variables independent of ψ^0, ..., ψ^t. In this case the processes have already uncoupled. However, if (θ^i_1, ..., θ^i_{k-1}, θ^{i-1}_k, ..., θ^{i-1}_d) = (ψ^i_1, ..., ψ^i_{k-1}, ψ^{i-1}_k, ..., ψ^{i-1}_d), then we proceed as follows.

For X ∈ C, f(θ_k | θ_1, ..., θ_{k-1}, θ_{k+1}, ..., θ_d, X) is proportional to

    (1-\delta)(\sigma\sqrt{2\pi})^{-d}\exp\Bigl(-\tfrac{1}{2}\sum_{j\ne k}\bigl((X_j-\theta_j)/\sigma\bigr)^2\Bigr)\exp\Bigl(-\tfrac{1}{2}\bigl((\theta_k-X_k)/\sigma\bigr)^2\Bigr) + \delta,

where superscripts on θ have been dropped. Let p denote

    (1-\delta)(\sigma\sqrt{2\pi})^{-d}\exp\Bigl(-\tfrac{1}{2}\sum_{j\ne k}\bigl((X_j-\theta_j)/\sigma\bigr)^2\Bigr)\int_0^1 \exp\Bigl(-\tfrac{1}{2}\bigl((\theta-X_k)/\sigma\bigr)^2\Bigr)\,d\theta.

Integration gives p < (1-δ)(σ√(2π))^{-(d-1)} exp(-(2σ²)^{-1} Σ_{j≠k}(X_j - θ_j)²). Then θ^i_k can be generated as follows. With probability δ/(δ + p), let θ^i_k = ψ^i_k, which is U(0, 1). In this case the processes remain coupled. With probability p/(p + δ) choose θ^i_k from the N(X_k, σ²) density conditional on θ^i_k being in (0, 1).
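As a sketch, here is the coupled update for a single component in the same hypothetical Python as the gibbs_step sketch of Section 2: ψ_k is always a fresh uniform draw, and θ_k copies it exactly when the δ-branch fires, so the chains stay equal until the normal branch is first selected.

    import random
    from math import erf, exp, pi, sqrt

    def coupled_update(theta, psi, k, x, sigma=0.01, delta=0.01):
        # Update component k of both chains on a common probability space.
        psi[k] = random.random()     # prior-driven chain: always U(0, 1)
        d = len(theta)
        ss = sum((x[j] - theta[j]) ** 2 for j in range(d) if j != k)
        c = (1 - delta) * (sigma * sqrt(2 * pi)) ** (-(d - 1)) * exp(-ss / (2 * sigma**2))
        # Exact mass on (0, 1) of the normal branch of theta's conditional.
        p = c * 0.5 * (erf((1 - x[k]) / (sigma * sqrt(2))) - erf(-x[k] / (sigma * sqrt(2))))
        if random.random() < delta / (delta + p):
            theta[k] = psi[k]        # coupled: reuse the same uniform value
        else:
            while True:              # uncoupled: truncated N(x_k, sigma^2) draw
                y = random.gauss(x[k], sigma)
                if 0.0 < y < 1.0:
                    theta[k] = y
                    break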

The probability that (θ^0, ..., θ^t) ≠ (ψ^0, ..., ψ^t) is bounded by, in obvious notation,

    \sum_{i=1}^{t}\sum_{k=1}^{d} E\Bigl[\frac{p_{ik}}{p_{ik}+\delta};\ (\theta^i_1,\ldots,\theta^i_{k-1},\theta^{i-1}_k,\ldots,\theta^{i-1}_d) = (\psi^i_1,\ldots,\psi^i_{k-1},\psi^{i-1}_k,\ldots,\psi^{i-1}_d)\Bigr].

We can write p_{ik} as a function of ψ^{i-1} and ψ^i on the indicated set, drop the indicator functions, and use the fact that the random variables ψ^i_k, i = 0, ..., t, k = 1, ..., d, are i.i.d. U(0, 1) to obtain

    P\bigl((\theta^0,\ldots,\theta^t) \ne (\psi^0,\ldots,\psi^t)\bigr) \le td\, E\bigl[p_{11}/(p_{11}+\delta)\bigr].

Since P(0 ≤ p_{11}/(p_{11}+δ) ≤ 1) = 1, if we show P(p_{11}/(p_{11}+δ) > γ) ≤ γ, then E(p_{11}/(p_{11}+δ)) ≤ 2γ. Thus consider

    P(p_{11}/\delta > \gamma) \le P\Bigl(\sum_{j\ne 1}(X_j-\psi_j)^2 < 2\sigma^2\log\Bigl[\frac{1-\delta}{\gamma\delta}\Bigl(\frac{1}{\sigma\sqrt{2\pi}}\Bigr)^{d-1}\Bigr]\Bigr).

Since ψ_2, ..., ψ_d are i.i.d. U(0, 1), this is the probability that they, as a point in R^{d-1}, lie within a small ball around (X_2, ..., X_d). The volume of a ball of radius r in R^{d-1} is [π^{(d-1)/2}/Γ(½(d+1))] r^{d-1}, so this probability is at most

    \frac{\pi^{(d-1)/2}}{\Gamma(\tfrac{1}{2}(d+1))}\Bigl(2\sigma^2\log\Bigl[\frac{1-\delta}{\gamma\delta}\Bigl(\frac{1}{\sigma\sqrt{2\pi}}\Bigr)^{d-1}\Bigr]\Bigr)^{(d-1)/2}. \tag{3.2}


We must find a small γ such that (3.2) ≤ γ. In terms of a and b defined in the theorem, we require

    (ba - b\log\gamma)^{(d-1)/2} \le \gamma.

Let γ = r^{(d-1)/2}. Then we must satisfy ab - ½(d-1)b log r ≤ r. Note that log r > -1/r for r ∈ (0, 1), so it suffices to have r² - abr - ½(d-1)b ≥ 0. Consider the positive root of the corresponding equality. This root is

    r = \frac{b}{2}\Bigl(\sqrt{a^2 + \frac{2(d-1)}{b}} + a\Bigr).

If this root is bigger than 1, then (3.1) is trivial. If it is between 0 and 1, then it gives a satisfactory γ. The right side of (3.1) is then at most 2tdγ. □

4. An example

In (2.1) take d = 9, σ = 0.01 and δ = 0.01. For simplicity assume 0.03 < X_i < 0.97 for i = 1, ..., 9. This has prior probability approximately (0.94)^9 ≈ 0.57. We shall see that the Gibbs sampler does not converge in any reasonable time frame, though it appears to.

A realistic situation similar to this is the following. I have the uniform distribution on C as my prior for a set of nine proportions. Someone else does an experiment and reports X along with the claim that given θ, X_1, ..., X_9 are independent with 2500·X_i ~ Binomial(2500, θ_i). I do not completely trust this person's experimental design. With personal probability 0.99, I think their design is correct, but with probability 0.01 I think their experiment is so flawed as to say nothing about θ. Given the frequency of faulty studies nominally supported by statistics, this is not unreasonable. Here a model like (2.1) occurs, though a scaled binomial with variance depending on θ replaces the normal. The normal model was studied for mathematical simplicity; one could expect similar results with the binomial.

In this situation a = log(0.99/0.01) - 8 log(0.01√(2π)) ≈ 34.08 and b = 2π(0.01)²/Γ(5)^{1/4} ≈ 0.000284.
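These constants, and the coefficient of t in the resulting bound, can be checked numerically; a hypothetical Python sketch (variable names ours):

    from math import gamma, log, pi, sqrt

    d, sigma, delta = 9, 0.01, 0.01
    a = log((1 - delta) / delta) - (d - 1) * log(sigma * sqrt(2 * pi))
    b = 2 * pi * sigma**2 / gamma((d + 1) / 2) ** (2 / (d - 1))
    # Positive root of r^2 - a*b*r - (d - 1)*b/2 = 0, from the proof of Theorem 1.
    r = (b / 2) * (a + sqrt(a**2 + 2 * (d - 1) / b))
    coeff = 2 * d * r ** ((d - 1) / 2)   # the bound (3.1) is coeff * t
    print(a, b, coeff)                   # roughly 34.08, 0.000284, 4.12e-5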

Thus

    \|\mathcal{L}(\theta^0, \ldots, \theta^t) - U^{t+1}(C)\| \le 0.0000412\,t.

On the other hand, the variation distance between the prior and posterior is nearly 1. To see this we compute the probability of ∩_{1}^{9} {|X_i - θ_i| < 0.03}. For the values of X we consider, this is a cube centered at X lying entirely within C. Thus the prior probability of this cube is (0.06)^9. For all X, by integrating (2.1) we see that the marginal density m(X) is at most 1. With this, the posterior probability of ∩_{1}^{9} {|X_i - θ_i| < 0.03} can be bounded from below by integrating (2.1); the result is at least 0.99(0.997)^9. Thus a lower bound on the total variation distance between the prior and posterior is 0.99(0.997)^9 - (0.06)^9 ≈ 0.96. Theorem 1 now implies that

    \|\mathcal{L}(\theta^t) - \Pi(\theta \mid X)\| \ge 0.96 - 0.0000412\,t.

So at least tens of thousands of steps are required for convergence.

This process is deceptive in that empirically it appears to converge rapidly. Suppose we perform 200 independent runs of the same process, each of length 50. Then with probability at least (1 - 0.0000412·50)^200 ≈ 0.66, in all of the runs we can think of θ and ψ as remaining coupled. In other words, on a set of probability about 0.66, the output of the Gibbs sampler looks just like 200 sequences of length 50 of i.i.d. uniform (0, 1)^9 random vectors. It would be easy to infer rapid convergence naively from these data. Looking at one long output sequence would have an identical problem.
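The arithmetic behind the 0.66 figure, continuing the hypothetical check above:

    # Per-update uncoupling bound from (3.1): coeff is roughly 4.12e-5, so a
    # 50-step run stays coupled except with probability at most 50 * coeff.
    coeff = 4.12e-5
    print((1 - 50 * coeff) ** 200)   # chance all 200 runs stay coupled: about 0.66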

By increasing the dimension or decreasing σ, the probability that the two processes remain coupled can be made arbitrarily close to 1 for an arbitrarily long Gibbs sampling sequence. The value of δ is almost immaterial, since it enters only logarithmically in the coefficient a. A value of 0.01 seems reasonable, though choosing a value several orders of magnitude smaller would not significantly affect the results.


5. Discussion

It is interesting to consider why convergence to stationarity is slow in this example. The posterior is unimodal. For multimodal posteriors, convergence can certainly be slow due to the process being trapped for a long time near a local mode. A nice example of this in combinatorial problems is Jerrum (1992). On the other hand, the posterior is not log-concave. Computer scientists have developed algorithms for integration of log-concave functions that give provably accurate results in polynomial time. There is a precise technical sense to the polynomiality of the time taken, and the current algorithms are probably not yet fast enough to be useful. See, for example, Dyer and Frieze (1991) and Applegate and Kannan (1991). In the example considered here the posterior is nearly singular with respect to the prior. This alone is not enough to cause slow convergence. If δ were set to 0 in (2.1), then this near singularity would become stronger, yet convergence would be rapid. The parameter space is high dimensional, though, as with the near prior/posterior singularity, this alone is not enough to cause slow convergence.

What seems to me to be the best explanation comes from a geometric view of the posterior density. Consider the region of (d+1)-dimensional space parameterized by θ_1, ..., θ_d, y, bounded by the hyperplane y = 0 and the manifold y = Π(θ_1, ..., θ_d | X). Geometrically this region looks much like a high-dimensional thumbtack; there is a head with a narrow spike extending from it. In three dimensions (with a two-dimensional prior) the analogy is clear. Due to the presence of δ in (2.1) and the thinness of the normal tails, the head of the tack is nearly flat. Since each iteration involved in a step of the Gibbs sampler only looks in a prespecified direction, until the current position is such that the spike is in the visible direction, the sampler will wander aimlessly around the head of the tack. Once it sees the spike, with high probability the Gibbs sampler will go there and remain there a long time. Of course this analogy is not quite correct in that the posterior is unimodal, so there will be some tendency for the sampler to drift towards the spike. In this example, since the normal distribution tails are so thin, this drift is negligible. It may be useful to think of this example as being like sampling from a bimodal distribution; in either case there are two regions of space the sampler alternates between, spending a long time in each before moving to the other. Here the two regions are the head and the spike of the thumbtack. Requiring log-concavity of the density would rule out this extreme behavior.

Since the posterior is explicit in this example, it is easy to suggest corrections in this case. The real problem occurs if you really only have information on the conditionals f(θ_i | X, {θ_j, j ≠ i}). In this situation naive use of the Gibbs sampler is destined to have difficulties. Theorem 1 says that the set of sample paths from the Gibbs sampler in this problem and the set of sample paths from a process that simply picks points at random from the unit cube have a substantial overlap. Picking points at random is very rapidly mixing; it takes only one step to become completely random. Suppose an algorithm had the task of deciding whether to stop a Gibbs sampler, based only on the sample path observed so far. The implication of Theorem 1 here is that any algorithm that stops early with high probability when points are chosen uniformly at random must also stop early with high probability when observing the Gibbs sampler in this example. Thus no algorithm can do all that is desired; it cannot both stop early for rapidly mixing chains and wait to stop for slowly mixing chains. This problem cannot be avoided by looking at several shorter paths of the Gibbs sampler with different initial positions rather than one long path.

The only solution seems to be to make use of some information about the posterior. This, unfortunately, leads to Gibbs samplers that are not as easy for non-experts to implement. For example, if we can find its modes by applying a hill-climbing algorithm to the posterior, then we can easily dispense with problems like this one. Cui, Tanner, Sinha and Hall (1992) give a diagnostic for monitoring convergence and apply it to the example of this paper.

References

Applegate, D. and R. Kannan (1991), Sampling and integration of near log-concave functions, preprint.

Cui, L., M. Tanner, D. Sinha and W. Hall (1992), Monitoring convergence of the Gibbs sampler: further experience with the Gibbs stopper, preprint, Dept. of Biostatist., Univ. of Rochester (Rochester, NY).

Doob, J. (1953), Stochastic Processes (Wiley, New York).

Doss, H. and J. Sethuraman (1991), A study of the convergence properties of successive substitution sampling based on Harris recurrence of Markov chains, preprint.

Dyer, M. and A. Frieze (1991), Computing the volume of convex bodies: a case where randomness helps, Res. Rept. 91-104, Dept. of Math., Carnegie-Mellon Univ. (Pittsburgh, PA).

Gelfand, A., S. Hills, A. Racine-Poon and A. Smith (1990), Illustrations of Bayesian inference in normal data models using Gibbs sampling, J. Amer. Statist. Assoc. 85, 972-985.

Gelfand, A. and A. Smith (1990), Sampling-based approaches to calculating marginal densities, J. Amer. Statist. Assoc. 85, 398-409.

Gelman, A. and D. Rubin (1992), Honest inferences from iterative simulation, Tech. Rept., Dept. of Statist., Univ. of California (Berkeley, CA).

Jerrum, M. (1992), Large cliques elude the Metropolis process, Random Struct. Algor. 3, 347-359.

Tierney, L. (1991), Markov chains for exploring posterior distributions, Tech. Rept. 560, School of Statist., Univ. of Minnesota (Minneapolis, MN).