

Biometrika (1998), 85, 1, pp. 73-87
Printed in Great Britain

Monte Carlo methods for Bayesian analysis of constrained parameter problems

BY MING-HUI CHEN
Department of Mathematical Sciences, Worcester Polytechnic Institute, 100 Institute Road, Worcester, Massachusetts 01609, U.S.A.
[email protected]

AND QI-MAN SHAO
Department of Mathematics, University of Oregon, Eugene, Oregon 97404, U.S.A.
[email protected]

SUMMARY

Constraints on the parameters in a Bayesian hierarchical model typically make Bayesian computation and analysis complicated. Posterior densities that contain analytically intractable integrals as normalising constants depending on the hyperparameters often make implementation of Gibbs sampling or the Metropolis algorithm difficult. By using reweighting mixtures (Geyer, 1995), we develop alternative simulation-based methods to determine properties of the desired Bayesian posterior distribution. Necessary theory and two illustrative examples are provided.

Some key words: Bayesian computation; Bayesian hierarchical model; Gibbs sampling; Marginal posterior density estimation.

1. INTRODUCTION

In this paper we consider a Bayesian hierarchical model with constrained parameter spaces. Let θ be a p-dimensional parameter vector and let λ denote a q-dimensional parameter vector. Typically, θ contains the parameters of interest and λ is a vector of the hyper- or nuisance parameters. Let the posterior distribution be of the form

π(θ, λ | data) = (1/c_w) L(data, θ) {π(θ | λ) I_S(θ)/c(λ)} π(λ) I_A(λ),   (1)

where L(data, θ) is the likelihood function, π(θ | λ) and π(λ) are priors,

c(λ) = ∫_S π(θ | λ) dθ,   (2)

the constrained space S, a subset of R^p, for θ may or may not depend on the data according to the nature of the problem, and A ⊂ R^q is the support of π(λ). In (1), the support of π(θ, λ | data) is

Ω = S ⊗ A = {(θ, λ): θ ∈ S, λ ∈ A},   c_w = ∫_Ω L(data, θ){π(θ | λ)/c(λ)}π(λ) dθ dλ,

and the indicator function I_S(θ) = 1 if θ ∈ S, and I_S(θ) = 0 otherwise. We also define a


pseudo-posterior density by

π*(θ, λ | data) = (1/c*) L(data, θ) π(θ | λ) I_S(θ) π(λ) I_A(λ),   (3)

where

c* = ∫_Ω L(data, θ) π(θ | λ) π(λ) dθ dλ.

It is easy to observe that c_w = c* E^{π*}{1/c(λ)}, where the expectation E^{π*} is taken with respect to π*(θ, λ | data). This convention will be used throughout this paper.

Without the constraints, π(θ | λ) is a completely known density function, that is, ∫_{R^p} π(θ | λ) dθ = 1. Note that c(λ) is the normalising constant of the density function π(θ | λ)I_S(θ), or the conditional probability P(θ ∈ S | λ) with respect to π(θ | λ). Therefore, 0 < c(λ) ≤ 1. As a result of the complexity of the constrained parameter space S, analytical evaluation of c(λ) is typically not available and c(λ) often depends on the hyperparameter vector λ. This makes Bayesian analysis very difficult. As Gelfand, Smith & Lee (1992) noted, direct sampling from π(θ, λ | data) is nearly impossible. Therefore, the Gibbs sampler (Geman & Geman, 1984; Gelfand & Smith, 1990) and the Metropolis algorithm (Metropolis et al., 1953; Hastings, 1970) either cannot be directly applied or are too expensive.
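For a single fixed λ, the quantity c(λ) = P(θ ∈ S | λ) is itself a simple Monte Carlo target. A minimal Python sketch for a toy setting not taken from the paper, a N(λ, I₃) prior with the ordering constraint θ₁ ≥ θ₂ ≥ θ₃, makes this concrete:

    import numpy as np

    rng = np.random.default_rng(0)

    def c_hat(lam, n_draws=100_000):
        # draws from the toy prior pi(theta | lambda) = N(lambda, I_3)
        theta = rng.normal(loc=lam, scale=1.0, size=(n_draws, 3))
        # indicator of S = {theta_1 >= theta_2 >= theta_3}
        in_S = (theta[:, 0] >= theta[:, 1]) & (theta[:, 1] >= theta[:, 2])
        return in_S.mean()          # estimates c(lambda) = P(theta in S | lambda)

    print(c_hat(np.array([2.0, 1.0, 0.0])))   # ordering supported by lambda: c(lambda) away from 0
    print(c_hat(np.array([0.0, 1.0, 2.0])))   # ordering contradicted: c(lambda) near 0

The difficulty addressed in this paper is that such an evaluation would otherwise have to be repeated at every value of λ visited by a sampler; the reweighting approach developed below avoids this.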

Bayesian constrained parameter problems with normalising constants naturally arise in models with ordered and data-constrained parameters, in problems involving ordered multinomial parameters and in Bayesian hierarchical models when constraints are imposed on lower-level parameters while higher-level parameters are random; see Gelfand et al. (1992) for detailed discussion, and also see § 4 for two illustrative examples. Such problems also arise in Bayesian inference using weighted distributions; see, for example, Larose & Dey (1996). As a result of the computational difficulties, the literature on Bayesian analysis of constrained parameter problems with normalising constants is still sparse. Often one avoids these problems either by specifying the values of the hyperparameters λ or simply by ignoring the normalising constant c(λ), and this leads to an incorrect Bayesian formulation.

In this paper, borrowing the reweighting mixtures idea of Geyer (1995), we develop alternative Monte Carlo methods for solving these problems. Note that the pseudo-posterior density π*(θ, λ | data) does not contain c(λ), and therefore a natural choice of a proposal density for π(θ, λ | data) is π*(θ, λ | data). For many Bayesian hierarchical models, it is possible to simulate from π*(θ, λ | data) by using a Markov chain Monte Carlo sampling algorithm, as in § 4. Further, note that π*(θ, λ | data) will serve as a good proposal density if c(λ) is bounded away from 0. If the constraints naturally arise from a problem and the data support the constraints, the probability of the constrained parameter space should not be too small; see Chen & Deely (1996) for an example.

Using a random sample from π*(θ, λ | data), we develop Monte Carlo methods for computing the posterior quantities of interest in π(θ, λ | data). Thus, essentially as in Geyer (1995), one obtains samples from one distribution but wants to do integration with respect to another distribution.

This paper is organised as follows. In § 2, we show how to estimate the posterior properties of π(θ, λ | data) based on a random sample from π*(θ, λ | data). In § 3 we propose Monte Carlo methods to estimate simultaneously the normalising constants in order to obtain


good estimators of posterior quantities of interest. Two illustrative examples are given in § 4, with concluding remarks in § 5.

2. POSTERIOR MOMENTS AND MARGINAL POSTERIOR DENSITIES

2.1. Posterior moments

Throughout § 2, we let {(θ_i, λ_i), i = 1, 2, ..., n} be a random sample from π*(θ, λ | data) and we temporarily assume that c(λ) is known.

Let h(θ, λ) be a function of θ and λ, and assume that E^π|h(θ, λ)| < ∞. Then

E^π{h(θ, λ)} = [∫_Ω {h(θ, λ)/c(λ)} π*(θ, λ | data) dθ dλ] [∫_Ω {1/c(λ)} π*(θ, λ | data) dθ dλ]^{−1}.   (4)

Note that, when h(θ, λ) = θ, E^π{h(θ, λ)} is the posterior mean of θ; when h(θ, λ) = {θ − E^π(θ)}{θ − E^π(θ)}′, E^π{h(θ, λ)} is the posterior variance-covariance matrix of θ; and, when h(θ, λ) = I_A(θ, λ), E^π{h(θ, λ)} gives the posterior probability of a set A.

Using the sample {(θ_i, λ_i), i = 1, 2, ..., n}, we estimate E^π{h(θ, λ)} by

ĥ = Σ_{i=1}^n w_i h(θ_i, λ_i),   (5)

where the weight w_i is defined as

w_i = {1/c(λ_i)} [Σ_{j=1}^n 1/c(λ_j)]^{−1} = [Σ_{j=1}^n c(λ_i)/c(λ_j)]^{−1}.   (6)

Thus w_i is a function of the normalising constants c(λ_1), c(λ_2), ..., c(λ_n), with 0 ≤ w_i ≤ 1. A nice feature of (5) is that, whenever a random sample {(θ_i, λ_i), i = 1, 2, ..., n} is taken and the w_i are computed, the same w_i can be used for calculating ĥ for every h as long as E^π|h(θ, λ)| < ∞. For example, one obtains the posterior mean of θ and its posterior standard deviation using the same w_i. Finally, it is easy to observe that ĥ is a consistent estimator of E^π{h(θ, λ)}:

ĥ → E^π{h(θ, λ)}   (7)

almost surely, as n → ∞.
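In code, (5) and (6) amount to a single normalised reweighting of the π* sample. A minimal sketch, assuming for the moment that the c(λ_i) are computable (the realistic case, where they must themselves be estimated, is treated in § 3):

    import numpy as np

    def posterior_expectation(h_values, c_values):
        # h_values[i] = h(theta_i, lambda_i) for draws (theta_i, lambda_i) from pi*;
        # c_values[i] = c(lambda_i)
        w = 1.0 / np.asarray(c_values, dtype=float)
        w = w / w.sum()                             # weights w_i of (6); 0 <= w_i <= 1
        return float(np.sum(w * np.asarray(h_values)))   # h-hat of (5)

As noted above, the same weight vector serves every h, so posterior means, variances and set probabilities can all be computed from one pass over the sample.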

2.2. Marginal posterior densities

If we use the importance-weighted marginal density estimation method proposed by Chen (1994), it can be observed that a marginal posterior density is indeed an integral-type posterior property if we choose an appropriate h in § 2.1; see Chen (1996) for detailed discussion. Thus, this subsection is a special case of § 2.1.

Write θ = (θ_{(p0)}, θ_{(−p0)}), where 1 ≤ p0 ≤ p, θ_{(p0)} is the vector of p0 components of θ, and θ_{(−p0)} is the vector of the remaining p − p0 components. Let π(θ*_{(p0)} | data) be the joint marginal posterior density at a fixed point θ*_{(p0)}. The support of the conditional posterior density of θ_{(p0)} given θ_{(−p0)} and λ is denoted by

S_{p0}(θ_{(−p0)}, λ) = {θ_{(p0)}: (θ_{(p0)}, θ_{(−p0)}, λ) ∈ S × A}.

We take

h(θ, λ) = w*(θ_{(p0)} | θ_{(−p0)}, λ) L(data, θ*_{(p0)}, θ_{(−p0)}) π(θ*_{(p0)}, θ_{(−p0)} | λ) I_S(θ*_{(p0)}, θ_{(−p0)}) / {L(data, θ) π(θ | λ)},   (8)


where π*(θ, λ | data) is as given in (3) and w*(θ_{(p0)} | θ_{(−p0)}, λ) is a 'weight' conditional density given θ_{(−p0)} and λ with support S_{p0}(θ_{(−p0)}, λ). It is then straightforward to show that (4) reduces to π(θ*_{(p0)} | data). If we use the random sample {(θ_i, λ_i), i = 1, 2, ..., n}, ĥ defined by (5) gives an estimate, denoted by π̂(θ*_{(p0)} | data), of π(θ*_{(p0)} | data). Chen (1994) showed that the optimal choice of w* in (8), in the sense of minimising the asymptotic variance, is the conditional posterior density of θ_{(p0)} given θ_{(−p0)} and λ with respect to the desired full posterior distribution π(θ, λ | data). Therefore, following the guidelines given in Chen (1994), we can choose a good conditional density w*(θ_{(p0)} | θ_{(−p0)}, λ) that has a shape roughly similar to that of the conditional posterior density of θ_{(p0)} given θ_{(−p0)} and λ. Note that the same weights, w_1, w_2, ..., w_n, are used for estimating both posterior moments and marginal posterior densities for θ.
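The resulting marginal density estimate can be written as a small routine. A minimal sketch with placeholder callables, all of which are assumptions to be supplied for a concrete model: L for the likelihood, prior for π(· | λ), w_cond for the weight conditional density, and in_S for the constraint indicator:

    import numpy as np

    def marginal_density_at(theta_star, draws, weights, L, prior, w_cond, in_S):
        # draws[i] = (theta_p0_i, theta_rest_i, lam_i) from pi*; weights = w_i of (6)
        vals = []
        for (tp0, rest, lam) in draws:
            num = L(theta_star, rest) * prior(theta_star, rest, lam) * in_S(theta_star, rest)
            den = L(tp0, rest) * prior(tp0, rest, lam)
            vals.append(w_cond(tp0, rest, lam) * num / den)   # h of (8) at this draw
        return float(np.dot(weights, vals))

Evaluating this at a grid of points θ* with the same draws and weights traces out the whole estimated marginal density.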

To obtain marginal densities for λ, we write λ = (λ_{(q0)}, λ_{(−q0)}), where 1 ≤ q0 ≤ q, λ_{(q0)} is the vector of q0 components of λ, and λ_{(−q0)} is the vector of the remaining q − q0 components. The support of the conditional posterior density of λ_{(q0)} given θ and λ_{(−q0)} is denoted by

A_{q0}(θ, λ_{(−q0)}) = {λ_{(q0)}: (θ, λ_{(q0)}, λ_{(−q0)}) ∈ S × A}.

Let λ*_{(q0)} be a fixed point and denote by π(λ*_{(q0)} | data) the joint marginal posterior density for λ_{(q0)}. By analogy with (8), we take

h(θ, λ) = {c(λ) w*(λ_{(q0)} | θ, λ_{(−q0)}) / c(λ*_{(q0)}, λ_{(−q0)})} π(θ | λ*_{(q0)}, λ_{(−q0)}) π(λ*_{(q0)}, λ_{(−q0)}) / {π(θ | λ) π(λ)},   (9)

where w*(λ_{(q0)} | θ, λ_{(−q0)}) is a 'weight' conditional density of λ_{(q0)} given θ and λ_{(−q0)} with support A_{q0}(θ, λ_{(−q0)}). Then (4) and (5) lead to π(λ*_{(q0)} | data) and its estimate π̂(λ*_{(q0)} | data), respectively. Note that, if the marginal posterior density π(λ_{(q0)} | data) is to be calculated at k points λ*_{(q0),l}, for l = 1, 2, ..., k, then π̂(λ*_{(q0)} | data) requires the evaluation of the normalising constants c(λ_i) and c(λ*_{(q0),l}, λ_{(−q0),i}), where λ_i is partitioned as (λ_{(q0),i}, λ_{(−q0),i}), for i = 1, 2, ..., n and l = 1, 2, ..., k. Therefore, estimation of marginal densities for λ is generally much more expensive than for θ.

Finally we note that, to obtain (7), we assume {(θ_i, λ_i), i = 1, 2, ..., n} is a random sample from π*(θ, λ | data). However, if one uses a Markov chain Monte Carlo sample, (7) is still valid under some regularity conditions, such as ergodicity (Tierney, 1994). Also note that ĥ, π̂(θ*_{(p0)} | data) and π̂(λ*_{(q0)} | data) are not completely determined since the normalising constants c(λ_i) and c(λ*_{(q0),l}, λ_{(−q0),i}) are unknown. In the next section, we will develop Monte Carlo methods to estimate these unknown normalising constants by an auxiliary Monte Carlo simulation.

3. COMPUTING NORMALISING CONSTANTS FOR BAYESIAN ESTIMATION

Since we noted earlier that marginal posterior densities are special cases of posterior moments, in the rest of this section we present our results only for a general h. With obvious adjustments, the results can be directly applied to marginal posterior density estimation.

Recent Monte Carlo based methods for estimating ratios of normalising constants include bridge sampling (Meng & Wong, 1996), path sampling as given in Technical Reports 337 and 440 from the Department of Statistics, University of Chicago, written by A. Gelman and X.-L. Meng, a data-augmentation-based method to estimate the marginal likelihood (Chib, 1995), reverse logistic regression (Geyer, 1995), ratio importance sampling (Chen & Shao, 1997a, b) and a simulation and asymptotic approximation method of DiCiccio et al. (1997). Here, however, for a given Monte Carlo sample {(θ_i, λ_i), i = 1, 2, ..., n}, a total of n or k × n normalising constants must be evaluated simultaneously, and it will be inefficient or computationally expensive to estimate these ratios in a pairwise manner. Since c(λ) is the normalising constant of the density function π(θ | λ)I_S(θ) and not of π*(θ, λ | data), the sample from π*(θ, λ | data) cannot be directly used to estimate c(λ). Furthermore, there are infinitely many normalising constants in {c(λ), λ ∈ A} and the λ_i in the Monte Carlo sample can be any points in A. Our problem is therefore different from that considered in Geyer (1995), although the reweighting mixtures idea of Geyer (1995) is still of use.

To present our Monte Carlo approach for computing ratios of normalising constants, let π_mix(θ) be a mixing density with support S, known up to a normalising constant, i.e.

π_mix(θ) = p_mix(θ)/c_mix,   (10)

where p_mix(θ) is completely known. Here we use the term 'mixing' in the sense that π_mix(θ) is a certain mixture of the densities π(θ | λ)I_S(θ) such that π_mix(θ) will 'cover' every π(θ | λ)I_S(θ) for λ ∈ A. We will discuss this issue in detail below. In fact, π_mix(θ) actually plays the role of an importance sampling density; Chen & Shao (1997a, b) show how to adapt importance sampling to estimate ratios of normalising constants. Let {θ_l^{mix}, l = 1, 2, ..., m} be a random sample from π_mix(θ). Then c(λ_i)/c_mix can be estimated by

ĉ_m(λ_i) = (1/m) Σ_{l=1}^m π(θ_l^{mix} | λ_i) I_S(θ_l^{mix}) / p_mix(θ_l^{mix}).   (11)

Using (11) and (5), we can approximate E^π{h(θ, λ)} by

ĥ_{n,m} = Σ_{i=1}^n ŵ_i h(θ_i, λ_i),   (12)

where

ŵ_i = {1/ĉ_m(λ_i)} [Σ_{j=1}^n 1/ĉ_m(λ_j)]^{−1} = [Σ_{j=1}^n ĉ_m(λ_i)/ĉ_m(λ_j)]^{−1}.   (13)

Note that c_mix need not be known because it cancels out in the calculation, as in (6) or (13). We define

a = E^π{h(θ, λ)},   u²(λ) = var^{π_mix}{π(θ^{mix} | λ) I_S(θ^{mix})/p_mix(θ^{mix})},

the latter being the per-draw variance of the summand in (11).

Then the following theorem gives the asymptotic properties of ĥ_{n,m}.

THEOREM 1. Assume that {(θ_i, λ_i), i = 1, 2, ..., n} and {θ_l^{mix}, l = 1, 2, ..., m} are random samples from π*(θ, λ | data) and π_mix(θ), respectively, and that

n/m → d_0,   E^{π*}[{h(θ, λ) − a}²{1 + u²(λ)}/c²(λ)] < ∞,   (14)

for some 0 ≤ d_0 < ∞. Then

n^{1/2}[ĥ_{n,m} − E^π{h(θ, λ)}] → N(0, σ²)   (15)


in distribution as n → ∞, where

σ² = σ_1² + d_0 σ_2²,   σ_1² = E^{π*}[{h(θ, λ) − a}²/c²(λ)] / [E^{π*}{1/c(λ)}]²,
σ_2² = var^{π_mix}( E^{π*}[{h(θ, λ) − a} π(θ^{mix} | λ) I_S(θ^{mix}) / {π_mix(θ^{mix}) c²(λ)}] ) / [E^{π*}{1/c(λ)}]²,   (16)

the inner expectation in σ_2² being taken over (θ, λ) ~ π* with θ^{mix} held fixed. If, in addition,

E^{π*}[{1 + h(θ, λ)}²{1 + u⁴(λ)/c⁴(λ)}] < ∞,   (17)

then the asymptotic mean squared error of ĥ_{n,m} is

lim_{n→∞} nE[ĥ_{n,m} − E^π{h(θ, λ)}]² = σ².   (18)

The proof of Theorem 1 is given in Research Report 666 from the Department of Mathematics, National University of Singapore, written by M.-H. Chen and Q.-M. Shao.

Equation (16) is intuitively appealing. The first term reflects the simulation error due to (5) for estimating ĥ using the sample {(θ_i, λ_i), i = 1, 2, ..., n}, while the second term represents the simulation error due to (11) for estimating the normalising constants c(λ_i) using the auxiliary sample {θ_l^{mix}, l = 1, 2, ..., m}.

From (18), a lower bound on the asymptotic mean squared error of ĥ_{n,m} is

lim_{n→∞} nE[ĥ_{n,m} − E^π{h(θ, λ)}]² ≥ σ_1².   (19)

It can be easily observed that, if n = o(m), then d_0 = 0 in (16) and the lower bound in (19) is achieved. Thus, estimating the normalising constants c(λ_i) does not asymptotically produce any additional simulation error.
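A minimal sketch of the two-sample scheme (11)-(13): the auxiliary π_mix sample is used once to estimate every ĉ_m(λ_i), after which the reweighting of § 2 goes through unchanged. Here prior(theta, lam) evaluates π(θ | λ), p_mix the known kernel of π_mix, and in_S the constraint indicator; all three are placeholders for a concrete model:

    import numpy as np

    def c_m_hat(lams, theta_mix, prior, p_mix, in_S):
        # (11): c_m(lambda_i) estimates c(lambda_i)/c_mix; the unknown c_mix cancels in (13)
        return np.array([
            np.mean([prior(t, lam) * in_S(t) / p_mix(t) for t in theta_mix])
            for lam in lams
        ])

    def h_hat_nm(h_values, lams, theta_mix, prior, p_mix, in_S):
        c_m = c_m_hat(lams, theta_mix, prior, p_mix, in_S)
        w = 1.0 / c_m
        w = w / w.sum()                              # weights of (13)
        return float(np.sum(w * np.asarray(h_values)))   # h-hat_{n,m} of (12)

Since the auxiliary sample enters only through ĉ_m, taking m large relative to n, so that n = o(m), drives the extra variance term d_0σ_2² to zero, as noted above.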

By Theorem 1, the simulation standard error (SE) of ĥ_{n,m} can be estimated by a first-order approximation of the asymptotic mean squared error of ĥ_{n,m}:

SE(ĥ_{n,m}) = σ̂/n^{1/2},   (20)

where σ̂² = σ̂_1² + σ̂_2², and σ̂_1² and σ̂_2² are obtained from σ_1² and d_0σ_2² in (16) by replacing the expectations and the variance with the corresponding Monte Carlo sample averages over {(θ_i, λ_i)} and {θ_l^{mix}}, c(λ) with ĉ_m(λ), a with ĥ_{n,m}, and d_0 with n/m. The simulation error estimation is important since it provides the magnitude of the simulation accuracy of the estimator ĥ_{n,m}.

In order to obtain a good estimator ĥ_{n,m}, we need to select a good π_mix. It is very difficult to prove which π_mix is optimal in the sense of minimising the asymptotic mean squared error of ĥ_{n,m}. However, a good π_mix should possess the following properties:


(a) π_mix has a heavier tail than π(θ | λ) for all λ ∈ A;
(b) π_mix has a shape similar to that of π(θ | λ) and a location close to that of π(θ | λ);
(c) conditions (14) and (17) hold.

Property (c) guarantees (15) and (18). In (11), ĉ_m(λ_i) is indeed an estimate of the ratio of two normalising constants, c(λ_i) and c_mix, by a generalised version of importance sampling; see e.g. Chen & Shao (1997b). As discussed in Chen & Shao (1997b), properties (a) and (b) are required in order to obtain an efficient estimator of c(λ_i)/c_mix. Chen & Shao (1997b) also illustrated how a bad choice of π_mix can result in an estimator of c(λ_i)/c_mix with a large or possibly infinite asymptotic variance.

Based on the above criteria, and following the idea of the reweighting mixtures method of Geyer (1995), a natural choice for π_mix is a mixture of the densities π(θ | λ)I_S(θ) over the mixing parameter λ. Let G(λ) be the mixing distribution of λ. Then π_mix is given by

π_mix(θ) ∝ ∫_A π(θ | λ)I_S(θ) dG(λ).   (21)

If λ is a vector of location parameters, then G may be chosen as a multivariate normal distribution N(μ, Σ), where μ and Σ are specified by the posterior mean and posterior covariance matrix of λ with respect to π*(θ, λ). If λ contains scale parameters, the mixing distribution for each such parameter may be chosen as a gamma distribution Ga(α, β) or an inverse gamma distribution InGa(α, β) with shape parameter α and scale parameter β, where α and β are specified by the posterior mean and variance of λ with respect to π*(θ, λ), using method-of-moments estimates. The above choices of G capture the shape of the desired posterior distribution of λ with respect to π*, and the resulting mixing distributions typically have heavier tails than those of π(θ | λ)I_S(θ). However, if the parameter space A is constrained, the aforementioned approaches may result in intensive computation, and G may instead be simply chosen as a discrete distribution such that the resulting π_mix has a heavier tail than π(θ | λ)I_S(θ) for every λ. For example, if {π(θ | λ)I_S(θ), λ ∈ A} is a family of truncated Dirichlet distributions, π_mix may be specified as the member of the same Dirichlet family that has a heavier tail than any other member of that family. Finally, a good G may be obtained as a mixture of discrete and continuous distributions; see § 4 for illustrative examples.
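The method-of-moments step for a scale parameter is elementary. A minimal sketch: fit InGa(α, β) so that its mean β/(α − 1) and variance β²/{(α − 1)²(α − 2)} match the posterior mean m and variance v of the parameter under π*:

    def inverse_gamma_mom(m, v):
        # solve mean = beta/(alpha - 1), variance = beta^2/{(alpha - 1)^2 (alpha - 2)}
        alpha = m * m / v + 2.0
        beta = m * (alpha - 1.0)
        return alpha, beta

The same two-moment matching gives N(μ, Σ) for location parameters, in line with property (b) above; the Student-t-like tails that such mixtures induce address property (a).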

Note that, if π_mix(θ) defined by (21) is not analytically available, a method for estimating a ratio of two normalising constants with different dimensions, proposed by Chen & Shao (1997a), can be applied to obtain ĉ_m(λ_i) by using a Monte Carlo sample from the distribution proportional to π(θ | λ)I_S(θ) dG(λ). To illustrate the idea, let g(λ) be the probability density function of G and let {(θ_l^{mix}, λ_l^{mix}), l = 1, 2, ..., m} be a Monte Carlo sample from π(θ | λ)I_S(θ)g(λ). Then, instead of (11), c(λ_i)/c_mix is approximated by

(1/m) Σ_{l=1}^m π(θ_l^{mix} | λ_i) w_mix(λ_l^{mix} | θ_l^{mix}) / {π(θ_l^{mix} | λ_l^{mix}) g(λ_l^{mix})},   (22)

where w_mix(λ^{mix} | θ^{mix}) is a completely known density function whose support is contained in or equal to A. Chen & Shao (1997a) showed that the optimal choice of w_mix(λ^{mix} | θ^{mix}) is the conditional density of λ^{mix} given θ^{mix} with respect to π(θ | λ)I_S(θ)g(λ).
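A minimal sketch of (22), again with placeholder callables: pairs are the draws (θ_l^{mix}, λ_l^{mix}) from π(θ | λ)I_S(θ)g(λ), and w_mix(lam, theta) evaluates the known conditional density w_mix(λ | θ):

    import numpy as np

    def c_ratio_22(lam_i, pairs, prior, g, w_mix):
        terms = [prior(t, lam_i) * w_mix(l, t) / (prior(t, l) * g(l))
                 for (t, l) in pairs]
        return np.mean(terms)    # estimates c(lambda_i)/c_mix as in (22)

Because the drawn θ_l^{mix} already lie in S, no explicit indicator is needed in the numerator.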

Finally, we comment on the independence assumption on {(θ_i, λ_i), i = 1, 2, ..., n} and {θ_l^{mix}, l = 1, 2, ..., m} in Theorem 1. When one uses Markov chain Monte Carlo sampling to generate (θ_i, λ_i) from π* or θ_l^{mix} from π_mix(θ), {(θ_i, λ_i), i = 1, 2, ..., n} or {θ_l^{mix}, l = 1, 2, ..., m} are dependent. Under certain regularity assumptions, such as ergodicity and weak dependence, the consistency and the central limit theorem for ĥ_{n,m} still hold. The only problem is the estimation of the standard error SE(ĥ_{n,m}) given in (20). One simple remedy is to obtain an approximately random sample by taking every Bth iterate in Markov chain Monte Carlo sampling, where B is selected so that the autocorrelations are negligible with respect to their standard errors; see, for example, Gelfand & Smith (1990). Other possible approaches are to use the expensive regeneration technique in Markov chain sampling (Mykland, Tierney & Yu, 1995) to obtain a random sample from different regeneration tours, effective sample sizes (Meng & Wong, 1996) to compute SE(ĥ_{n,m}), and a coupling-regeneration scheme of Johnson (1998).
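The thinning rule just described is easy to automate. A minimal sketch: take B as the smallest lag at which the sample autocorrelation of a scalar summary of the chain is within roughly two approximate standard errors, 2/√n, of zero:

    import numpy as np

    def choose_thinning(chain, max_lag=100):
        x = np.asarray(chain, dtype=float)
        x = x - x.mean()
        n = len(x)
        denom = np.dot(x, x)
        for lag in range(1, max_lag + 1):
            rho = np.dot(x[:-lag], x[lag:]) / denom
            if abs(rho) < 2.0 / np.sqrt(n):   # negligible relative to its standard error
                return lag
        return max_lag

In § 4 this rule leads to keeping every fifteenth iterate in the first example and every fifth in the second.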

4. APPLICATIONS

4.1. The Meal, Ready-to-Eat model

The Meal, Ready-to-Eat model is a constrained Bayesian hierarchical model. Chen, Nandram & Ross (1996) considered a model that includes ten entree items. For illustrative purposes, we consider here only one entree item, namely ham-chicken loaf. The meals were purchased and then were inspected for completeness and stored at four different temperatures, c_1 = 4°C, c_2 = 21°C, c_3 = 30°C and c_4 = 38°C. They were then withdrawn and tested at t_1 = 0, t_2 = 6, t_3 = 12, t_4 = 18, t_5 = 24, t_6 = 30, t_7 = 36, t_8 = 48 and t_9 = 60 months. At 4°C the food was not tested at 6, 18 and 24 months, at 21°C the food was not tested at 6 months, and at 30°C and 38°C the food was not tested after 36 and 24 months, respectively. Upon purchase the foods were immediately tested at room temperature (21°C), and thus data were only available for time 0 at room temperature. At each temperature-time combination each food item was served to 26 untrained, randomly chosen subjects who judged its acceptability on a nine-point rating scale running from 1 = dislike extremely to 9 = like extremely. The primary goal in the study of Chen et al. (1996) was to investigate the food shelf-life. However, in this example we consider modelling the mean rating scores only. Note that the panelists were used on only 23 temperature-time combinations. By design, we have a total of 4 missing combinations, which, therefore, are not included in the analysis.

Let Y_ijl be the score given by the lth panelist for ham-chicken loaf stored at temperature c_i and withdrawn at time t_j. The Y_ijl are independent and identically distributed with P(Y_ijl = k | p_ij) = p_ijk, where p_ij = (p_ij1, p_ij2, ..., p_ij9) and Σ_{k=1}^9 p_ijk = 1, for i = 1, ..., 4, j = 1, ..., 9 and k = 1, ..., 9.

The mean score is ω_ij = Σ_{k=1}^9 k p_ijk. For temperature c_i, as the quality of food deteriorates with time, we have the constraint

ω_ij ≤ ω_{i,j−1}   (j = 2, 3, ..., 9),   (23)

and, for withdrawal time t_j, as the quality of food deteriorates with temperature, we have the constraint

ω_ij ≤ ω_{i−1,j}   (i = 2, 3, 4).   (24)

In (23) and (24) there is one adjustment, that is, ω_21 ≥ ω_i2 for i = 2, 3, 4. For the missing temperature-time combinations there are obvious adjustments to constraints (23) and (24).

To facilitate the use of the Bayesian hierarchical model, we introduce latent variables Z_ijl, where

Y_ijl = k if a_{k−1} ≤ Z_ijl < a_k   (k = 1, ..., 9),   (25)


and Z_ijl | θ_ij ~ N(θ_ij, σ²), independently, for i = 1, 2, 3, 4 and j = 1, 2, ..., 9. In (25) we take a_0 = −∞, a_1 = 0, a_8 = 1, a_9 = ∞, and 0 ≤ a_2 ≤ ... ≤ a_7 ≤ 1 are specified by using the similar pastries data. The values are a_2 = 0.0986, a_3 = 0.191, a_4 = 0.299, a_5 = 0.377, a_6 = 0.524 and a_7 = 0.719 (Chen et al., 1996).

It is easy to show that

ω_ij = Σ_{k=1}^9 k [Φ{(a_k − θ_ij)/σ} − Φ{(a_{k−1} − θ_ij)/σ}],

where Φ is the standard normal cumulative distribution function. Since ω_ij is an increasing function of θ_ij, it follows that the constraints (23) and (24) on the mean scores are equivalent to the constraints

θ_ij ≤ θ_{i,j−1}   (j = 2, 3, ..., 9),   (26)

θ_ij ≤ θ_{i−1,j}   (i = 2, 3, 4).   (27)

For the missing temperature-time combinations, constraints (26) and (27) require the same adjustment as for constraints (23) and (24).
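The mean-score map and its monotonicity are easy to check numerically. A minimal sketch with the cut-points a_k given above, using scipy's normal distribution function for Φ:

    import numpy as np
    from scipy.stats import norm

    a = np.array([-np.inf, 0.0, 0.0986, 0.191, 0.299, 0.377, 0.524, 0.719, 1.0, np.inf])

    def omega(theta, sigma):
        # p_k = Phi{(a_k - theta)/sigma} - Phi{(a_{k-1} - theta)/sigma}, k = 1, ..., 9
        p = norm.cdf((a[1:] - theta) / sigma) - norm.cdf((a[:-1] - theta) / sigma)
        return float(np.sum(np.arange(1, 10) * p))

    # omega is increasing in theta, so ordering the theta_ij orders the mean scores
    assert omega(0.3, 0.2) < omega(0.5, 0.2)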

Instead of using a measurement error model θ_ij = λ + e_ij, as in Chen et al. (1996), we consider a temperature-effect additive model of the form

θ_ij = λ_i + ε_ij,   (28)

subject to constraints (26) and (27), where ε_ij ~ N(0, δ²), independently, and δ² is the variance of ε_ij, for i = 1, 2, 3, 4 and j = 1, 2, ..., 9. For the room temperature, in order to incorporate the adjustment in constraints (26) and (27), we set θ_21 = λ_1 + ε_11. We also add the constraints

λ_1 ≥ λ_2 ≥ λ_3 ≥ λ_4   (29)

to ensure consistency with (27). We take a diffuse prior for the λ_i over the constrained parameter space defined by (29), and we choose inverse gamma priors for σ² and δ²,

π(σ² | ζ, η) ∝ (σ²)^{−(η+1)} e^{−ζ/σ²},   π(δ² | α, β) ∝ (δ²)^{−(α+1)} e^{−β/δ²}.   (30)

In (30), ζ, η, α and β are to be specified by using the pastries data, and the values are ζ = 16.88, η = 0.83, α = 4.60 and β = 0.01.

Let S be the constrained parameter space associated with (26) and (27). Also, let θ = (θ_ij, σ²) and λ = (λ_1, ..., λ_4, δ²). Therefore, θ is a 24-dimensional vector, p = 24, and λ is a 5-dimensional vector, q = 5. Finally, we denote by θ^{(−σ²)} the vector of all elements of θ except for σ².

As a result of constraints (26) and (27) and the temperature-effect additive model (28), the prior distribution for θ given λ depends on the normalising constant

c(λ) = ∫_S Π_{(i,j)} (1/δ) φ{(θ_ij − λ_i)/δ} dθ^{(−σ²)},   (31)

where φ is the standard normal probability density function and the (i, j) are taken corresponding to the non-missing temperature-time combinations. Analytical evaluation of c(λ) does not appear possible.


Let Z = {Z_ijl}. Then the posterior distribution is

π(Z, θ, λ | data) ∝ [Π_{(i,j)} Π_{l=1}^{26} Π_{k=1}^{9} {I(a_{k−1} ≤ Z_ijl < a_k)}^{I(Y_ijl = k)} (1/σ) φ{(Z_ijl − θ_ij)/σ}] {π(θ^{(−σ²)} | λ)/c(λ)} π(σ² | ζ, η) π(λ),

where the indicator function I(Y_ijl = k) = 1 if Y_ijl = k, and 0 otherwise, π(θ^{(−σ²)} | λ) is determined by (28), π(λ) = π(λ_1, ..., λ_4)π(δ² | α, β), π(λ_1, ..., λ_4) ∝ 1, subject to constraint (29), and π(σ² | ζ, η) and π(δ² | α, β) are given in (30). The pseudo-posterior distribution is

π*(Z, θ, λ | data) ∝ [Π_{(i,j)} Π_{l=1}^{26} Π_{k=1}^{9} {I(a_{k−1} ≤ Z_ijl < a_k)}^{I(Y_ijl = k)} (1/σ) φ{(Z_ijl − θ_ij)/σ}] π(θ^{(−σ²)} | λ) I_S(θ^{(−σ²)}) π(σ² | ζ, η) π(λ).

As in Chen et al. (1996), we used the Gibbs sampler to generate Z, θ and λ from the pseudo-posterior distribution π*(Z, θ, λ | data). Using the several convergence diagnostic procedures recommended by Cowles & Carlin (1996), we found that the Gibbs sampler practically converges within 500 iterations. We also found that the autocorrelations were negligible when we took every fifteenth of the Gibbs iterates. As sampling from the pseudo-posterior distribution π*(Z, θ, λ | data) is much cheaper than computing normalising constants, every 15th Gibbs iterate after convergence was used for estimation, and 30 000 Gibbs iterates produced an approximately random sample {(Z_i, θ_i, λ_i), i = 1, 2, ..., n} of size n = 2000. In this application, (λ_1, ..., λ_4) is a vector of location parameters and δ² is a scale parameter. In order to estimate c(λ_i) simultaneously for i = 1, 2, ..., n, we chose a mixing distribution G(λ) = G(λ_1, ..., λ_4)G(δ²). In view of the constraints (29), for the sake of computational simplicity G(λ_1, ..., λ_4) was chosen as a degenerate distribution at (λ_1*, ..., λ_4*), where λ_i* is the posterior mean of λ_i with respect to π*(Z, θ, λ | data), for i = 1, ..., 4. For the scale parameter δ², we chose G(δ²) to be an inverse gamma distribution with probability density function g(δ²) ∝ (1/δ²)^{(α*+1)} e^{−β*/δ²} for δ² > 0, where α* and β* are specified by the posterior mean and the posterior variance of δ² with respect to π*(Z, θ, λ | data), using method-of-moments estimates. With the above choice of G and using (21), we have

π_mix(θ^{mix}) ∝ {1 + Σ_{(i,j)} (θ_ij^{mix} − λ_i*)²/(2β*)}^{−(α* + 23/2)},   (32)

where the θ_ij^{mix} are subject to constraints (26) and (27). For this particular dataset, we obtained α* = 7.65 and β* = 0.011. Therefore, π_mix(θ^{mix}) is a truncated multivariate t distribution, which has a heavier tail than the truncated multivariate normal distribution π(θ^{(−σ²)} | λ) for every λ. We used an algorithm of Geweke (1991) to generate a random sample {θ_l^{mix}, l = 1, 2, ..., m}, with m = 10 000, from π_mix. Then, using (11), (12) and (20) and choosing appropriate h's, we obtained the posterior mean rating scores, the simulation standard errors and their posterior standard deviations; these are reported in Table 1.
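For reference, the kernel of (32) is cheap to evaluate, and this p_mix evaluation is all that (11) requires from π_mix. A minimal sketch, in which lam_star denotes the vector of posterior means of λ_1, ..., λ_4 under π* and cell_temp[j] the temperature index of the jth of the 23 non-missing cells (both placeholders):

    import numpy as np

    ALPHA_STAR, BETA_STAR = 7.65, 0.011

    def log_pi_mix_kernel(theta, lam_star, cell_temp):
        # log of (32) up to an additive constant, for theta satisfying (26) and (27)
        q = np.sum((theta - lam_star[cell_temp]) ** 2)
        return -(ALPHA_STAR + len(theta) / 2.0) * np.log(1.0 + q / (2.0 * BETA_STAR))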

As expected, the temperature-effect additive model leads to slightly higher (lower) mean rating scores at low (high) temperatures than does the measurement error model. For example, the estimated mean rating scores for the measurement error model are 5.785, 5.586, 5.487, 5.378 and 5.230 for 12, 30, 36, 48 and 60 months at temperature 4°C, and 5.470, 5.264, 5.123 and 4.965 for 6, 12, 18 and 24 months at temperature 38°C. Note that the simulation standard errors of the posterior means are small and within 5% of the posterior standard deviations.


Table 1. The Meal, Ready-to-Eat model example: estimates of the mean rating scores

                                          Time (months)
Temperature    0        6        12       18       24       30       36       48       60
4°C            —        —        5.854    —        —        5.683    5.590    5.485    5.344
               —        —        (0.0040) —        —        (0.0037) (0.0034) (0.0038) (0.005)
               —        —        (0.108)  —        —        (0.104)  (0.099)  (0.105)  (0.138)
21°C           5.984    —        5.739    5.651    5.581    5.499    5.428    5.323    5.137
               (0.0046) —        (0.0038) (0.0037) (0.0035) (0.0034) (0.0036) (0.0040) (0.0051)
               (0.131)  —        (0.100)  (0.097)  (0.095)  (0.094)  (0.099)  (0.111)  (0.149)
30°C           —        5.727    5.485    5.352    5.204    5.082    4.811    —        —
               —        (0.0057) (0.0052) (0.0044) (0.0050) (0.0053) (0.0082) —        —
               —        (0.148)  (0.123)  (0.119)  (0.129)  (0.142)  (0.206)  —        —
38°C           —        5.346    5.139    4.992    4.835    —        —        —        —
               —        (0.0067) (0.0065) (0.0059) (0.0076) —        —        —        —
               —        (0.169)  (0.149)  (0.146)  (0.172)  —        —        —        —

Top entry, mean; middle entry, simulation standard error; bottom entry, posterior standard deviation.

Lastly, we obtained the marginal posterior densities for θ_21, θ_26 and θ_45. We chose w*(θ_i'j' | θ_{(−i'j')}, Z, λ) to be the conditional posterior density of θ_i'j' given the other parameters, where θ_{(−i'j')} is the vector of all elements of θ except for θ_i'j'. Then, for example, the marginal density of θ_21 is estimated by

π̂(θ_21 | data) = Σ_{i=1}^n ŵ_i w*(θ_21 | θ_{i(−21)}, Z_i, λ_i),



where w*(θ_21 | θ_{i(−21)}, Z_i, λ_i) is a normal density with mean B_i Z̄_{21,i} + (1 − B_i)λ_{1,i} and variance B_i σ_i²/26, truncated to [a_i^θ, ∞), with

B_i = δ_i²(δ_i² + σ_i²/26)^{−1},   Z̄_{21,i} = (1/26) Σ_{l=1}^{26} Z_{21l,i},

and a_i^θ = max(θ_{22,i}, θ_{32,i}, θ_{42,i}), the largest of the parameters constrained to lie below θ_21. The results are presented in Fig. 1.

[Fig. 1. The Meal, Ready-to-Eat model example: marginal posterior density plots for θ_21 (solid line, right), θ_26 (dotted line, middle) and θ_45 (dashed line, left).]

Note that in Fig. 1 the posterior means (posterior standard deviations) are 0.500 (0.019) for θ_21, 0.423 (0.013) for θ_26 and 0.346 (0.022) for θ_45. Figure 1 also shows that the marginal densities are unimodal and approximately symmetric.

4.2. Job satisfaction example

As another example, we consider the job satisfaction example (Agresti, 1990, pp. 20-1). The dataset is given in Table 2.

Table 2. Job satisfaction example: cross-classification of job satisfaction by income

                                        Job satisfaction
Income (US$)        i    Very           Little         Moderately    Very
                         dissatisfied   dissatisfied   satisfied     satisfied
                         (j = 1)        (j = 2)        (j = 3)       (j = 4)
< 6000              1    20             24             80            82
6000-15 000         2    22             38             104           125
15 000-25 000       3    13             28             81            113
> 25 000            4    7              18             54            92

Let n_i = (n_i1, n_i2, n_i3, n_i4), where n_ij is the cell count corresponding to income level i and job satisfaction level j, for i, j = 1, 2, 3, 4. Then n_i has a multinomial distribution:

n_i | θ_i ~ Multinomial(Σ_{j=1}^4 n_ij; θ_i1, θ_i2, θ_i3, θ_i4),

where θ_i = (θ_i1, θ_i2, θ_i3, θ_i4) and Σ_{j=1}^4 θ_ij = 1. We write θ = (θ_1, θ_2, θ_3, θ_4). Thus the likelihood function is

L(data, θ) = Π_{i=1}^4 (Σ_{j=1}^4 n_ij)! Π_{j=1}^4 θ_ij^{n_ij}/n_ij!.   (33)

As discussed in Agresti (1990), there is a tendency for low income to occur with low job satisfaction and high income with high job satisfaction. Therefore, we naturally have the following constraints: within each income level,

θ_i1 ≤ θ_i2 ≤ θ_i3 ≤ θ_i4   (i = 1, 2, 3, 4);   (34)

for job dissatisfaction,

θ_11 ≥ θ_21 ≥ θ_31 ≥ θ_41,   θ_11 + θ_12 ≥ θ_21 + θ_22 ≥ θ_31 + θ_32 ≥ θ_41 + θ_42;   (35)

and, for job satisfaction,

θ_14 ≤ θ_24 ≤ θ_34 ≤ θ_44,   θ_13 + θ_14 ≤ θ_23 + θ_24 ≤ θ_33 + θ_34 ≤ θ_43 + θ_44.   (36)

Let S be the resulting constrained parameter space. We take priors θ_i | λ ~ Dir(λ), for i = 1, 2, 3, 4, subject to constraints (34), (35) and (36), where λ = (λ_1, λ_2, λ_3, λ_4). Therefore,


we have

π(θ | λ) = Π_{i=1}^4 {Γ(λ_1 + λ_2 + λ_3 + λ_4)/Π_{j=1}^4 Γ(λ_j)} Π_{j=1}^4 θ_ij^{λ_j − 1},   (37)

and c(λ) = ∫_S π(θ | λ) dθ. We choose a flat prior for λ:

π(λ) ∝ 1   (1 ≤ λ_i ≤ u_i, i = 1, 2, 3, 4),   (38)

where the u_i are specified. We should not give too much weight to large values of the λ_i: since we choose a flat prior for λ, we should not take the u_i too large. For illustrative purposes, we specified u_1 = 7.5, u_2 = 10, u_3 = 15 and u_4 = 20 in our calculation. We used the Gibbs sampling algorithm to draw θ and λ from the pseudo-posterior π*(θ, λ | data). Note that the conditional distribution of θ_ij given the other parameters is a truncated beta, from which it is easy to sample, while the conditional density of λ_i given the others is log-concave, which can be handled by the adaptive rejection sampling algorithm of Gilks & Wild (1992). As with the previous example, we checked convergence of the Gibbs sampler and autocorrelations of Gibbs iterates. It was found that the autocorrelations were negligible when we took every fifth Gibbs iterate. Thus, 12 500 Gibbs iterates after convergence produced an approximately random sample {(θ_i, λ_i), i = 1, 2, ..., n} of size n = 2500. In order to estimate the c(λ_i) simultaneously, the complexity of π(θ | λ) led us to use simply

π_mix(θ^{mix}) ∝ I_S(θ^{mix}),

that is, π(θ^{mix} | λ)I_S(θ^{mix}) with λ = (1, 1, 1, 1). Note that {π(θ | λ), 1 ≤ λ_i ≤ u_i, i = 1, ..., 4} is indeed a family of truncated Dirichlet distributions and π_mix(θ^{mix}) is the member of this family corresponding to λ_i = 1 for i = 1, ..., 4. It can be easily observed that π_mix(θ^{mix}) has a heavier tail than π(θ | λ) for all λ with 1 ≤ λ_i ≤ u_i (i = 1, 2, 3, 4), and it can also be proven that, with the above π_mix(θ^{mix}), conditions (14) and (17) hold as long as the second moment of h(θ, λ) with respect to π* exists. We then generated a random sample {θ_l^{mix}, l = 1, 2, ..., m} with m = 50 000 from π_mix. Using (11) and (12), we obtained the posterior means and standard deviations for θ and λ that are presented in Table 3. The simulation standard errors of the posterior means were obtained but are not reported here; they were all within 5% of the posterior standard deviations. Note that the Bayesian estimates of θ_ij given in Table 3 are very close to the maximum likelihood estimates. However, if one ignores the normalising constant c(λ) and treats c(λ) as a constant independent of λ, the 'posterior means' of θ and λ are inaccurate. For example, if we ignore c(λ), the estimates (posterior standard deviations) of λ_i for i = 1, 2, 3, 4 are 2.983 (0.911), 5.070 (1.343), 12.434 (1.869) and 16.698 (1.495).

Table 3. Estimates of the job satisfaction cell probabilities, with posterior standard deviations in parentheses

i    θ_i1            θ_i2            θ_i3            θ_i4
1    0.103 (0.012)   0.130 (0.015)   0.362 (0.016)   0.404 (0.016)
2    0.078 (0.010)   0.128 (0.014)   0.357 (0.020)   0.438 (0.018)
3    0.063 (0.010)   0.113 (0.016)   0.346 (0.023)   0.479 (0.021)
4    0.045 (0.012)   0.096 (0.017)   0.326 (0.027)   0.534 (0.028)

λ_1             λ_2             λ_3              λ_4
3.465 (1.064)   3.563 (1.389)   12.778 (2.236)   14.045 (3.825)
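Because π_mix is uniform over S here, the auxiliary sample can be drawn by simple rejection and (11) has a closed form for each λ. A minimal sketch, not the authors' implementation, and adequate only as an illustration since rejection is slow when S has small prior probability:

    import numpy as np
    from scipy.special import gammaln

    rng = np.random.default_rng(1)

    def in_S(theta):
        # theta: 4 x 4 matrix; rows i = income levels, columns j = satisfaction levels
        within = np.all(np.diff(theta, axis=1) >= 0)                                  # (34)
        dissat = (np.all(np.diff(theta[:, 0]) <= 0)
                  and np.all(np.diff(theta[:, 0] + theta[:, 1]) <= 0))                # (35)
        sat = (np.all(np.diff(theta[:, 3]) >= 0)
               and np.all(np.diff(theta[:, 2] + theta[:, 3]) >= 0))                   # (36)
        return bool(within and dissat and sat)

    def draw_theta_mix():
        # pi_mix is uniform over S; Dir(1,1,1,1) rows are exchangeable, so sorting each
        # row imposes (34) exactly and only the cross-row constraints need rejection
        while True:
            theta = np.sort(rng.dirichlet(np.ones(4), size=4), axis=1)
            if in_S(theta):
                return theta

    def c_m_hat(lam, theta_mix_draws):
        # (11) with p_mix = pi(theta | 1,1,1,1) I_S(theta): estimates c(lambda)/c(1,1,1,1)
        lam = np.asarray(lam, dtype=float)
        # ratio of Dirichlet(lam) to Dirichlet(1,1,1,1) row constants; Gamma(4) = 6
        log_const = 4.0 * (gammaln(lam.sum()) - gammaln(lam).sum() - gammaln(4.0))
        vals = [np.exp(log_const + np.sum((lam - 1.0) * np.log(t)))
                for t in theta_mix_draws]
        return float(np.mean(vals))

With theta_mix = [draw_theta_mix() for _ in range(m)], calling c_m_hat(lam_i, theta_mix) for each sampled λ_i yields the quantities needed for the weights in (12) and (13).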


It is interesting to conduct a sensitivity study of prior specifications; such a study has not been done before, mainly because of the computational difficulty. Let π_1(θ) ∝ 1 for θ ∈ S and π_2(θ, λ) = π(θ | λ)π(λ), where π(θ | λ) and π(λ) are defined in (37) and (38). Denote by m_1(data) and m_2(data) the corresponding marginal likelihoods with respect to π_1 and π_2, and let r be their ratio. Then

r = m_1(data)/m_2(data) = {c^{−1}(1, 1, 1, 1) ∫_S L(data, θ) dθ} / {∫_S ∫_A L(data, θ){π(θ | λ)/c(λ)}π(λ) dθ dλ},   (39)

where L(data, θ) is given in (33) and A = {λ: 1 ≤ λ_i ≤ u_i for i = 1, 2, 3, 4}. Since in (39) the numerator contains a 16-dimensional integral and the denominator a 20-dimensional integral, we are dealing with a problem of evaluating a ratio of two normalising constants for densities of different dimensions. We apply the ratio importance sampling of Chen & Shao (1997a) to compute r. In order to use the available random sample {(θ_i, λ_i), i = 1, 2, ..., n}, it is natural to choose π*(θ, λ | data) as the ratio importance sampling density. Based on (11), the estimator of r is

r̂ = [Σ_{i=1}^n c^{−1}(1, 1, 1, 1) L(data, θ_i) w**(λ_i | θ_i) {π*(θ_i, λ_i | data)}^{−1}] / [Σ_{i=1}^n L(data, θ_i) {π(θ_i | λ_i)/(c(1, 1, 1, 1) ĉ_m(λ_i))} π(λ_i) {π*(θ_i, λ_i | data)}^{−1}],   (40)

where w**(λ | θ) is an arbitrary conditional density of λ given θ defined on A. Note that c(1, 1, 1, 1) is indeed the normalising constant of π_mix. Simplification of (40) leads to

r̂ = [Σ_{i=1}^n w**(λ_i | θ_i)/{π(θ_i | λ_i)π(λ_i)}] [Σ_{i=1}^n 1/ĉ_m(λ_i)]^{−1}.   (41)

The simulation standard error based on the first-order approximation of var(r̂) is thus given by

SE(r̂) = (r̂/n) {Σ_{i=1}^n (d_i/d̄ − e_i/ē)²}^{1/2},

where d_i = w**(λ_i | θ_i)/{π(θ_i | λ_i)π(λ_i)}, e_i = 1/ĉ_m(λ_i), and d̄ and ē are the corresponding sample means

(Chen & Shao, 1997a). For simplicity, we chose w**(λ | θ) = π(λ). Using the same random samples, {(θ_i, λ_i), i = 1, 2, ..., n} and {θ_l^{mix}, l = 1, 2, ..., m}, we obtained r̂ = 0.0153 and SE(r̂) = 0.0028. The small value of r̂ indicates that the marginal likelihood is sensitive to the prior specification and that π_2 is better than π_1 in the sense of maximising the marginal likelihood.
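A minimal sketch of (41) and the first-order standard error above, in terms of the per-draw quantities d_i and e_i (with the choice w** = π(λ), d_i reduces to 1/π(θ_i | λ_i)):

    import numpy as np

    def r_hat_and_se(d, e):
        # r-hat of (41): ratio of the two sample sums
        d = np.asarray(d, dtype=float)
        e = np.asarray(e, dtype=float)
        r = d.sum() / e.sum()
        n = len(d)
        z = d / d.mean() - e / e.mean()          # delta-method linearisation of the ratio
        se = abs(r) * np.sqrt(np.mean(z * z)) / np.sqrt(n)
        return r, se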

5. CONCLUDING REMARKS

In this paper, we develop a Monte Carlo method to solve problems with normalising constants that naturally arise in Bayesian hierarchical modelling (Lindley & Smith, 1972) when parameter spaces are constrained. Our method is essentially an extension of importance sampling; see, for example, Geweke (1989). First, we temporarily assume that the importance weights w_i in (5), which are functions of the c(λ_i), are known. Then, we estimate the w_i by an auxiliary Monte Carlo sample from a mixing distribution π_mix. Although in this paper we assume that λ has a continuous distribution, our method can be easily extended to the case where π(λ) is a discrete distribution. Moreover, our method can be applied to the sensitivity study of prior specification, as in § 4.2, which will be useful in robust Bayesian analysis.


ACKNOWLEDGEMENT

The authors thank the editor, the executive editor and the two referees for their helpful comments and suggestions, which led to a significant improvement. Dr Chen's research was partially supported by a National Science Foundation grant.

REFERENCES

AGRESTI, A. (1990). Categorical Data Analysis. New York: Wiley.
CHEN, M.-H. (1994). Importance-weighted marginal Bayesian posterior density estimation. J. Am. Statist. Assoc. 89, 818-24.
CHEN, M.-H. (1996). Markov chain Monte Carlo sampling for evaluating multidimensional integrals with application to Bayesian computation. In Proc. Bayesian Statist. Sectn, pp. 95-100. Washington, DC: Am. Statist. Assoc.
CHEN, M.-H. & DEELY, J. J. (1996). Bayesian analysis for a constrained linear multiple regression problem for predicting the new crop of apples. J. Agric. Biol. Envir. Statist. 1, 467-89.
CHEN, M.-H., NANDRAM, B. & ROSS, E. W. (1996). Bayesian prediction of the shelf-life of a military ration with sensory data. J. Agric. Biol. Envir. Statist. 1, 377-92.
CHEN, M.-H. & SHAO, Q.-M. (1997a). Estimating ratios of normalizing constants for densities with different dimensions. Statist. Sinica 7, 607-30.
CHEN, M.-H. & SHAO, Q.-M. (1997b). On Monte Carlo methods for estimating ratios of normalizing constants. Ann. Statist. 25, 1563-94.
CHIB, S. (1995). Marginal likelihood from the Gibbs output. J. Am. Statist. Assoc. 90, 1313-21.
COWLES, M. K. & CARLIN, B. P. (1996). Markov chain Monte Carlo convergence diagnostics: a comparative review. J. Am. Statist. Assoc. 91, 883-904.
DICICCIO, T. J., KASS, R. E., RAFTERY, A. & WASSERMAN, L. (1997). Computing Bayes factors by combining simulation and asymptotic approximation. J. Am. Statist. Assoc. 92, 903-14.
GELFAND, A. E. & SMITH, A. F. M. (1990). Sampling-based approaches to calculating marginal densities. J. Am. Statist. Assoc. 85, 398-409.
GELFAND, A. E., SMITH, A. F. M. & LEE, T.-M. (1992). Bayesian analysis of constrained parameter and truncated data problems using Gibbs sampling. J. Am. Statist. Assoc. 87, 523-32.
GEMAN, S. & GEMAN, D. (1984). Stochastic relaxation, Gibbs distributions and the Bayesian restoration of images. IEEE Trans. Pat. Anal. Mach. Intel. 6, 721-41.
GEWEKE, J. (1989). Bayesian inference in econometric models using Monte Carlo integration. Econometrica 57, 1317-40.
GEWEKE, J. (1991). Efficient simulation from the multivariate normal and Student-t distributions subject to linear constraints. In Computing Science and Statistics: Proceedings of the Twenty-Third Symposium on the Interface, Ed. E. M. Keramidas, pp. 571-8. Fairfax Station, VA: Interface Foundation of North America Inc.
GEYER, C. J. (1995). Estimation and optimization of functions. In Markov Chain Monte Carlo in Practice, Ed. W. R. Gilks, S. Richardson and D. J. Spiegelhalter, pp. 241-58. London: Chapman and Hall.
GILKS, W. R. & WILD, P. (1992). Adaptive rejection sampling for Gibbs sampling. Appl. Statist. 41, 337-48.
HASTINGS, W. K. (1970). Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57, 97-109.
JOHNSON, V. E. (1998). A coupling-regeneration scheme for diagnosing convergence in Markov chain Monte Carlo algorithms. J. Am. Statist. Assoc. 93. To appear.
LAROSE, D. & DEY, D. K. (1996). Weighted distributions viewed in the context of model selection: a Bayesian perspective. Test 5, 227-46.
LINDLEY, D. V. & SMITH, A. F. M. (1972). Bayes estimators for the linear model (with Discussion). J. R. Statist. Soc. B 34, 1-41.
MENG, X.-L. & WONG, W. H. (1996). Simulating ratios of normalizing constants via a simple identity: a theoretical exploration. Statist. Sinica 6, 831-60.
METROPOLIS, N., ROSENBLUTH, A. W., ROSENBLUTH, M. N., TELLER, A. H. & TELLER, E. (1953). Equation of state calculations by fast computing machines. J. Chem. Phys. 21, 1087-92.
MYKLAND, P., TIERNEY, L. & YU, B. (1995). Regeneration in Markov chain samplers. J. Am. Statist. Assoc. 90, 233-41.
TIERNEY, L. (1994). Markov chains for exploring posterior distributions (with Discussion). Ann. Statist. 22, 1701-62.

[Received March 1996. Revised July 1997]