
arXiv:1203.3816v1 [astro-ph.IM] 16 Mar 2012

Mon. Not. R. Astron. Soc. 000, 1–17 (2012)    Printed 20 March 2012    (MN LaTeX style file v2.2)

Computational statistics using the Bayesian Inference Engine

Martin D. Weinberg
Department of Astronomy, University of Massachusetts, Amherst MA 01003-9305

ABSTRACT
This paper introduces the Bayesian Inference Engine (BIE), a general parallel-optimised software package for parameter inference and model selection. This package is motivated by the analysis needs of modern astronomical surveys and the need to organise and reuse expensive derived data. I describe key concepts that illustrate the power of Bayesian inference to address these needs and outline the computational challenge. The techniques presented are based on experience gained in modelling star-counts and stellar populations, analysing the morphology of galaxy images, and performing Bayesian investigations of semi-analytic models of galaxy formation. These inference problems require advanced Markov chain Monte Carlo (MCMC) algorithms that expedite sampling, mixing, and the analysis of the Bayesian posterior distribution. The BIE was designed to be a collaborative platform for applying Bayesian methodology to astronomy. By providing a variety of statistical algorithms for all phases of the inference problem, a user may explore a variety of approaches with a single model implementation. Indeed, each of the separate scientific investigations above has benefited from the solutions posed for the other investigations, and I anticipate that the same solutions will be of general value for other areas of astronomical research. Finally, to protect one's computational investment against loss from equipment failure and human error, the BIE includes a comprehensive persistence system that enables byte-level checkpointing and restoration. Additional technical details and download instructions are available from http://www.astro.umass.edu/bie. The BIE is distributed under the GNU GPL.

Key words: methods: data analysis - methods: numerical - methods: statistical - astronomical data bases: miscellaneous - virtual observatory tools

1 INTRODUCTION

Inference is fundamental to the scientific process. We may broadly identify two categories of inference problems: 1) estimation—finding the parameters of a theory or model from data; and 2) hypothesis testing—determining which theory, indeed if any, is supported by the data. Astronomers increasingly rely on numerical data analysis, but most cannot take full advantage of the power afforded by present-day computational statistics for attacking the inference problem owing to a lack of tools. This is especially critical when data come from multiple instruments and surveys. The different data characteristics of each survey include varied selection effects and inhomogeneous error models. Moreover, the information content of large survey databases can in principle determine models with many parameters, but exhaustive exploration of parameter space is often not feasible.

These classes of estimation problems are readily posed by Bayesian inference, which determines model parameters, θ, while allowing for straightforward incorporation of heterogeneous selection biases. In the Bayesian paradigm, current knowledge about the model parameters is expressed as a probability distribution called the prior distribution, π(θ). This is the anticipated distribution of parameters for the postulated model before obtaining any measurements. This should include one's understanding of the model parameters in their theoretical context. When new data D becomes available, the information content is expressed as P(D|θ), the distribution of the observed data given the model parameters. This will be familiar to some as the classical likelihood function, L(D|θ). This information is then combined with the prior to produce an updated probability distribution called the posterior distribution, P(θ|D). Bayes' Theorem defines this update mathematically:

P(θ|D) = π(θ) P(D|θ) / ∫ π(θ) P(D|θ) dθ .   (1)

Many will recognise this as the multiplicative rule for conditional probability. Combined with the concept of sample spaces, measure theory, and Monte Carlo computation, Bayes' theorem provides a rich framework for the quantitative investigation of a wide variety of inference problems, such as classification and cluster analyses, which broadly extends the two groups described above. Later sections will illustrate the importance and utility of explicit quantification of the prior information.
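
As a concrete illustration of equation (1), the short sketch below evaluates a posterior on a parameter grid for a toy Gaussian problem. It is purely illustrative (not BIE code); the model, data values, and prior width are invented for the example.

    import numpy as np

    # Toy illustration of equation (1): posterior for the mean of Gaussian data,
    # evaluated on a grid.  Data values and the prior width are invented.
    theta = np.linspace(-5.0, 5.0, 2001)                 # parameter grid
    prior = np.exp(-0.5 * theta**2 / 4.0)                # pi(theta): N(0, 2^2), unnormalised
    data = np.array([0.8, 1.3, 0.4])                     # observed data D (unit variance assumed)
    loglike = -0.5 * ((data[:, None] - theta) ** 2).sum(axis=0)   # log P(D|theta)
    post = prior * np.exp(loglike)
    post /= np.trapz(post, theta)                        # divide by the integral in eq. (1)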

Why use the Bayesian framework? To begin, the Bayesian approach unifies both aspects of the inference problem: estimation and hypothesis testing. For example, given a galaxy image and several families of brightness profiles, we would like to determine both the distribution of parameters for each family and which family is best supported by the data. A classical analysis would report the


Maximum Likelihood (ML) estimate for each model using a χ²-type statistic (Pearson 1900) and prefer the fit with the lowest value of χ² per degree of freedom. However, the χ² statistic will grow with sample size in the presence of measurement errors. This leads to the well-known over-fitting problem where the Pearson-type χ² test will reject the correct distribution in favour of one which better describes the deviations caused by the measurement errors. This issue may be treated in a variety of ways, but the Bayesian approach naturally prefers the model with the smallest number of dimensions that can explain the data distribution through the specification of the prior information. The Bayesian approach further emphasises that model comparison problems must depend on the prior distribution. See §2.2 for more details.

The computational complexity of applying equation (1) directly grows quickly with the number of model parameters and becomes intractable before the volume of currently available large data sets is reached. However, Monte Carlo algorithms based on Markov chains for drawing samples from the posterior distribution promise to make the Bayesian approach very widely applicable (e.g. see Robert & Casella 2004). In turn, the application of Bayesian methods in cosmology and astrophysics has flourished over the past decade, spurred by data sets of increasing size and complexity (Trotta 2008). Once a scientist can determine the posterior distribution, rigorous credible bounds on parameters and powerful probability-based methods for selecting between competing models and hypotheses immediately follow. This statistical approach is superior to those commonly used in astronomy because it makes more efficient use of all the available information and allows one to test astronomical hypotheses directly. To realise this promise for astronomical applications, we need a software system designed to handle both large data sets and large model spaces simultaneously.

Beginning in 2000, a multidisciplinary investigator team from the Departments of Astronomy and Computer Science at UMass designed and implemented the Bayesian Inference Engine¹, a Markov chain Monte Carlo (MCMC) parallel software platform for performing statistical inference over very large data sets. We focused on probability-based Bayesian statistical methods because they provide maximum flexibility in incorporating and using all available information in a model-data comparison. For example, multiple data sources can be naturally combined and their selection effects, which must be specified by the data provider to obtain a meaningful statistical inference, are easily incorporated. In this way, the BIE provides a platform for investigating inference using the virtual observatory paradigm.

This paper has multiple goals. I begin by introducing in §2 the concepts in Bayesian inference that illustrate its power as a framework for parameter estimation and model selection for astronomical problems. This power, not surprisingly, comes with significant computational challenges that inform our design for the BIE; the features of the package are described in Appendix A. I then describe and motivate our choice of MCMC algorithms (§3); the BIE allows additional algorithms to be straightforwardly added as needed. This is followed by a brief summary of BIE-enabled research in §4. I summarise in §5.

¹ See http://www.astro.umass.edu/bie for a detailed description and download instructions.

2 WHAT DO ASTRONOMERS WANT AND NEED?

2.1 Parameter estimation

Many astronomical data analysis problems are posed as parameter estimates. For example: 1) one measures a spectral energy distribution of an object and would like to estimate its temperature; or 2) one measures the flux profile of a disk galaxy and would like to estimate its scale length. In these problems, one is asserting that the underlying model is true and testing the hypothesis that the parameter, temperature or scale length, has a particular value.

Bayesian inference approaches these problems with the following three steps, reflecting the standard practice of the scientific method: 1) numerically quantify a prior belief in the hypothesis; 2) collect data that will either be consistent or inconsistent with the hypothesis; 3) compute the new confidence in the hypothesis given the new data. These steps may be repeated to achieve the desired degree of confidence. A clever observer will design campaigns that refine confidence efficiently (i.e., that make the confidence high or low). In the context of our simple examples, one may believe that the spectral energy distribution is that of an M-dwarf star and one's prior belief is then a distribution of values centred on 2000 K. After measuring the spectral energy distribution, the prior distribution of temperature is combined with the probability of observing the data for a particular temperature to get a refined distribution of the temperature of the object. Notice that this procedure does not result in a single value. Rather, the posterior probability distribution is used to estimate a credible interval (also known as a Bayesian confidence interval). Of course, credible intervals and regions are only a simple summary of the information contained in the posterior distribution. Unlike classical statistics, Bayesian inference does not rely on a significance evaluation based on theoretical or empirical reference distributions that are valid in the limit of very large data sets. Rather it specifies the probability distribution function for the parameters explicitly based on the data at hand.
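
Once a posterior sample is in hand (for example, from any of the MCMC methods of Section 3), a credible interval is read off directly from the sample quantiles. The snippet below is a sketch in which synthetic draws stand in for a real chain; the numbers are illustrative only.

    import numpy as np

    # Central 95 per cent credible interval from posterior samples (a sketch).
    # Synthetic draws stand in for a real MCMC chain for the temperature example.
    rng = np.random.default_rng(1)
    temperature = rng.normal(2000.0, 150.0, size=50000)   # stand-in posterior draws (K)
    lo, hi = np.percentile(temperature, [2.5, 97.5])      # credible-interval bounds
    print(f"T = {np.median(temperature):.0f} K, 95% credible interval [{lo:.0f}, {hi:.0f}] K")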

A prime motivation for the BIE project is the thesis that the power of expensive and large survey data sets is underutilised by targeting parameter estimation as the goal. To illustrate this, let us consider the second example above: estimating the scale length of a disk. A standard astronomical analysis might proceed as follows. One determines the posterior probability distribution for scale lengths for some subset of survey images. Alongside scale length, one determines other parameters such as luminosity, axis ratios, or inclinations, and possibly higher moments such as the asymmetry. The scale length with maximum probability becomes the best estimate and is subsequently correlated with some other parameter of interest, luminosity or asymmetry, say. Then, any correlation is interpreted in the context of theories of galaxy formation and evolution. Observe that, in the first step, one is throwing out much of the information implicit in the posterior distribution. In particular, the luminosity estimate is most likely correlated with the scale-length estimate. If one were to plot the posterior distribution in these two parameters, one might find that the distribution is elongated in the scale-length–asymmetry plane, possibly in the same sense as the putative correlation! In other words, the confidence in the hypothesis of a correlation should include the full posterior distribution of parameter estimates, not just the maximum probability estimate. See §4.2 and Figure 3 for a real-world example.

Moreover, this scenario suggests that one is using disk scale length and asymmetry as a proxy for testing a hypothesis about disk evolution or environment. These results might have been more reliable if the observational campaign had been designed to enable a


hypothesis test, not a parameter estimate, from the beginning. This leads naturally to the following question.

2.2 Which model or theory is correct?

This question is a critical one for the scientific method. Astronomers typically do not address it quantitatively but want to do so. I will separate the general question "which model is correct?" into two: 1) "does the model explain the data?", the goodness-of-fit problem; and 2) "which of two (or more) models better explains the data?", the model selection problem. Let us begin with (1) and discuss (2) in the next section.

Suppose one has performed a parameter estimation and determined the parameter region(s) containing a large fraction of the probability. Before making any conclusions from the application of a statistical model to a data set, an investigator should assess the fit of the model to make sure that the model can explain adequately the important aspects of the data set. Model checking, or assessing the fit of a model, is a crucial part of any statistical analysis. Serious misfit (failure of the model to explain important aspects of the data that are of practical interest) should result in the replacement or extension of the model. Even if a model has been assumed to be final, it is important to assess its fit to be aware of its limitations before making any inferences.

The posterior predictive check (PPC) is a commonly-used Bayesian model evaluation method (e.g. Gelman et al. 1995, Chap. 6). It is simple and has a clear theoretical basis. To apply the method, one first defines a set of discrepancy measures. A discrepancy measure, like a classical test statistic, measures the difference between an aspect of the observed data set and the theoretically predicted data set. Let M denote the model under consideration. Practically, a number of predicted data sets are generated from P(D|θ*, M) with θ* selected from the posterior distribution. Any systematic differences between the observed data set and the predicted data sets indicate a potential failure of the model to explain the data. For example, one may use the distribution of a discrepancy measure based on synthetic data generated from the posterior distribution to estimate a Bayesian p-value for the true data under the model hypothesis. The p-value in this context is simply the cumulative probability for the discrepancy statistic. A p-value in the tails of the predicted discrepancy-measure distribution suggests a poor fit to the data. By using a variety of different discrepancy statistics, one's understanding of how the model does not fit the data is improved. See §3.5.1 for more detail.
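
The posterior predictive check reduces to a short loop once one can draw θ* from the posterior and simulate data under the model. A hedged sketch follows; the function names (simulate, discrepancy) are placeholders for the user's model and chosen discrepancy measure, not BIE routines.

    import numpy as np

    def ppc_pvalue(posterior_draws, simulate, discrepancy, observed):
        """Bayesian p-value from a posterior predictive check (a sketch).

        posterior_draws : iterable of theta* drawn from P(theta|D, M)
        simulate(theta) : returns a synthetic data set drawn from P(D|theta, M)
        discrepancy(D)  : scalar discrepancy measure of a data set
        Returns the tail probability of the observed discrepancy under the
        posterior predictive distribution of the discrepancy measure.
        """
        d_obs = discrepancy(observed)
        d_rep = np.array([discrepancy(simulate(theta)) for theta in posterior_draws])
        return float(np.mean(d_rep >= d_obs))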

Another approach attempts to fit a non-parametric model to the data. If the non-parametric model better explains the data than the fiducial model, one rejects the fiducial model as a good fit. A procedure for assessing the model families will be described in the next section. A naive implementation of this idea is difficult, requiring a second high-dimensional MCMC simulation to infer the posterior distribution for the non-parametric model and a careful specification of the prior distribution. A clever scheme for doing this (Verdinelli & Wasserman 1998) is described in §3.5.2.

2.3 Model selection and Bayes factors

One often has doubts about one's parametric models, even those that fit. This is especially true when the models are phenomenological rather than the results of first-principle theories. Therefore, one needs to estimate which competing model better represents the data. This approach has been applied to good advantage in interpreting the spatial fluctuations in the cosmic microwave background radiation. The Bayesian model selection approach has been shown to efficiently answer questions such as: what is the best-fitting combination of baryonic, dark matter, and dark energy components? See Trotta (2008) for a review.

Astronomers are becoming better versed in the more traditional statistical rejection tests, but astronomers often really want acceptance tests. Bayes factors provide this: one can straightforwardly evaluate the evidence in favour of the null hypothesis rather than only test evidence for rejecting it. Bayes factors are the dominant method for Bayesian model selection and are analogous to likelihood ratio tests (e.g. Jeffreys 1961; Gelman et al. 1995; Kass & Raftery 1995). Rather than using the posterior extremum, one marginalises over the parameter space to get the marginal probability of the data under each model or hypothesis. The ratio of the likelihood functions marginalised over the prior distributions provides evidence in favour of one model specification over another. In this way, the Bayesian approach naturally includes, requires in fact, that one's prior knowledge of the model and its uncertainties be included in the inference. Although this dependence on the prior probability is sometimes criticised as a flaw in the Bayesian approach, I believe this is as it should be: one's prior belief will invariably influence one's interpretation of a statistical finding. The Bayesian framework allows the scientist to describe and incorporate prior beliefs quantitatively. In addition, the method demands that the scientist thoughtfully characterise prior assumptions to start; such discipline will improve the quality of any scientific conclusions.

Bayes factors are very flexible, allowing multiple hypotheses to be compared simultaneously or sequentially. The method selects between models based on the evidence from data without the need for nesting². The posterior probability for competing models can be evaluated over an ensemble of data and used to decide whether or not a particular family of models should be preferred. Similarly, common parameters can be evaluated over a field of competing models with appropriate posterior model probabilities assigned to each. A tutorial illustrating this can be found in the BIE documentation.

Mathematically, Bayes factors follow from applying Bayes Theorem to a space of models or hypotheses. Let P(M) be our prior belief in Model M, and let P(D|M) be the probability of observing D under the assumption of Model M. Bayes Theorem tells us that the probability of Model M given the observed D is

P(M|D) = P(M) P(D|M) / P(D)   (2)

where P(D) is some unknown normalisation constant. However, one may use equation (2) to compute the relative probability of two competing models, M1 and M2:

P(M1|D) / P(M2|D) = [P(M1) / P(M2)] × [P(D|M1) / P(D|M2)]   (3)

without reference to the unknown normalisation. The left-hand side of equation (3) may be interpreted as the posterior odds ratio of Model 1 to Model 2. Similarly, the first term on the right-hand side is the prior odds ratio. The second term on the right-hand side is called the Bayes factor. Most often, one does not assert a preference for either model and assigns unity to the prior odds ratio.

To define Bayes factors explicitly in terms of the posterior distribution, suppose that one observes data D; these may comprise many observations or multiple sets of observations.

² Two models are nested if they share the same parameters and one of them has at least one additional parameter.


Table 1. Jeffreys' table

  log B12     B12          Strength of evidence
  < 0         < 1          Negative (supports M2)
  0 to 1/2    1 to 3.2     Barely worth mentioning
  1/2 to 1    3.2 to 10    Positive
  1 to 2      10 to 100    Strong
  > 2         > 100        Very strong

One wishes to test two competing models (or hypotheses) M1 and M2, each described by its own set of parameters, θ1 and θ2. One would like to know which of the following likelihood specifications is better: M1: L1(D|θ1) or M2: L2(D|θ2), given the prior distributions π1(θ1) and π2(θ2) for θ1 and θ2. The Bayes factor B12 is given by

B12 = P(D|M1) / P(D|M2) = [∫ π1(θ1|M1) P1(D|θ1, M1) dθ1] / [∫ π2(θ2|M2) P2(D|θ2, M2) dθ2] .   (4)

If B12 > 1, the data indicate that M1 is more likely than M2, and vice versa. Harold Jeffreys (1961, App. B) suggested the often-used scale for interpretation of B12 in half-unit steps in log B12 (see Table 1). This provides a simple-to-use, easily discussed criterion for the interpretation of Bayes factors. Note that classical hypothesis testing gives one hypothesis (or model) preferred status (the null hypothesis) and only considers evidence against it; the Bayes factor approach is considerably more general.
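
As a small worked illustration, the helper below maps the base-10 logarithm of a Bayes factor onto the descriptive labels of Table 1; the marginal likelihoods Z1 and Z2 in the usage comment are assumed to come from equation (4) or from the methods of Section 3.2.

    import math

    def jeffreys_strength(log10_B12):
        """Descriptive strength of evidence for B12, following Table 1 (log10 scale)."""
        if log10_B12 < 0.0:
            return "Negative (supports M2)"
        if log10_B12 < 0.5:
            return "Barely worth mentioning"
        if log10_B12 < 1.0:
            return "Positive"
        if log10_B12 < 2.0:
            return "Strong"
        return "Very strong"

    # Example: given marginal likelihoods Z1 and Z2 for the two models,
    #   print(jeffreys_strength(math.log10(Z1 / Z2)))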

Given all of these advantages, why are Bayes factors not more commonly used? There are two main difficulties. First, multidimensional integrals are difficult to compute. Following equation (4), one needs to evaluate an integral of the form P(D) = ∫ π(θ) P(D|θ) dθ. For a real-world model, the dimensionality of θ is likely to be > 10. Such a quadrature is infeasible using standard techniques. On the other hand, a typical MCMC calculation has generated a large number of evaluations of the integrand at considerable expense. Can one use the posterior sample to evaluate the integral?

Raftery (1995) suggests a Laplace-Metropolis estimator that uses the MCMC posterior simulation to approximate the marginal density of the data using Laplace's approximation (see Raftery op. cit. for details). In practice, this is only accurate for nearly Gaussian (or normal, eq. B1) unimodal posterior distributions. As part of the BIE development, Weinberg (2012) described two new approaches for evaluating the marginal likelihood from the MCMC-generated posterior sample and both of these are implemented in the BIE as secondary analysis routines (see §3.2). In short, I believe that the BIE together with recent advances for computing the marginal likelihood makes the wholesale computation of Bayes factors feasible in many cases of interest.

A second well-known difficulty is the sensitivity of Bayes factors to the choice of prior. Most commonly, researchers feel that vague priors are more appropriate than informative priors. This leads to an inconsistency known as the Jeffreys–Lindley paradox (Lindley 1957), which shows that vague priors result in overwhelming odds for or against a hypothesis by varying the parameter that controls the vagueness (e.g. extending the range of an arbitrary uniform distribution). This apparent problem has led researchers to seek Bayesian hypothesis tests that are less sensitive to their prior distributions. Conversely, I argue that one should not expect a vague prior to yield a sensible model comparison. Rather, the prior should be used to express prior belief in a theory and, therefore, the resulting hypothesis test should be sensitive to the prior. This sensitivity implies that the theory implicit in the model is informed by one's background knowledge. Nonetheless, the prior knowledge is difficult to quantify and I would still advocate testing a variety of prior distributions consistent with one's prior knowledge. This may be tested through direct sensitivity analyses, such as resimulation with chains at different resolutions and approximate priors.

Regardless of one's viewpoint, I believe that the BIE project currently provides a useful platform for investigating the use of Bayesian model comparison and hypothesis testing and hope that it will help pave the way for new applications. In some cases, computing the Bayes factor will be infeasible. For these, the BIE includes an MCMC algorithm that selects between models as part of the posterior simulation (reversible jump) as described in §3.3.

2.4 Observational requirements

The probability of the data given the parameter vector and the model, P(D|θ, M), or the likelihood function, is fundamental to any inference, Bayesian or otherwise. Meaningful inferences demand that the data presentation include all of the information necessary for the modeller to compute P(D|θ, M) accurately and precisely. The more direct the construction of P(D|θ, M) from the physical theory, i.e. the less information lost in modelling the acquired observations, the easier it is to calculate P(D|θ, M), leading to a higher-quality result. In other words, the more the data are reduced through summary statistics and "cleaned" by applying complex filters, the less information remains and the greater the effect of difficult-to-model correlations. This is somewhat contrary to standard practice.

In addition, astronomers often quote their error models in the form of uncorrelated standard errors. The customary expectation is that each datum, typically a data bin or pixel, should be within the range specified by the error bar most of the time. Quoted error bars are often inflated to make this condition obtain. This leads to a number of fundamental flaws that make the error model (and therefore the data) unsuitable for Bayesian inference:

(i) Binned and pixelated data are nearly always correlated. For example, a flat-field photometric correction correlates the pixels of an image over its entire scale. Sky brightness removal has similar effects. There are many additional sources of indirect correlations. Parameter estimations are often sensitive to these correlated excursions in the data values and ignoring these correlations will lead to erroneous inferences. Data archivists can facilitate accurate inferences by providing correlation matrices for all error models.

(ii) Selection effects must be modeled in the likelihood function and, therefore, these effects must be well specified by the archivist to facilitate straightforward computation. For example, consider a multiband flux-limited source catalogue. A colour-magnitude or Hess diagram in two flux bands will have a non-rectangular boundary owing to the flux limit. Although this is a simple example, selection effects may be terribly difficult to model; consider spatial variations in source completeness owing to the diffraction spikes from bright stars.

(iii) Astronomers tend to use historically familiar summary data representations that inadvertently complicate the computation of P(D|θ, M). Continuing the previous example, the magnitude-magnitude diagram contains the same information as the colour-magnitude diagram but the selection effects lie along flux-level boundaries. For a more complicated example, consider the Tully-Fisher diagram. The input data set may contain flux limits,


morphology selections, image inclination cuts, and redshift range limits, just to name a few.

In summary, data processing and reduction correlate the data representation and complicate the computation of P(D|θ, M). This renders the modeling process difficult and potentially unreliable. Rather, one should endeavour to separate each source of error, carefully specifying the underlying acquisition process for every observational campaign. For example, each pixel datum in a digital image may be characterised by the observed data number (DN), gain and bias, read noise, thermal background fraction, etc. Together, their combination yields an error model. Even if a first-principle process cannot be described, an empirical distribution or process description will be helpful to modelers. For example, a distribution of deviations for measured pixel values relative to a reference calibration field may be used as an error model. Ideally, the data representation should be as close to the acquired form as possible. In cases where the archiving or presentation of source data is impractical, the production of a correlation matrix is essential.

The effect of data correlation has been explored by Lu et al. (2011). They describe the parameter inference for a semi-analytic model of galaxy formation conditioned on a galaxy mass function with both correlated and uncorrelated data bins. The differences in the posterior distributions for these two cases are dramatic. When the error model is in doubt, the sensitivity of the inference to the error model can be investigated in the Bayesian paradigm by putting prior distributions on parameters of the error models that describe their uncertainty, marginalising over those hyperparameters, and comparing with the original posterior. Although this is more expensive and rarely done, one should consider performing such sensitivity analyses regularly.

3 SOLUTIONS PROVIDED BY THE BIE

Assume that one has identified the observational selection effects, specified the prior distributions, and constructed P(D|θ, M). How does one compute the posterior distribution defined by equation (1) using MCMC?

3.1 Computing the posterior distribution

This section presents four MCMC sampling algorithms of increasing complexity included in the BIE. I begin with a description and some motivation for the standard Metropolis-Hastings algorithm. This simple, easy-to-implement algorithm often fails to converge and is difficult to tune for complicated high-dimensional distributions. The next three sections introduce modifications that circumvent these pitfalls.

3.1.1 The Metropolis-Hastings algorithm

Metropolis-Hastings is the most well-known of MCMC algorithms (Metropolis et al. 1953; Hastings 1970). This algorithm constructs a Markov chain that generates states from the P(θ|D, M) distribution after a sufficient number of iterations. Its success requires that the Markov chain satisfy both an ergodicity and a detailed balance condition. The ergodicity condition ensures that at most one asymptotic distribution exists. Ergodicity would fail, for example, if the chain could cycle back to its original state after a finite number of iterations. Ergodicity also requires that all states with positive probability be visited infinitely often in infinite time (called positive recurrence). For general continuous state spaces, ergodicity is readily achieved. The detailed balance condition ensures that the chain admits at least one asymptotic distribution. Just as in kinetic theory or radiative transfer, one defines a transition process that takes an initial state to a final state with some probability. If the Markov chain samples the desired target distribution P(θ) = P(θ|D, M), detailed balance demands that the rate of transitions θ → θ′ is the same as the rate of transitions from θ′ → θ.

One may state this algorithm explicitly as follows. Let P(θ) be the desired distribution to be sampled and q(θ, θ′) be a known, easy-to-compute transition probability between two states. Given θ, the distribution q(θ, θ′) is a probability distribution for θ′. Let a(θ, θ′) be the probability of accepting state θ′ given the current state θ. In short, one can show that if the detailed balance condition

P(θ) q(θ, θ′) a(θ, θ′) = P(θ′) q(θ′, θ) a(θ′, θ)   (5)

holds, then the Markov chain will sample P(θ). It is straightforward to verify by substitution that the acceptance probability

a(θ, θ′) = min{ 1, [P(θ′) q(θ′, θ)] / [P(θ) q(θ, θ′)] }   (6)

solves this equation (for additional discussion see Liu 2004). Equation (5) has the same form as well-known kinetic rate equations, as follows. Given the probability of a transition over some time interval from a state A to some other state B of a physical system and the corresponding reverse reaction, then the equilibrium condition for N = N_A + N_B systems distributed in the two states is:

N_A P(A → B) = N_B P(B → A).

In equation (5), the probability densities P(θ) and P(θ′) play the role of the occupation numbers N_A and N_B and the product of the transition and acceptance probabilities play the role of the probabilities P(A → B) and P(B → A).

The transition probability q(θ, θ′) is often chosen to facilitate the generation of θ′ from θ. Metropolis et al. (1953) introduced a kernel-like transition probability q(θ, θ′) = q(θ − θ′) where q(·) is a density. This has the easy-to-use property of generating θ′ = θ + ξ where ξ ∼ q. (Here and throughout, the notation θ ∼ P(θ) expresses that the distribution function of the variate θ is P(θ).) Further, if q is symmetric, i.e. q(z) = q(−z), then equation (6) takes the simple form

a(θ, θ′) = min{ 1, P(θ′)/P(θ) } .   (7)

The BIE provides two symmetric distributions for q: a multivariate normal and a uniform or top-hat distribution. Each element of θ is scaled by a supplied vector of widths, w. The choice of w is critical to the performance of the algorithm. If the width elements are too large, P(θ′)/P(θ) will tend to be very small and proposed states will rarely be accepted. Conversely, if the width elements are too small, the new state frequently will be accepted and successive states will be strongly correlated. Either extreme leads to a process that is slow to reach equilibrium. The optimal choice is somewhere in between the two. As the dimensionality of the parameter space grows, specifying the optimal vector of widths a priori is quite difficult. I will address this difficulty in §3.1.4.
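
A minimal random-walk Metropolis sampler along these lines is sketched below in Python. It is not the BIE's (C++) implementation; it simply realises equation (7) with a Gaussian proposal scaled by the width vector w discussed above.

    import numpy as np

    def metropolis(log_p, theta0, widths, nsteps, rng=None):
        """Random-walk Metropolis with a symmetric Gaussian proposal (eq. 7).
        `log_p` is the log of the (unnormalised) target; `widths` is the vector w."""
        rng = np.random.default_rng() if rng is None else rng
        theta = np.asarray(theta0, dtype=float)
        lp = log_p(theta)
        chain = np.empty((nsteps, theta.size))
        for i in range(nsteps):
            prop = theta + widths * rng.standard_normal(theta.size)
            lp_prop = log_p(prop)
            if np.log(rng.uniform()) < lp_prop - lp:     # accept with prob min(1, P'/P)
                theta, lp = prop, lp_prop
            chain[i] = theta
        return chain

    # e.g. chain = metropolis(lambda t: -0.5*np.sum(t**2), np.zeros(2), 0.5*np.ones(2), 20000)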

3.1.2 Tempered transitions

In addition to the inherent difficulties associated with tuning the transition probability, the Metropolis-Hastings state can easily be trapped in isolated modes, between which the Markov chain moves


only rarely. This prevents the system from achieving detailed balance and, thereby, prevents sampling from the desired target distribution P(θ). There are a number of techniques for mitigating this so-called mixing problem. For the BIE, I adopted a synthesis of Metropolis-coupled Markov chains (Geyer 1991) and a simulated tempering method proposed by Neal (1996) called tempered transitions. To sample from a distribution P_0(θ) ≡ P(θ) with isolated modes, one defines a series of n other distributions, P_1(θ), . . . , P_n(θ), with P_k being easier to sample than P_{k−1}. For example, one may choose

P_k(θ) ∝ [P_0(θ)]^{β_k}   (8)

with 1 = β_0 > β_1 > · · · > β_{n−1} > β_n > 0. This construction has a natural thermodynamic interpretation. One may write [P_0(θ)]^{β_k} = e^{β_k log P_0} ≡ e^{log(P_0)/T_k}. The distribution with temperature T_0 = T = 1 is the original distribution. Hotter distributions have higher temperature T_k > T_0 and are over-dispersed compared with the original cold distribution. In other words, taking a distribution function to a small fractional power decreases the dynamic range of its extrema. In the limit T_k → ∞, P_k becomes uniform. Next, the method defines a pair of base transitions for each k, T̂_k and Ť_k, which both have P_k as an invariant distribution and satisfy the following mutual reversibility condition for all θ and θ′: P_k(θ) T̂_k(θ, θ′) = Ť_k(θ′, θ) P_k(θ′).

A tempered transition first finds a candidate state by applying the base transitions in the sequence T̂_1 · · · T̂_n. After each upward transition, new states are sampled from a broader distribution. In most cases, this liberates the candidate state from confinement by the mode of the initial state. This is then followed by a series of downward transitions Ť_n · · · Ť_1. This candidate state is then accepted or rejected based on ratios of probabilities involving intermediate states. Thermodynamically, each level k corresponds to an equilibrium distribution at temperature T_k. Therefore, the upward or downward transitions correspond to heating or cooling the system, respectively. One chooses the maximum temperature to be sufficiently high to melt any structure in the original cold posterior distribution that would inhibit mixing.

Explicitly, the algorithm proceeds as follows. The chain begins in state θ_0 and obtains the candidate state, θ̌_0, as follows. Set θ̂_0 = θ_0. For j = 0 to n − 1, generate θ̂_{j+1} from θ̂_j using T̂_{j+1}. Set θ̌_n = θ̂_n. Then, for j = n to 1, generate θ̌_{j−1} from θ̌_j using Ť_j. The candidate state, θ̌_0, is then accepted with probability

a ≡ min[ 1, (P_1(θ̂_0)/P_0(θ̂_0)) · · · (P_n(θ̂_{n−1})/P_{n−1}(θ̂_{n−1})) × (P_{n−1}(θ̌_{n−1})/P_n(θ̌_{n−1})) · · · (P_0(θ̌_0)/P_1(θ̌_0)) ] .   (9)

Although equation (9) looks complicated, the derivation in Neal (1996) shows that it immediately follows by recursively applying equations (5) and (6). Then, owing to the mutual reversibility condition, the T̂ and Ť dependence in equation (9) cancels. If the candidate state is not accepted, the next state of the Markov chain is the same as the original state, θ_0. In practice, I ‘burn-in’ the chain at each level j for M iterations, with M = 20 typically. Note that each P_i occurs an equal number of times in the numerator and denominator of equation (9). Therefore, the acceptance probability can be computed without knowledge of the normalisation constants for these distributions. If the acceptance probability is to be reasonably high, properly-spaced intermediate distributions will have to be provided that gradually interpolate from P_0 to P_n. Thermodynamically, this corresponds to an adiabatic increase of the heat-bath temperature followed by an adiabatic return to the original temperature. The maximum temperature T_n = 1/β_n should be large enough to break any barriers between modes at low temperature but not so large that P_n(θ) provides no constraint on the distribution of θ. Since this value depends on the features of P(θ) that are not known a priori, I choose T_n by checking the distribution of θ for P_n with various values of T_n. Our tests have shown that this algorithm works remarkably well for a variety of different inference problems.

Many readers will be familiar with the idea of simulated annealing (Kirkpatrick et al. 1983). Each step of a simulated annealing algorithm proposes to replace the current state by a state randomly constructed from the current state by a transition probability. The magnitude of the excursion allowed by the transition probability is controlled by a temperature-like parameter that is gradually decreased as the simulation proceeds. This prevents the state from being stuck at local maxima. The tempered transitions algorithm is heuristically similar to simulated annealing with the important additional property of obeying detailed balance.

3.1.3 Parallel tempering

The parallel tempering algorithm inverts the order of the previous algorithm: it simultaneously simulates n chains, each with target distribution P_j (eq. 8), and proposes to swap states between adjacent members of the sequence at predefined intervals. The high temperature chains are generally able to sample large volumes of parameter space, whereas low temperature chains may become trapped in local probability maxima. Parallel tempering achieves good sampling by allowing the systems at different temperatures to exchange states at very different locations in parameter space. The higher temperature chains often achieve detailed balance quickly and accelerate the convergence of the lower temperature chains. Thus, this method may allow a simulation to achieve detailed balance even in the presence of widely separated modes. In some situations, parallel tempering outperforms tempered transitions with a lower overall computational cost. In addition, the parallel tempering algorithm is trivially parallelised by assigning each chain its own process. The tempered transitions algorithm is intrinsically serial and parallel efficiency is only obtained if the likelihood computation is parallelisable.

The parallel tempering algorithm proceeds as follows. At each step, a pair of adjacent simulations in the series is chosen at random and a proposal is made to swap their parameter states, whose acceptance is determined using a Metropolis-Hastings criterion. Let the jth iterate of the state in the kth chain be denoted as θ_j^{[k]}. The swap is accepted with probability

a = min[ 1, { P_k(θ_j^{[k+1]} |D, M) P_{k+1}(θ_j^{[k]} |D, M) } / { P_k(θ_j^{[k]} |D, M) P_{k+1}(θ_j^{[k+1]} |D, M) } ] ,   (10)

where P_k(θ|D, M) is the posterior probability of θ given the data D for chain k and model assumptions M. Final results are based on samples from the β_0 = 1 chain. As in the tempered transitions algorithm, the high-temperature states will mix between separated modes more efficiently, and subsequent swapping with lower-temperature chains will promote their mixing.
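
The swap move itself is inexpensive; a sketch for the power-law ladder of equation (8) is shown below. Within-chain updates (e.g. Metropolis steps per chain) are omitted, and the names are illustrative rather than BIE interfaces.

    import numpy as np

    def pt_swap(states, log_p0, betas, rng):
        """Propose one swap between a random adjacent pair of tempered chains (eq. 10).
        Chain k targets P0**betas[k]; `states` is an (n, d) array of current states."""
        k = rng.integers(0, len(betas) - 1)                   # adjacent pair (k, k+1)
        lp_k, lp_k1 = log_p0(states[k]), log_p0(states[k + 1])
        log_a = (betas[k] - betas[k + 1]) * (lp_k1 - lp_k)    # log of the ratio in eq. (10)
        if np.log(rng.uniform()) < log_a:
            states[[k, k + 1]] = states[[k + 1, k]]           # exchange the two parameter states
        return states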

The analog thermodynamic system here is an array of systems with the same internal dynamics at different temperatures. At higher temperatures, strongly ‘forbidden’ states are likely to remain forbidden but valleys between multiple modes are likely to be more easily crossed. In contrast to tempered transitions (§3.1.2), the proposed transitions in parallel tempering are sudden exchanges of state between systems of possibly greatly different temperature. As with tempered transitions, the algorithm obeys the detailed balance equation.


3.1.4 Differential evolution

Real-world high-dimensional likelihood functions often have complex topologies with strong anisotropies about their maxima (see §4.1, Fig. 2). Difficulty in tuning the Metropolis-Hastings transition probability to achieve both a good acceptance rate and good mixing plagues high-dimensional MCMC simulations of the posterior probability. This problem affects all of the algorithms discussed up to this point. Recently, Ter Braak (2006) introduced an MCMC variant of a genetic algorithm called differential evolution (Price 1997; Storn & Price 1997; Storn 1999). This version of differential evolution uses an ensemble of chains, run in parallel, to adaptively compute the Metropolis-Hastings transition probability. In all that follows, I will refer to the MCMC variant as simply differential evolution.

The algorithm proceeds as follows. Assume that our ensemble has n chains to start, e.g., initialised from the prior probability distribution. Each chain has the same target distribution P(θ). The original differential evolution algorithm (Price 1997) proposes to update member i at iteration j as follows:

θ_p^{[j]} = θ_{R0}^{[j]} + γ (θ_{R1}^{[j]} − θ_{R2}^{[j]}),

where R0, R1, R2 are randomly selected without replacement from the set {1, 2, . . . , n}. The proposal vector replaces the chosen one if P(θ_p^{[j]}) > P(θ_i^{[j]}). Ter Braak (2006) shows that with minor modifications the transition probability and the acceptance condition for differential evolution obey detailed balance. The new MCMC version of the differential evolution algorithm takes the form

θ_p^{[j]} = θ_i^{[j]} + γ (θ_{R1}^{[j]} − θ_{R2}^{[j]}) + ε,

where ε is drawn from a symmetric distribution with a small variance compared to that of the target, but with unbounded support, such as a d-dimensional normal distribution (eq. B1) with very small variance σ²: ε ∼ N^d(0, σ²). The random variate ε is demanded by the recurrence condition: the domain of non-zero values of the posterior P must be reached infinitely often for an infinite-length chain. The proposal θ_p^{[j]} is accepted as the next state for chain i with probability

a = min[ 1, P(θ_p)/P(θ_i) ] .

In essence, differential evolution uses the variance between a population of chains whose distributions are converging to the target distribution to automatically tune the proposal widths. Although the transition probability distribution q(θ, θ′) does not have an analytic form in this application, the differential evolution algorithm enforces symmetry through the random choice of indices, and the distribution q(θ, θ′) clearly exists.
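
One sweep of the resulting sampler can be written compactly; the sketch below follows Ter Braak (2006) with the default scale γ = 2.38/√(2d) suggested in that paper. It is an illustration, not the BIE's parallel implementation, and it assumes at least three chains.

    import numpy as np

    def de_mc_sweep(chains, log_p, gamma=None, sigma=1e-4, rng=None):
        """One differential-evolution MCMC sweep over an ensemble of chains.
        `chains` is an (n, d) array of current states (n >= 3); `log_p` is the
        log of the (unnormalised) target density."""
        rng = np.random.default_rng() if rng is None else rng
        n, d = chains.shape
        gamma = 2.38 / np.sqrt(2.0 * d) if gamma is None else gamma
        for i in range(n):
            r1, r2 = rng.choice([k for k in range(n) if k != i], size=2, replace=False)
            prop = chains[i] + gamma * (chains[r1] - chains[r2]) \
                   + sigma * rng.standard_normal(d)           # the small symmetric jitter
            if np.log(rng.uniform()) < log_p(prop) - log_p(chains[i]):
                chains[i] = prop
        return chains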

This algorithm as stated above does not address the mixing problem. Ter Braak (2006) suggests including simulated tempering or simulated annealing in differential evolution. Along these lines, the BIE also includes a hybridised differential evolution which periodically performs a tempered transition step for all n chains in parallel. This provides an ensemble at each temperature for the upward and downward transitions. As in tempered transitions, we evolve each chain for M steps at each temperature level. A typical number of temperature levels is twenty, and therefore, the addition of tempering may slow the algorithm by an order of magnitude. Although tempered differential evolution is dramatically slower than the simple differential evolution, I have found it essential for achieving a converged posterior sample for many of our real-world astronomical inference problems. This method was applied to the Bayesian semi-analytic models described in Lu et al. (2011).

3.1.5 Summary: choice of an MCMC algorithm

I advocate performing a suite of preliminary simulations to explore the features of one's posterior distribution with various algorithms. My experience suggests that there is no single best MCMC algorithm for all applications. Rather, each choice represents a set of trade-offs: more elaborate algorithms with multiple chains, augmented spaces, etc., are more expensive to run but may be the only solution for a complex posterior distribution. Conversely, an elaborate algorithm would be wasteful for simulating a simple posterior distribution. Moreover, combinations of MCMC algorithms in multiple-chain schemes are often useful.

For distributions with complex topologies, differential evolution relieves the scientist of the task of hand selecting a transition probability by trial and error. This method has the advantage of added efficiency: states from all chains in a converged simulation provide valid posterior samples. However, this strategy may backfire if the posterior is strongly multimodal because differential evolution requires multiple chains in each mode to enable mixing between modes. Any single chain in an isolated mode will remain there forever. Parallel chains and similar algorithms do not have this problem. Although high-temperature chains from tempered methods do not sample the posterior distribution, they do provide useful information for importance sampling. Yoon et al. (2012) have productively used high-temperature samples for Monte Carlo integration.

3.2 Computation of Bayes factors and marginal likelihoods

As described in §2.3, the marginal likelihood plays a key role in Bayesian model selection. There are several common strategies for computing the marginal likelihood. The simplest is direct quadrature using multidimensional cubature algorithms (e.g. Berntsen et al. 1991). Computational complexity limits its application to approximately four or fewer dimensions. Secondly, one may approximate the integrand around each well-separated mode as a multivariate normal distribution and integrate the resulting approximation analytically. This is the Laplace approximation. It suits simple unimodal densities of approximately Gaussian shape, but the posterior distributions for many real-world problems are far from Gaussian. Finally, one may rewrite Bayes theorem as an expression that evaluates the normalisation constant from the posterior sample as the harmonic mean of the likelihood function, as will be shown below. In short, none of the three suffices in general: direct quadrature is most often computationally infeasible, the Laplace approximation works well only for simple posterior distributions, and the harmonic mean approximation often has enormous variance owing to its inverse weighting by the likelihood value (see Kass & Raftery 1995). To help address this lack, Weinberg (2012) presents two computationally-modest families of quadrature algorithms that use the full posterior sample but without the instability of the harmonic mean approximation (Newton & Raftery 1994) or the specificity of the Laplace approximation (Lewis & Raftery 1997).
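
For reference, the Laplace approximation mentioned above can be sketched in a few lines: locate the posterior mode, form the Hessian of the negative log posterior there, and apply the Gaussian-integral formula log Z ≈ log[π(θ̂)L(D|θ̂)] + (d/2) log 2π − (1/2) log det H. The code below uses a crude finite-difference Hessian and generic scipy optimisation; it illustrates the approximation only and is not one of the BIE's estimators.

    import numpy as np
    from scipy import optimize

    def laplace_log_evidence(log_post, theta0, h=1e-5):
        """Laplace approximation to log Z from the unnormalised log posterior
        log_post(theta) = log[pi(theta) L(D|theta)] (a sketch)."""
        d = len(theta0)
        mode = optimize.minimize(lambda t: -log_post(t), theta0).x   # posterior mode
        H = np.zeros((d, d))                                         # Hessian of -log_post
        I = np.eye(d) * h
        for i in range(d):
            for j in range(d):
                H[i, j] = -(log_post(mode + I[i] + I[j]) - log_post(mode + I[i] - I[j])
                            - log_post(mode - I[i] + I[j]) + log_post(mode - I[i] - I[j])) / (4 * h * h)
        sign, logdet = np.linalg.slogdet(H)                          # expect sign > 0 at a mode
        return log_post(mode) + 0.5 * d * np.log(2.0 * np.pi) - 0.5 * logdet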

The first algorithm begins with the normalised Bayes theorem:

Z × P(θ|D) = π(θ) L(D|θ)   (11)

where

Z ≡ ∫_Ω dθ P(θ|D) = ∫_Ω dθ π(θ) L(D|θ)   (12)

normalises P(θ|D) (as in eq. 1). The quantity Z is called the normalisation constant or marginal likelihood depending on the context.


Dividing by L(D|θ) and integrating over θ, we have

Z × ∫_Ω dθ P(θ|D) / L(D|θ) = ∫_Ω dθ π(θ).   (13)

Since the Markov chain samples the posterior, P(θ|D), the computation of the integral on the left from the chain appears as an inverse weighting with respect to the likelihood. This is poorly conditioned owing to the inevitable small values of L(D|θ). However, if the integrals in equation (13) are dominated by the domain sampled by the chain, the integrals can be approximated by quadrature over a truncated domain, Ω_s, that eliminates the small number of the chain states with low L(D|θ). More precisely, the integral on the left-hand side may be cast in the following form:

∫_{Ω_s} dθ P(θ|D) / L(D|θ) = ∫ dY M(Y).   (14)

This integral will be a good approximation to the original if the measure function defined by

M(Y) = ∫_{1/L(D|θ) > Y} dθ P(θ|D)   (15)

decreases faster than L as L → 0. Otherwise, the integral in equation (14) does not exist and the first algorithm cannot be used; see Weinberg (2012) for details. Intuitively, one may interpret this construction as follows: divide up the parameter space θ ∈ Ω_s into volume elements sufficiently small that P(θ|D) is approximately constant within each volume element. Then, sort these volume elements by their value of Y(D|θ) ≡ L^{−1}(D|θ). The probability element dM ≡ M(Y + dY) − M(Y) is the prior probability of the volume between Y and Y + dY. However, if the truncated volume forms the bulk of the contribution to equation (14), the evaluation will be inaccurate.

To evaluate the r.h.s. of equation (13), one may use the sampled posterior distribution itself to tessellate the sampled volume in Ω_s ⊂ Ω. This may be done straightforwardly using a space-partitioning structure. A binary space partition (BSP) tree, which divides a region of parameter space into two exclusive sub-regions at each node, is particularly efficient. The most easily implemented tree of this type for arbitrary dimension is the kd-tree (short for k-dimensional tree). The kd-tree algorithms split R^k on planes perpendicular to one of the coordinate system axes. The implementation provided for the BIE uses the median value along one of the axes (a balanced kd-tree). I have also implemented a hyper-octree. The hyper-octree generalises the octree by splitting each n-dimensional parent node into 2^n hypercubic children. Unlike the kd-tree, the hyper-octree does not split on point location and the size of the cells is not strictly coupled to the number of points in the sample. In addition, the cells in the kd-tree might have extreme axis ratios but the cells in the hyper-octree are hypercubic. This helps provide a better representation of the volume containing sample points. See Weinberg (2012) for additional details, tests, and discussion. Approximate tessellations also may be useful. For example, the nearest neighbour to every point in a sample could be used to circumscribe each point by a sphere of maximum volume such that all spheres are non-overlapping. Comparisons and performance details will be reported in a future contribution (Yoon et al. 2012).
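
A bare-bones version of this idea, a balanced kd-tree built on the posterior sample whose leaf-cell volumes are used to integrate a function over Ω_s, is sketched below. It approximates ∫_{Ω_s} f(θ) dθ as the sum over leaf cells of cell volume times the mean of f at the sample points in each cell; it is a toy illustration of the construction, not the BIE's tree code.

    import numpy as np

    def kd_integral(samples, f, bounds, leaf_size=32, depth=0):
        """Approximate the integral of f over the region tessellated by `samples`
        using a balanced kd-tree (a sketch).  `samples` is (N, d); `bounds` is a
        (d, 2) array of [lo, hi] for the current cell; `f` is vectorised."""
        n, d = samples.shape
        axis = depth % d
        median = np.median(samples[:, axis])
        left_mask = samples[:, axis] <= median
        if n <= leaf_size or left_mask.all() or not left_mask.any():
            vol = np.prod(bounds[:, 1] - bounds[:, 0])        # leaf-cell volume
            return vol * np.mean(f(samples))                  # volume times mean integrand
        lo_b, hi_b = bounds.copy(), bounds.copy()
        lo_b[axis, 1] = median                                # split the cell at the median
        hi_b[axis, 0] = median
        return (kd_integral(samples[left_mask], f, lo_b, leaf_size, depth + 1) +
                kd_integral(samples[~left_mask], f, hi_b, leaf_size, depth + 1))

    # e.g. the r.h.s. of eq. (13), with prior_pdf a vectorised prior density:
    #   kd_integral(chain, prior_pdf, np.column_stack([chain.min(0), chain.max(0)]))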

For cases where the integral in (14) does not exist or the first algorithm provides a poor approximation, Z may be evaluated directly using the second computational approach. Begin by integrating equation (11) over Ω_s ⊂ Ω:

Z × ∫_{Ω_s} dθ P(θ|D) = ∫_{Ω_s} dθ π(θ) L(D|θ).   (16)

The Monte Carlo evaluation of the integral on the left-hand side is simply the fraction of sampled states in Ω_s relative to the entire sample: F_{Ω_s} ≡ Σ_{θ_i ∈ Ω_s} 1 / Σ_{θ_i ∈ Ω} 1. The integral on the right-hand side may be evaluated using the space-partitioning procedure described above. Altogether, then, one has

Z = F_{Ω_s}^{−1} ∫_{Ω_s} dθ π(θ) L(D|θ),   (17)

where Ω_s is ideally chosen to avoid regions of very low posterior probability. This method has no problems of existence for proper probability densities.

There are several sources of error in this space partition. For a finite sample, the variance in the tessellated parameter-space volume will increase with increasing volume and decreasing posterior probability. This variance may be estimated by bootstrap. As usual, there is a variance–bias trade-off in choosing the resolution of the tiling: the bias of the probability value estimate increases and the variance decreases as the number of sample points per volume element increases. Some practical examples suggest that the resulting estimates are not strongly sensitive to the number of points per cell.

In summary, the choice between the various algorithms depends on the problem at hand. The Laplace approximation may be a good choice for posterior distributions that are unimodal with light tails, but this is often not the case for real-world problems. I investigate the performance of the algorithms in Weinberg (2012) and their behaviour for high-dimensional distributions in Yoon et al. (2012). To date, I have reliably evaluated Z for n ≤ 14 using MCMC-generated samples of approximately 10^6 points with auto-correlation lengths of approximately 20. Additional performance details and tests in astronomical applications are described in Yoon et al. (2012).

3.3 Dimension switching algorithms

When generating the large sample sizes necessary for an accurate computation of the marginal likelihood is impractical, one may propose a number of different models and choose between them as part of the Monte Carlo sampling process. This is done by adding a discrete indicator variable to the state to designate the active model. The resulting state space consists of a discrete range for the indicator and of continuous ranges for each of the parameters in each model. Green (1995) showed that the detailed balance equation can be formulated in such a general state space. This allows one to propose models of different dimensionality and thereby incorporate model selection into the probabilistic simulation itself. The algorithm requires a transition probability to and from each subspace (Green 1995).

For example, suppose one has an image of a galaxy field that one would like to model with some unknown number of distinct galaxies k ∈ [1, n]. The extended sample space, then, consists of n subspaces, each of which contains the parameter vectors for the k galaxies in that subspace. Suppose that the current state is in the k = 3 subspace; that is, the current model has three galaxies. With some predefined probability at each step, p(3, 4), the algorithm proposes a transition to k = 4 galaxies by splitting one of the three galaxies into two separate but possibly blended components. Similarly, with some predefined probability at each step, p(4, 3), one defines the reverse transition by combining two adjacent components into one component. Finally, no subspace transition is proposed with the probability p(3, 3) = 1 − p(3, 4) − p(4, 3). For model comparison, an estimate of the marginal probability for each model k follows directly from the occupation frequency in each subspace of the extended state space.


To make this explicit following Green (1995), one first defines reversible transitions between models in different subspaces, say i and j. This is accomplished by proposing a bijective function g_ij that transforms the parameters between subspaces, g_ij(θ^[i], φ^[i]) = (θ^[j], φ^[j]), and enforces the dimensional matching condition d(θ^[i]) + d(φ^[i]) = d(θ^[j]) + d(φ^[j]), where the operator d(·) returns the rank of the vector argument. The parameter vector φ^[·] is a random quantity used in proposing changes in the components and for choosing additional components when going to a higher dimension. The rank of φ^[·] may be zero. For example, if d(θ^[i]) = 2 and d(θ^[j]) = 1, then one may define d(φ^[i]) = 0 and d(φ^[j]) = 1. In other words, for the purposes of inter-dimensional transitions, each subspace is augmented by random variates with the constraint that the dimensionality of the augmented spaces match.

Then, if q_ij(θ^[i], φ^[i]) is the probability density for the proposed transition and p(i, j) is the probability to move from subspace i to subspace j, the acceptance probability may be written as

α_ij(θ^[i], θ^[j]) = min{ 1, [P_j(θ^[j]|D) p(j, i) q_ji(θ^[j], φ^[j])] / [P_i(θ^[i]|D) p(i, j) q_ij(θ^[i], φ^[i])] × |∂(θ^[j], φ^[j]) / ∂(θ^[i], φ^[i])| },    (18)

where the final term in the second argument of min{·} is the Jacobian of the mapping between the augmented spaces, and P_j(θ^[j]|D) is the posterior probability density for the model in subspace j. The probability densities p(i, j) and q_ij are selected based on prior knowledge of the problem and to optimise the overall rate of convergence. The algorithm can be summarised as follows. Assume that the state at iteration i is in subspace n_i with parameter vector θ^[n_i]_i. One proposes a new state as follows:

(i) Choose a new model j by drawing it from the distribution p(n_i, ·). Propose a value for the parameter θ^[j] by sampling φ^[n_i] from the distribution q_{n_i j}(θ^[n_i]_i, φ^[n_i]).

(ii) Accept the move with probability α_{n_i j}(θ^[n_i], θ^[j]).

(iii) If the move is accepted, let n_{i+1} = j and θ^[n_{i+1}]_{i+1} = θ^[j].

(iv) If the move is not accepted, stay in the current subspace: n_{i+1} = n_i and θ^[n_{i+1}]_{i+1} = θ^[n_i]_i.

Green (1995) named this algorithm Reversible Jump Markov chain Monte Carlo (RJMCMC). One often chooses to interleave RJMCMC steps with some number of standard Metropolis-Hastings steps to improve the mixing in each subspace.

Within the constraints, the transitions are limited only by one's inventiveness. However, I have found that the most successful transitions are intuitively motivated by features of the scientific problem. Although RJMCMC enables model selection between arbitrary families of models of varying dimensionality, the convergence of the MCMC simulation depends on the specification of the transition probabilities and a careful tuning of the MCMC algorithm. In addition, I have not tried RJMCMC with multiple-chain algorithms that propose transitions between chains. This appears to be formally sound, but the requirement that the transitioning chains be in the same subspace may yield very long mixing times. Similarly, the differential evolution algorithm (see §3.1.4) would require that each subspace of interest be populated by some number of chains at all times.

Table 2. RJMCMC applied to the images in Figure 1

Model   A3/A2   A3/Atotal   p(2)†    p(3)     p(4)
1       1.0     0.167       0.0      0.9997   0.0003
2       0.5     0.091       0.0      0.9997   0.0003
3       0.3     0.057       0.444    0.556    0.0003
4       0.2     0.038       0.963    0.036    0.001
5       0.0     0.0         0.998    0.002    0.0

† p(m) is the probability of states in the subspace with m components. For these simulations, p(1) = p(5) = p(6) = 0.

3.3.1 Example: a simple transition probability

Consider two models: Model 1 with two real parameters, θ^[2] : (θ1, θ2) ∈ R^2, and Model 2 with one real parameter, θ^[1] : θ ∈ R. Let us assume that the prior probability of transition between the models is p(1, 2) = p(2, 1) = 1/2 and adopt the transformation between the subspaces g_12(θ1, θ2) = [(θ1 + θ2)/2, (θ1 − θ2)/2] = (θ, φ). The variable φ is distributed as q(φ). The inverse transformation is g_21(θ, φ) = (θ + φ, θ − φ). Therefore, given θ, one draws φ from q(φ) and immediately obtains (θ1, θ2). The acceptance probability for (θ1, θ2) → θ becomes

α_21 = [π_1(θ) q(φ) / π_2(θ1, θ2)] |∂(θ, φ)/∂(θ1, θ2)| = π_1((θ1 + θ2)/2) q((θ1 − θ2)/2) / [2 π_2(θ1, θ2)],    (19)

and for θ → (θ1, θ2):

α_12 = 2 π_2(θ + φ, θ − φ) / [π_1(θ) q(φ)].    (20)
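This two-subspace example can be turned into a short runnable toy. In the sketch below the target densities π1 and π2, the model weights, and the proposal width are illustrative assumptions, chosen so that the expected subspace occupation, w1/(w1 + w2), can be checked, and a standard Metropolis step within the current subspace is interleaved with the jumps as suggested in §3.3.

```cpp
// Sketch of the reversible-jump example of Section 3.3.1 (illustrative, not BIE code).
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>

int main() {
    const double PI = 3.14159265358979323846;
    std::mt19937 gen(42);
    std::normal_distribution<double> q(0.0, 0.5);          // proposal density q(phi)
    std::uniform_real_distribution<double> u01(0.0, 1.0);

    const double w1 = 0.3, w2 = 0.7;                       // assumed model weights
    auto norm = [&](double x, double s) {                  // N(0, s^2) density
        return std::exp(-0.5 * x * x / (s * s)) / (s * std::sqrt(2.0 * PI));
    };
    auto qpdf = [&](double phi) { return norm(phi, 0.5); };
    auto pi1  = [&](double t) { return w1 * norm(t, 1.0); };
    auto pi2  = [&](double t1, double t2) { return w2 * norm(t1, 1.0) * norm(t2, 1.0); };

    int model = 2;                                          // start in the 2-parameter model
    double th = 0.0, th1 = 0.1, th2 = -0.1;
    long in1 = 0, n = 200000;

    for (long i = 0; i < n; ++i) {
        if (model == 2) {                                   // propose 2 -> 1; p(2,1)=p(1,2) cancel
            double t = 0.5 * (th1 + th2), phi = 0.5 * (th1 - th2);
            double a = pi1(t) * qpdf(phi) / (2.0 * pi2(th1, th2));            // eq. (19)
            if (u01(gen) < std::min(1.0, a)) { model = 1; th = t; }
        } else {                                            // propose 1 -> 2
            double phi = q(gen);
            double a = 2.0 * pi2(th + phi, th - phi) / (pi1(th) * qpdf(phi)); // eq. (20)
            if (u01(gen) < std::min(1.0, a)) { model = 2; th1 = th + phi; th2 = th - phi; }
        }
        // Interleave an ordinary random-walk Metropolis step within the current subspace.
        if (model == 1) {
            double tp = th + q(gen);
            if (u01(gen) < std::min(1.0, pi1(tp) / pi1(th))) th = tp;
        } else {
            double t1p = th1 + q(gen), t2p = th2 + q(gen);
            if (u01(gen) < std::min(1.0, pi2(t1p, t2p) / pi2(th1, th2))) { th1 = t1p; th2 = t2p; }
        }
        if (model == 1) ++in1;
    }
    // The occupation fraction of subspace 1 should approach w1/(w1+w2) = 0.3.
    std::cout << "p(model 1) ~ " << double(in1) / n << std::endl;
    return 0;
}
```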

3.3.2 Example: BIE mixture modelling

Many problems in astronomy are mixtures of components drawn from the same model family. Each component j in the mixture is additively combined with a weight wj such that Σ_{j=1}^m wj = 1. This allows the predefinition of some generic RJMCMC transitions that are likely to work for a wide variety of problems. I consider two types of transitions. The first is the birth and death of a component. The birth step is implemented by selecting a new component from the prior distribution for the component parameters for the user-specified model. The prior for the weights is chosen here to be a Dirichlet distribution (eq. B4) with a single user-specified shape parameter, since each component is assumed to be indistinguishable a priori. Assume that the new component is born in a state with m components. The new (m + 1)th weight is selected from the prior distribution for m + 1 weights after marginalising over m of them. The m weights in the current state are scaled to accommodate the choice. The death of a component is the inverse of the birth step, defined by detailed balance. The second type of transitions are split and join transitions; again, these are inverses. For the split step, one selects a component at random and splits it into two components with an additive or multiplicative shift as follows: θ1 = θ0 + δθ, θ2 = θ0 − δθ, or θ1 = θ0(1 + ε), θ2 = θ0(1 − ε).

For an example, consider a toy model for a group of galaxies where the light from each galaxy has a normal distribution with the same width. Each image has two well-separated components with A1 = 4000 and A2 = 1000 counts. A tertiary component with five different amplitudes A3 is added along the line between the two, separated by a half width. With this choice, the secondary and tertiary components are blended and appear as a single peak (see Fig. 1 for details). The results of applying the reversible jump transitions above to these four images and one with


Figure 1. (Panels a-d show Models 1-4.) Four test models with three two-dimensional Gaussian distributions of width 0.1 in each dimension. The three components have centres at (0.2, 0.2), (0.9, 0.9), and (0.3, 0.3). The first and second components for all models were realised with 1000 and 4000 counts, respectively. The third component for Models 1, 2, 3 and 4 has 1000, 500, 300, and 200 counts, respectively. Components two and three are barely distinguishable by eye, even for Models 1 and 2, and appear as an elongation. The "by eye" differences between Models 3, 4 and Model 5 with zero tertiary amplitude are not easily distinguished.

an empty tertiary component are described in Table 2. The total number of counts is Atotal = Σ_{i=1}^3 Ai, and p(m) denotes the probability of states in subspace m. The prior probability on the centre coordinates is uniform in the range (−0.2, 1.2). The prior probability on each subspace with m ∈ [1, 6] components is Poisson with a mean of 1; that is, I am intentionally preferring fewer components, but tests with Poisson means of 2 and 3 show that the results are insensitive to this choice. The MCMC simulations use the tempered transitions algorithm described in §3.1.2 with Tmax = 8 and 16 logarithmically-spaced temperature levels. The Metropolis-Hastings transition probability is a uniform ball whose widths are adjusted to achieve an acceptance rate of approximately 0.15. The RJMCMC algorithm puts nearly all of the probability on the two- and three-component subspaces. The posterior distributions of the component centres are accurately constrained to the input values in the correct subspace. Table 2 reveals a sharp transition from preferring three components to preferring two at a tertiary-to-secondary amplitude ratio, A3/A2, of 0.3 (Model 3) and below. Figure 1 shows that this procedure reflects our expectation that RJMCMC should identify three separate components when one can do so by eye. Of course, the subtle visual asymmetries that allow us to do this would be obscured by noise that would be included in a realistic model.

In summary, RJMCMC allows one to identify the number of components in a mixture without using Bayes factors. The simple set of transitions used here is likely to work well for a wide variety of model families since it depends on the mixture nature of the problem, not on properties of the underlying model families. Unlike Bayes factors, RJMCMC simulations may not be reused for model comparisons in light of a new model; rather, the RJMCMC simulation must be repeated including the new model.

3.4 Convergence testing

The BIE provides extensible support for convergence testing. Convergence testing has two goals: (1) determining when the simulation is sampling from the posterior distribution, and (2) determining the number of samples necessary to represent the distribution. Here, I address the first goal. For multiple chains, the work horse is the commonly used Gelman & Rubin (1992) statistic. This method compares the interchain variance to the intrachain variance for an ensemble of chains with different initial conditions; the similarity of the two is a necessary condition for convergence.
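For reference, the sketch below computes the Gelman & Rubin statistic in its commonly used form, the potential scale reduction factor R, from the between- and within-chain variances of a single parameter; the toy chains are illustrative and this is not the BIE's convergence-test class.

```cpp
// Sketch of the Gelman & Rubin (1992) diagnostic for one parameter.
#include <cmath>
#include <iostream>
#include <vector>

double gelman_rubin(const std::vector<std::vector<double>>& chains) {
    const size_t m = chains.size(), n = chains[0].size();
    std::vector<double> mean(m, 0.0), var(m, 0.0);
    double grand = 0.0;
    for (size_t j = 0; j < m; ++j) {
        for (double x : chains[j]) mean[j] += x;
        mean[j] /= double(n);
        grand += mean[j] / double(m);
    }
    for (size_t j = 0; j < m; ++j) {
        for (double x : chains[j]) var[j] += (x - mean[j]) * (x - mean[j]);
        var[j] /= double(n - 1);
    }
    double B = 0.0, W = 0.0;                       // between- and within-chain variance
    for (size_t j = 0; j < m; ++j) {
        B += (mean[j] - grand) * (mean[j] - grand);
        W += var[j] / double(m);
    }
    B *= double(n) / double(m - 1);
    double Vhat = (double(n - 1) / double(n)) * W + B / double(n);  // pooled variance estimate
    return std::sqrt(Vhat / W);                    // R approaches 1 as the chains converge
}

int main() {
    // Toy usage: two short chains centred on slightly different means.
    std::vector<std::vector<double>> chains = {
        {0.9, 1.1, 1.0, 1.2, 0.8, 1.05}, {1.4, 1.3, 1.5, 1.2, 1.35, 1.45}};
    std::cout << "R = " << gelman_rubin(chains) << std::endl;
    return 0;
}
```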

For single-chain algorithms, I have had good success with a diagnostic method that assesses the convergence of both marginal and joint posterior densities following Giakoumatos et al. (1999, hereafter GVDP). This method determines confidence regions for the posterior mean by batch sampling the chain. As the distribution converges, the distribution of the chain states about the mean will approach normality owing to the central limit theorem: the variance as a function of 1/√N for a sample size N will be linearly correlated for a converged simulation. This approach generalises Gelman & Rubin (1992), who used the coefficient of determination (C.O.D., the square of Pearson's product-moment correlation coefficient, e.g. Press et al. 1992, §14.5) to assess convergence. Moreover, GVDP use the squared ratio of the lengths of the empirically estimated confidence intervals for the parameters of interest as an alternative interpretation of the R diagnostic. This alternative calculation of R is simpler to compute than the original ratio of variances and is free from the assumption of normality.

For parallel chain applications, and especially for differential evolution (see §3.1.4), some chains will get stuck in regions of anomalously low posterior probability. Several outliers in a large ensemble of chains can be removed without harming the simulation as long as the chains are independent.

3.5 Goodness-of-fit testing

As described in §2, model assessment is an essential component of inference. I have explored two approaches: Bayesian p-values and Bayes factors for a non-parametric model. The first is easily applied but is a qualitative indicator only. The second has true power as a hypothesis test but is computationally intensive.

3.5.1 Posterior predictive checking

Once one has successfully simulated the posterior distribution, one may predict future data points easily. The predicted distribution of some future data Dpred after having observed the data D is

p(Dpred|D) = ∫ p(Dpred, θ|D) dθ = ∫ p(Dpred|θ, D) p(θ|D) dθ,    (21)

called the posterior predictive distribution. In the last integral expression in equation (21), p(Dpred|θ, D) is the probability of observing


Dpred given the model parameter θ and the observed data set D. The distribution p(θ|D) is the posterior distribution. For many problems of interest, the probability of observing some new data given the model parameter θ is independent of the original D. In many cases, p(Dpred|θ, D) will be the standard likelihood function p(Dpred|θ), simplifying equation (21). One may simulate the posterior predictive distribution using an existing MCMC sample as follows: 1) sample m values of θ from the posterior; 2) for each θ in the posterior set, sample a value of Dpred from the likelihood p(Dpred|θ). The m values of Dpred represent samples from the posterior predictive distribution p(Dpred|D).

One can attempt to check specific model assumptions with posterior predictive checks (PPC) using the posterior predictive distribution. The idea is simple: if the model fits, predicted data generated under the model should look similar to the observed data. That is, the discrepancy measure applied to the true data should not lie in the tails of the predicted distribution. If one sees some discrepancy, does it owe to an inappropriate model or to random variance? To answer this question (following Gelman 2003, and references therein), one generates M data sets, D^pred_1, . . . , D^pred_M, from the posterior predictive distribution p(Dpred|D). Now one chooses some number of test statistics T(D, θ) that measure the discrepancy between the data and the predictive simulations. These discrepancy measures can depend on the data D and the parameters and hyperparameters θ, which is different from standard hypothesis testing, where the test statistic depends only on the data and not on the parameters. The discrepancy measures T(D, θ) need to be chosen to investigate deviations of interest implied by the nature of the problem at hand. This is similar to choosing a powerful test statistic when conducting a hypothesis test. Any chosen discrepancy measure must be meaningful and pertinent to the assumption you want to test. Examples of this approach using the BIE may be found in Lu et al. (2012).
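A minimal sketch of such a check for a Gaussian toy model follows; the data-generating distribution, the discrepancy measure T, and the stand-in posterior sample are all illustrative assumptions rather than a BIE interface.

```cpp
// Sketch of a posterior predictive check: draw parameters from a posterior
// sample, simulate replicated data, and compare a discrepancy measure T on the
// replications to T on the observed data (a Bayesian p-value).
#include <algorithm>
#include <cmath>
#include <iostream>
#include <random>
#include <vector>

int main() {
    std::mt19937 gen(3);

    // "Observed" data: drawn here from a heavy-tailed distribution so that a
    // pure Gaussian model should fail the check on the chosen discrepancy.
    std::student_t_distribution<double> heavy(3.0);
    std::vector<double> D(200);
    for (auto& d : D) d = heavy(gen);

    // Discrepancy measure T(D): the maximum absolute value in the data set.
    auto T = [](const std::vector<double>& d) {
        double t = 0.0;
        for (double x : d) t = std::max(t, std::abs(x));
        return t;
    };
    const double Tobs = T(D);

    // Stand-in posterior sample for the Gaussian model's sigma (mean fixed at 0);
    // in practice these values would come from the MCMC simulation.
    double s2 = 0.0;
    for (double x : D) s2 += x * x / D.size();
    std::normal_distribution<double> postSigma(std::sqrt(s2), 0.05);

    // Posterior predictive replications and the tail fraction (p-value).
    const int M = 2000;
    int exceed = 0;
    for (int i = 0; i < M; ++i) {
        std::normal_distribution<double> likeDraw(0.0, std::abs(postSigma(gen)));
        std::vector<double> Dpred(D.size());
        for (auto& d : Dpred) d = likeDraw(gen);
        if (T(Dpred) >= Tobs) ++exceed;
    }
    std::cout << "posterior predictive p-value = " << double(exceed) / M << std::endl;
    return 0;
}
```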

3.5.2 Non-parametric tests

This class of goodness-of-fit tests weighs a parametric null hypothesis against a non-parametric alternative. For example, one may wish to test the accuracy of an algorithm that has produced n independent variates θ1:n = (θ1, θ2, . . . , θn) intended to be normally distributed. One would test the null hypothesis that the true density is the normal distribution N(µ, σ²) against a rich non-parametric class of densities by placing a prior distribution on the null and alternative hypotheses and calculating the Bayes factor. This leads to difficult, high-dimensional calculations.

For the BIE, I adopt a remarkably clever method proposed by Verdinelli & Wasserman (1998, hereafter VW) to perform a non-parametric test without proposing alternative models directly. The VW approach is based on the following observation: since the cumulative distribution function for a variable θ, F(θ), is strictly increasing and continuous, the inverse F^{-1}(u) for u ∈ [0, 1] is the unique real number θ such that F(θ) = u. In the multivariate case, the inverse of the cumulative distribution function will not generally be unique but, instead, one may define

F^{-1}(u) = inf{θ ∈ R^d : F(θ) ≥ u}    (22)

for a parameter vector θ of rank d. Then, rather than defining a general class of densities in R^d to propose the alternative, VW consider a functional perturbation to F, say G(F(θ)), such that G maps the unit interval onto itself. The identity, G(u) = u, is the unperturbed probability distribution. Then, the test evaluates the uniformity of the distribution of probabilities under each hypothesis.

Table 3. Marginal likelihood values for the Verdinelli-Wasserman tests described in §3.5.2

Class         Model type   ln P(D|M)               B†12
VW (1)        Gaussian     586.7 (+1.6, −0.01)     −1.4 (+1.6, −0.02)
Fiducial (2)  Gaussian     588.1 (+0.01, −0.01)
VW (1)        Power law    588.7 (+4.1, −3.5)      474.2 (+4.1, −3.5)
Fiducial (2)  Power law    114.5 (+0.01, −0.01)

† Following the definition in §2.3, B12 denotes the odds that Model 1 is more likely than Model 2.

To be concrete, let F0 describe the probability of a sample θ under hypothesis H0. That is, F0 is the cumulative distribution function of the posterior distribution. Under H0, one expects each random variate F0(θj) for all j to be independent and identically distributed (iid) as a uniform distribution on the unit interval. This may be notated as follows:

H0 : F0(θ1), F0(θ2), . . . , F0(θn) ~ iid Uniform(0, 1).

Alternatively, a poor fit will satisfy the hypothesis H1:

H1 : F0(θ1), F0(θ2), . . . , F0(θn) not ~ iid Uniform(0, 1).

To construct the functional perturbation G, Verdinelli & Wasserman use a sequence of Legendre polynomials, ξj(·), j = 1, 2, . . ., defined over the unit interval to construct infinite exponential densities of the form

g(u|ψ) = exp[ Σ_{j=1}^∞ ψj ξj(u) − c(ψ) ],

where ψ = (ψ1, ψ2, . . .) are coefficients and

c(ψ) = log ∫_0^1 du exp[ Σ_{j=1}^∞ ψj ξj(u) ]

is a normalising constant. VW put priors on ψ, given by ψj ∼ N(0, τ²/c²j), where τ and the cj are appropriately chosen constants. VW also specify a hyperprior on τ: τ ∼ N(0, w²) truncated to positive values. This distribution provides finite probability for obtaining the null hypothesis near τ = 0 and decreases monotonically for larger perturbations from the null, maintaining the perturbative nature of the alternative hypothesis.
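To make the construction concrete, the sketch below evaluates a truncated version of g(u|ψ), assuming shifted Legendre polynomials ξj(u) = Pj(2u − 1) as the basis (the precise normalisation of the basis is an assumption), with the normalising constant c(ψ) obtained by simple quadrature:

```cpp
// Sketch of the Verdinelli & Wasserman perturbation density g(u|psi) with a
// truncated shifted-Legendre basis and a trapezoidal estimate of c(psi).
#include <cmath>
#include <iostream>
#include <vector>

// Legendre polynomial P_j(x) on [-1,1] via the standard three-term recurrence.
double legendreP(int j, double x) {
    double p0 = 1.0, p1 = x;
    if (j == 0) return p0;
    for (int k = 1; k < j; ++k) {
        double p2 = ((2.0 * k + 1.0) * x * p1 - k * p0) / (k + 1.0);
        p0 = p1; p1 = p2;
    }
    return p1;
}

// Unnormalised exponent sum_j psi_j xi_j(u) with xi_j(u) = P_j(2u - 1).
double exponent(double u, const std::vector<double>& psi) {
    double s = 0.0;
    for (size_t j = 0; j < psi.size(); ++j)
        s += psi[j] * legendreP(int(j) + 1, 2.0 * u - 1.0);
    return s;
}

int main() {
    std::vector<double> psi = {0.3, -0.1, 0.05};   // illustrative coefficients

    // c(psi) = log Integral_0^1 exp(sum_j psi_j xi_j(u)) du (trapezoid rule).
    const int N = 2000;
    double I = 0.0;
    for (int i = 0; i <= N; ++i) {
        double u = double(i) / N, w = (i == 0 || i == N) ? 0.5 : 1.0;
        I += w * std::exp(exponent(u, psi)) / N;
    }
    double c = std::log(I);

    // g(u|psi) reduces to the uniform density when all psi_j = 0.
    auto g = [&](double u) { return std::exp(exponent(u, psi) - c); };
    for (double u : {0.1, 0.5, 0.9})
        std::cout << "g(" << u << "|psi) = " << g(u) << std::endl;
    return 0;
}
```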

Intuitively, this development is closely related to the probabilistic interpretation of the marginal likelihood and Bayes factors. To see this, consider the one-dimensional case for simplicity: let f(D|θ) = P(θ)P(D|θ), F0(θ) = ∫_{−∞}^θ dθ′ f(D|θ′), and P(D) = F0(∞). If the distribution of F0(θi) for the θi is not uniform in [0, 1], one can perturb f(D|θ) by moving some density from a region of undersampling to a region of oversampling and, thereby, increase P(D).

For an example, I apply the VW method to the following two models, defined by a two-dimensional normal distribution and by a power-law-like distribution with unknown centres and widths in each dimension:

P_Gauss(x, y; θx, θy, σx, σy) = [1/(2π σx σy)] e^{−r²/2},    (23)

P_Power(x, y; θx, θy, σx, σy, α) = [1/(2π σx σy)] α(1 + α)/(1 + r)^{2+α},    (24)


where r² ≡ x²/σx² + y²/σy² and α > 0. For the examples here, I adopt α = 1. I take the Gaussian model, P_Gauss, or the power-law model, P_Power, to be the null hypothesis (denoted 0). For each model, I assume the centres are normally distributed with zero mean and unit variance and that the variance is Weibull distributed (eq. B2) with scale 0.03 and shape 1. A 1000-point data set is sampled from a two-dimensional normal distribution centred at the origin with a root variance of 0.03 in each dimension.

I take the same model with the VW extension with n = 5 basis functions to be the alternative hypothesis (denoted 1) and test the support for the two hypotheses using Bayes factors. Although VW recommend the choice w = 1 for the hyperprior, I found that this tends to overfit the extended model, not favouring the model when it is correct. Rather, I adopt w = 4, which suppresses this tendency. I performed four MCMC simulations, with and without the VW extension and with both the Gaussian and power-law models. Each used the tempered differential evolution algorithm (see §3.1.4) with 32 chains and Tmax = 32 to obtain two million converged states. Then, the posterior samples were batched into groups of 250,000 states and the marginal likelihood was computed using the algorithms described in §3.2. The 90% credible interval was computed by bootstrap analysis from the batches. The results are summarised in Table 3. One sees from the table that the true model is preferred to the extended VW model, but only mildly. Conversely, the power-law model is strongly disfavoured relative to the extended model for the normal data sample. Also note: the marginal likelihood value for the normally distributed data given the normal model (Row 1 in the table) has almost the same value as the extended VW power-law model (Row 4 in the table). As pointed out by VW, the extended model provides an estimator of the true posterior density. The agreement of these two values suggests that the five basis functions provide sufficient variation to reproduce the true value and that the ten-dimensional numerical evaluation of the marginal likelihood integral using the methods described in §3.2 is sufficiently accurate.

4 CASE STUDIES

4.1 Semi-analytic galaxy formation models: BIE-SAM

Many of the physical processes parametrised in semi-analytical models of galaxy formation remain poorly understood and underspecified. This has two critically important consequences for inferring constraints on the physical parameters: 1) prior assumptions about the size of the domain and the shape of the parameter distribution will strongly affect any resulting inference; and 2) a very large parameter space must be fully explored to obtain an accurate inference. Both of these issues are naturally tackled with a Bayesian approach that allows one to constrain the theory with data in a probabilistically rigorous way. In Lu et al. (2011), we presented a semi-analytic model (SAM) of galaxy formation in the framework of Bayesian inference and illustrated its performance on a test problem using the BIE; we call the combined approach BIE-SAM. Our sixteen-parameter semi-analytic model incorporates all of the most commonly used parametrisations of important physical processes from existing SAMs, including star formation, SN feedback, galaxy mergers, and AGN feedback.

To demonstrate the power of this approach, the thirteen of these parameters that can be constrained by the K-band luminosity function were investigated in Lu et al. (2011). We find that the posterior distribution has a very complex structure and topology, indicating that finding the best fit by tweaking model parameters is improbable. As an example, Figure 2 describes isosurfaces of the posterior distribution in three of the thirteen dimensions. The surfaces have a complex geometry and are strongly inhomogeneous in any parameter direction. Moreover, the posterior clearly shows that many model parameters are strongly covariant and, therefore, the inferred value of a particular parameter can be significantly affected by the priors used for the other parameters. As a consequence, one may not tune a small subset of model parameters while keeping other parameters fixed and expect a valid result.

Apropos the discussion in §2.4, by using synthetic data to mimic systematic uncertainties in the reduced data, we also have shown that the resulting model parameter inferences can be significantly affected by the use of an incorrect error model. We used a synthetically-generated binned stellar mass function and performed two inferences: one with a realistic covariance and one with no off-diagonal covariance. The contours with the full covariance matrix are more compact, but there are also noticeable changes in the shape and orientation of the posterior distribution. This clearly demonstrates that an accurate analysis of errors, both sampling errors and systematic uncertainties, is crucial for observational data, and conversely, a data-model comparison without an accurate error model is likely to be erroneous.

The method developed here can be straightforwardly applied to other data sets and to multiple data sets simultaneously. In addition, the Bayesian approach explicitly builds on previous results by incorporating the constraints from previous inferences into new data sets; the BIE is designed to do this automatically. For many processes in galaxy formation, competing models have been proposed but not quantitatively compared. Bayes factor analyses (see §3.2) or explicit model comparison techniques such as the reversible jump algorithm (Green 1995, see also §3.3) can provide a quantitative comparison of different models for a given data set.

4.2 Galphat

Yoon et al. (2011) describes Galphat (GALaxy PHotometric ATtributes), a Bayesian galaxy image analysis package built for the BIE, designed to efficiently and reliably generate the posterior probability distribution of model parameters given an image. From the BIE point of view, both Galphat and the BIE-SAM are likelihood functions. The Galphat likelihood function is designed to produce high-accuracy photometric predictions for galaxy models in an arbitrary spatial orientation for a given point-spread function (PSF). Accurate predictions in both the core and wings of the image are essential for reliable inferences. The pixel predictions are computed using an adaptive quadrature algorithm to achieve a predefined error tolerance. The rotation and PSF convolution are performed on sub-sampled grids using an FFT algorithm. Galphat can incorporate any desired galaxy image family. For speed, we precompute subsampled grids for each model indexed by parameters that influence their shape. The desired amplitude and linear scale are easily computed through coordinate transformations. To enable this, the images are stored as two-dimensional cumulative distributions. Our current implementation uses Sersic (1963) models for disk, bulge and spheroid components; however, this approach is directly applicable to any model family.
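For reference, the sketch below evaluates a circular Sersic (1963) surface-brightness profile of the kind used for the Galphat model components; the asymptotic approximation for b_n and the parameter values are illustrative, and this is not the Galphat pixel-integration code, which integrates the profile over pixels and convolves with the PSF.

```cpp
// Sketch of a circular Sersic profile: I(r) = Ie * exp(-b_n * ((r/re)^(1/n) - 1)),
// with b_n ~ 2n - 1/3 (a common asymptotic approximation) so that re encloses
// roughly half of the total light.
#include <cmath>
#include <iostream>

double sersic(double r, double Ie, double re, double n) {
    const double bn = 2.0 * n - 1.0 / 3.0;   // approximation; illustrative only
    return Ie * std::exp(-bn * (std::pow(r / re, 1.0 / n) - 1.0));
}

int main() {
    const double Ie = 1.0, re = 2.0;         // illustrative amplitude and scale
    for (double n : {1.0, 4.0})              // exponential-disk and de Vaucouleurs-like indices
        for (double r : {0.5, 2.0, 8.0})
            std::cout << "n=" << n << " r=" << r << "  I=" << sersic(r, Ie, re, n) << "\n";
    return 0;
}
```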

Using the various tempering algorithms available in the BIE, our tests have demonstrated that we can achieve a steady-state distribution and that the simulated posterior will include any multiple modes consistent with the prior distribution. Given the posterior distribution, we may then consistently estimate the credible regions for the model parameters. We show that the surface-brightness model will often have correlated parameters and, therefore,


Figure 2. The likelihood function for 3 out of the 13 free parameters in the BIE-SAM from Lu et al. (2011): the star-formation threshold surface density fSF, the star-formation efficiency power-law index αSF, and the supernova feedback energy fraction αSN. The blue (red) surfaces enclose approximately 10% (67%) of the density. See Lu et al. (2011) for additional details.

Figure 3. The size–Sersic index relation inferred from a synthetically-generated sample of elliptical galaxy images. Left: a scatter plot using the best-fit parameters. The red curve shows a smooth fit to the ridge-line of the points. Right: the marginal posterior density for the same parameters. While the left-hand plot suggests that small galaxies have low concentrations, the right-hand plot of posterior density correctly reveals that this trend is an artifact of the model-fitting procedure. These figures originally appeared in Yoon et al. (2011).

any hypothesis testing that uses the ensemble of posterior information will be affected by these correlations. The full posterior distributions from Galphat identify these correlations and incorporate them in subsequent inferences.

These issues are illustrated in Figures 3 and 4. Figure 3 shows the size–Sersic index relation inferred from a synthetically-generated sample of elliptical galaxy images. The left-hand panel shows the traditional scatter diagram of maximum posterior parameter values; that is, this figure plots the effective radius and Sersic index for the maximum likelihood value obtained for each image fit. The red curve is a smooth estimate of the trend. The right-hand panel shows the inferred distribution based on the full posterior distribution of the ensemble after marginalising over all of the parameters but the effective radius and the Sersic index. The left-hand panel incorrectly suggests that smaller galaxies are less concentrated while the right-hand panel correctly reveals that the size and


Figure 4. Marginal posterior densities for 100 galaxies with a signal-to-noise ratio of 20 and randomly sampled Sersic model parameters. The values are the magnitude difference from the input values (∆MAG), the galaxy half-light radius scaled by the input values (r_ie/r_e), the Sersic index difference (∆n) and the sky value difference (∆sky in percent). The blue, red and grey curves and contours show galaxies with n ≥ 2, n < 2 and the total sample, respectively. The parameter covariance depends on both the signal-to-noise ratio and the Sersic index.

concentration are uncorrelated. Figure 4 illustrates the correlation between parameters for low-concentration galaxies (Sersic index n ≤ 2) in blue and high-concentration galaxies (n > 2) in red, with the total sample in grey scale.

We can use posterior simulations over ensembles of images to test, for example, the significance of cluster and field environments on galaxies as evidenced in their photometric parameters, such as the correlation between bulge-to-disk ratio and environment. A more elaborate example might include models for higher angular harmonics of the light distribution, and we could determine the support for these in the data using Bayes factors (§3.2).

5 DISCUSSION AND SUMMARY

Advances in digital data acquisition along with storage and retrieval technologies have revolutionised astronomy. The promise of combining these vast archives has led to organisational frameworks such as the National Virtual Observatory (NVO) and the International Virtual Observatory Alliance (IVOA). To realise the promise of this vast distributed repository, the modern astronomer needs and wants to combine multi-sensor data from various surveys to help constrain the complex processes that govern the Universe. The necessary tools are still lacking, and the Bayesian Inference Engine (BIE) was designed as a research tool to fill this gap. This paper outlines the motivation, goals, architecture, and use of the BIE and reports our experience in applying MCMC methods to observational and theoretical Bayesian inference problems in astronomy.

Most researchers are well-versed in identifying the "best" parameters of a particular model for some data using the maximum likelihood method. For example, consider the fit of a surface brightness model to galaxy images. Parameters from the maximum-likelihood solutions are typically plotted in a scatter diagram and trends are interpreted physically. However, plotting scatter diagrams from multiple data sources inadvertently mixes error models and selection effects. §2.1 described the pitfalls of this approach. Rather, the astronomer wants to test the hypothesis that the data are correlated with a coefficient larger than some predetermined value α, a complex hypothesis test. However, without incorporating the correlations imposed by the theoretical model, the error model, and the selection process, the significance of the test is uncertain. Similarly, the astronomer needs methods of assessing whether a posited model is correct. I have divided these needs into two categories: goodness-of-fit tests (§2.2) and model selection (§2.3). As an example of the former, the astronomer may have


found the best parameters using maximum likelihood, but does the model fully explain all of the features in the data? If it does not, one must either modify or reject the model before moving on to the next step. As an example of the latter, suppose an inference results in two parameter regions or multiple models that explain the data. Which model best explains the data?

All of these wants and needs, namely combining data from multiple sources, estimating the probability of model parameters, assessing goodness of fit, and selecting between competing models, are naturally addressed in a single probabilistic framework known as Bayesian inference. In particular, Bayesian inference provides a data-first discipline that demands that the error model and selection effects are specified by the probability distribution for the data given the model M, P(D|θ, M), colloquially known as the likelihood function L(D|θ, M). Prior results, including quantified expert opinion, are specified in the prior probability function π(θ|M). The inferential computation may be incremental: the data may be added in steps and new or additional observations may be motivated at each step, true to the scientific method. In the end, this approach may be generalised to locating the most likely models in the generalised space of models; this leads to goodness-of-fit and model comparison tests.

For scientists, the ideal statistical inference is one that lets the data "speak for themselves." This is achievable in some cases. For example, estimating a small number of parameters given a large data set tends to be independent of prior assumptions. On the other hand, hypothesis tests of two complex models may depend sensitively on prior assumptions. For a trivial example, some choices of parameters may be unphysical even though they yield good fits and should be excluded from consideration. Moreover, if two competing models fit the data equally well, any hypothesis test will be dominated by prior information. Inferences based on realistic models of astronomical systems will often lie between these two extremes. For these cases, I advocate Bayesian methods because they precisely quantify both the scientists' prior knowledge and the information gained through observation. In other words, we allow the data to "speak for themselves" but in a "dialect" of our choosing. Philosophy aside, Bayes theorem simply embodies the law of conditional probability and provides a rigorous framework for combining the prior and derived information.

With these advantages comes a major disadvantage: Bayesian inference is computationally expensive! An inference may require a huge sample from the posterior distribution, and real-world computation of P(D|θ, M) is often costly. Moreover, naive MCMC algorithms used for sampling the posterior distribution converge unacceptably slowly for distributions with multiple modes, and advanced techniques to improve the mixing between modes are needed, increasing the expense. Finally, Bayesian inference requires integrals over typically high-dimensional parameter spaces. For example, Bayes factors (§2.3) require the computation of the marginal likelihood:

P(D|M) = ∫ dθ π(θ|M) P(D|θ, M).

Evaluation of this integral suffers from the curse of dimensionality.

Nonetheless, the elegance and promise of Bayesian inference

motivated us to attempt a computational solution, and this became the BIE project. The algorithms and techniques described here, all of which and more are available in the BIE, have proved useful for addressing the complications found in research problems. In short, the BIE fills a gap between tools developed for small-scale problems or those designed to test new algorithms and a computational platform designed for production-scale inference problems typical of present-day astronomical survey science. Its primary product is a sample from a posterior distribution to be used for parameter estimation and model selection. Other Bayesian applications, such as non-parametric inference and clustering, should be possible with little modification, but have not yet been investigated. The BIE is designed to run on high-performance computing clusters, although it will also run on workstations and laptops.

The open object-oriented architecture allows for cross-fertilisation between researchers and groups with both mathematical and scientific interests, e.g. both those developing new algorithms and those developing new astronomical models for different applications. New classes contributed by one user become available to all users after an upgrade. Approaches implemented in one user's new classes may solve an unanticipated set of problems for other users. In this way, the BIE is a distributed collaborative system similar to packages like IDL or modules in Python. I anticipate that users with a variety of technical skill levels will use the BIE. By reusing and modifying supplied examples, a user's model of a likelihood function can be straightforwardly added to the system without any detailed knowledge of the internals. A user's new model becomes an internally-documented first-class object within the BIE by following the examples as templates. There is also room for the experienced programmer to improve the low-level parallelism or implement more efficient heuristics for likelihood evaluations. The BIE includes a full persistence subsystem to save the state and data of a running MCMC simulation. This facilitates both checkpointing and recovery as well as later use of inferred posterior distributions in new and unforeseen ways. A future version will implement a built-in database for warehousing results, including the origin and history of both the data and computation, along with labels, notes and comments. Altogether, this will constitute an electronic notebook for Bayesian inference.

This paper provides some concrete examples of parameter estimation and model comparison using the BIE. These do not exhaust the full BIE feature set but, rather, provide a quick introduction to illustrate and motivate its use. The BIE provides the astronomer an organisational and computational schema that discriminates between models or hypotheses and suggests the best use of scarce observational resources. In short, if we can make better use of the interdependency of our observations given our hypotheses, then we can generate a far clearer picture of the underlying physical mechanisms. In many ways, the Bayesian approach emulates the empirical process: begin with a scientific belief or expectation, add the observed data and then modify the extent of that belief to generate the next expectation. In effect, as more and more data are added to the model, the accurate predictions become reinforced and the inaccurate ones rejected. This approach relieves the scientist from directly confronting the complex interdependencies within the data since those interdependencies are automatically incorporated into the model. Although our scientific motivations are astronomical, the BIE can be applied to many different complex systems and may even find applications in areas as diverse as biological systems, climate change, and finance.

6 ACKNOWLEDGEMENTS

This material is based upon work supported by the National Science Foundation under Grant Nos. 0611948 and 1109354 and by the NASA AISR Program through award NNG06GF25G. The initial development of the BIE was previously funded by NASA's Applied Information and System Research (AISR) Program. I thank Neal Katz


and Eliot Moss for comments on an early draft of this manuscript, and Alison Crocker, Neal Katz, Yu Lu, and Michael Petersen for comments on the final draft.

REFERENCES

Berntsen J., Espelid T. O., Genz A., 1991, ACM Trans. Math. Soft., 17, 437–451

Gelman A., 2003, International Statistical Review, 71, 369

Gelman A., Carlin J., Stern H., Rubin D., 1995, Bayesian Data Analysis. Chapman and Hall

Gelman A., Rubin D. B., 1992, Stat. Sci., 7, 457

Geyer C. J., 1991, Markov chain Monte Carlo maximum likelihood, in Keramidas ed., Computing Science and Statistics: Proceedings of the 23rd Symposium on the Interface. Interface Foundation, pp 156–163

Giakoumatos S., Vrontos I., Dellaportas P., Politis D., 1999, J. Comput. Graph. Statistics, 8, 431

Green P. J., 1995, Biometrika, 82, 711

Hastings W. K., 1970, Biometrika, 57, 97

Jeffreys H., 1961, Theory of Probability (3rd edition). Oxford University Press

Kass R. E., Raftery A. E., 1995, Journal of the American Statistical Association, 90, 773

Kirkpatrick S., Gelatt C. D., Vecchi M. P., 1983, Science, 220, 671

Lewis S. M., Raftery A. E., 1997, J. Am. Stat. Assoc., 440, 648

Lindley D. V., 1957, Biometrika, 44, 187–192

Liu J. S., 2004, Monte Carlo Strategies in Scientific Computing. Springer Series in Statistics, Springer

Lu Y., Mo H. J., Katz N., Weinberg M. D., 2012, MNRAS, in press

Lu Y., Mo H. J., Weinberg M. D., Katz N., 2011, MNRAS, 416, 1949

Metropolis N., Rosenbluth A., Rosenbluth M., Teller A., Teller E., 1953, J. Chem. Phys., 21, 1087

Neal R. M., 1996, Statistics and Computing, 6, 353

Newton M. A., Raftery A. E., 1994, Journal of the Royal Statistical Society, Ser. B, 56, 3

Pearson K., 1900, Philosophical Magazine, 50, 157

Press W. H., Teukolsky S. A., Vetterling W. T., Flannery B. P., 1992, Numerical Recipes in C, second edn. Cambridge University Press, Cambridge

Price K., 1997, Dr. Dobbs Journal, 264, 18

Raftery A. E., 1995, Hypothesis testing and model selection with posterior simulation, in Markov Chain Monte Carlo in Practice

Robert C., Casella G., 2004, Monte Carlo Statistical Methods, second edn. Texts in Statistics, Springer

Sersic J. L., 1963, Boletin de la Asociacion Argentina de Astronomia La Plata Argentina, 6, 41

Storn K., 1999, in Corne D., Dorigo M., Glover F., eds, New Ideas in Optimization. McGraw-Hill, London, pp 79–108

Storn R., Price K., 1997, Journal of Global Optimization, 11, 341

Ter Braak C. J. F., 2006, Stat. Comput., 16, 239

Trotta R., 2008, Contemporary Physics, 49, 71

Verdinelli I., Wasserman L., 1998, Ann. Statist., 26, 1215

Weinberg M. D., 2012, Bayesian Analysis, accepted

Yoon I., Weinberg M. D., Katz N. S., 2011, MNRAS, 414, 1625

Yoon I., Weinberg M. D., Katz N. S., 2012, MNRAS, to be submitted

APPENDIX A: BIE: TECHNICAL OVERVIEW

The BIE is a general-purpose system for simulating Bayesian posterior distributions for statistical inference and has been stress tested using high-dimensional models. As described in the previous sections, the inference approach uses the Bayesian framework enabled by Markov chain Monte Carlo (MCMC) techniques. The software is parallel and multi-threaded and should run in any environment that supports POSIX threads and the widely-implemented Message Passing Interface (MPI, see http://www.mpi-forum.org). The package is written in C++ and developed on the GNU/Linux platform, but it should port to any GNU platform.

BIE at its core is a software library of interoperable components necessary for performing Bayesian computation. The BIE classes are available as both C++ libraries and as a stand-alone system with an integrated command-line interface. The command-line interface is well tested and is favoured by most users so far. A user does not need to be an expert, or even an MPI programmer, to use the system; the simple user interface is similar to MatLab or Gnuplot. In addition to the engine itself, the BIE package includes a number of stand-alone programs for viewing and analysing output from the BIE, testing the convergence of a simulation, manipulating the simulation output, and computing the marginal likelihood using the algorithms described in §3.2. A future release will include wrappers for Python.

A1 Software architecture

As in the GIMP Toolkit (GTK+) and the Visualisation Toolkit (VTK), the BIE uses C++ to facilitate implementing an object-oriented design. Object-oriented programming enforces an intimate relationship between the data and procedures: the procedures are responsible for manipulating the data to reflect its behaviour. The software objects (classes in C++) represent real-world probability distributions, mathematical operators and algorithms, and this presents a natural interface and set of inter-object relationships to the user and the developer. Programs that want to manipulate an object have to be concerned only with which messages the object understands, and do not have to worry about how these tasks are achieved nor about the internal structure of the object.

Another powerful feature of object-oriented programming is inheritance (derived classes in C++). The derived class inherits the properties of its base class and also adds its own data and routines. This structure makes it easy to make minor changes in the data representation or the procedures. Changes inside a class do not affect any other part of a program, since the only public interface that the external world has to a class is through the use of methods. This facilitates adding new features or responding to changing operating environments by introducing a few new objects and modifying some existing ones. These features encourage extensibility through the reuse of commonly used structures and innovation by allowing the user to connect components in new and possibly unforeseen ways. In addition, this facilitates combining concurrently


developed software contributions from scientists interested in specific models or data types and MCMC algorithms.

Motivated by handling large amounts of survey data with the subsequent possible need to investigate the appropriate simulation for a variety of models or hypotheses, the BIE separates the computation into a collection of subsystems:

(i) Data input and output
(ii) Data distribution and spatial location
(iii) Markov chain simulation
(iv) Likelihood computation
(v) Model and hypothesis definition

Each of these can be specified independently and easily mixed and matched in our object-oriented architecture. The cooperative development enabled by this architecture is similar to that behind the open-source movement, and this project is a testament to its success in an interdisciplinary scientific collaboration. The BIE uses SVN version management, the GNU build system (autoconf, automake), GNU coding standards, and DejaGNU regression testing to aid in portability. Moreover, I believe that this same social model will extend to remote collaborative efforts as the BIE project matures.

A2 Persistence system

The researcher needs to be able to stop, restart, and possibly refocus inferential computations for both technical and scientific reasons. The BIE was designed with these scenarios in mind. The BIE's persistence system is built on top of the BOOST (http://www.boost.org) serialisation library. The BIE classes inherit from a base serialisation class that provides the key serialisation members and a simple mnemonic scheme to mark persistent data in newly developed classes. The serialisation-based persistence system avoids irrevocably modifying data sets or files. Rather, it views the computation as a series of functions, each accepting one or more input data sets and producing one or more new output data sets. The system records and timestamps these computations and the relationships between inputs and outputs in an archived research log, so that one can always go back and determine the origin of data and how it was processed. The most common use of BIE persistence to date is checkpointing and recovery. Checkpointing guards against loss of computation by saving intermediate data to support recovery in the middle of long-running computational steps; and it allows one to "freeze" or "shelve" a computation and pick it up later. It also provides the basic support needed to interrupt a computation, do some reconfiguring, and resume, as when machines need to be added to or removed from a cluster, etc.
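To illustrate the underlying mechanism, the sketch below checkpoints and restores a small state object using the BOOST serialisation library; the class, its members, and the archive format are illustrative and do not reproduce the BIE's own serialisation base class or archive layout.

```cpp
// Sketch of checkpointing with Boost.Serialization: a hypothetical piece of
// chain state is streamed to disk and rebuilt on restart.
#include <fstream>
#include <vector>
#include <boost/archive/text_oarchive.hpp>
#include <boost/archive/text_iarchive.hpp>
#include <boost/serialization/vector.hpp>

struct ChainState {
    unsigned long step;
    std::vector<double> theta;

    template <class Archive>
    void serialize(Archive& ar, const unsigned int /*version*/) {
        ar & step;      // members marked here are written to, and read from, the archive
        ar & theta;
    }
};

int main() {
    ChainState s{1000UL, {0.3, 1.7, -0.2}};

    {   // checkpoint: stream the state to disk
        std::ofstream out("chain.ckpt");
        boost::archive::text_oarchive oa(out);
        oa << s;
    }

    ChainState restored{0UL, {}};
    {   // recovery: rebuild the state from the archive
        std::ifstream in("chain.ckpt");
        boost::archive::text_iarchive ia(in);
        ia >> restored;
    }
    return restored.theta.size() == 3 ? 0 : 1;
}
```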

A3 Extensibility

The BIE is designed to be extensible. The user may define new classes for any aspect of the MCMC simulation, such as MCMC algorithms, convergence tests, prior distributions, data types, and likelihood functions. The code tree includes a Projects directory, automatically compiled into any local build, that may contain any locally added functionality. For example, both the BIE-SAM and Galphat were derived from a base class specifically for user-defined likelihood functions.
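As a schematic of this pattern, with hypothetical class names that do not correspond to the actual BIE interface, a user-defined likelihood only has to override the evaluation method of an abstract base class that the samplers call:

```cpp
// Sketch of an extensible likelihood hierarchy (hypothetical names, not the BIE API).
#include <cmath>
#include <iostream>
#include <utility>
#include <vector>

// Abstract interface that an MCMC driver would call.
class LikelihoodFunction {
public:
    virtual ~LikelihoodFunction() = default;
    virtual double logLikelihood(const std::vector<double>& theta) const = 0;
};

// Example user model: independent Gaussian measurements with unknown mean.
class GaussianMeanLikelihood : public LikelihoodFunction {
public:
    GaussianMeanLikelihood(std::vector<double> data, double sigma)
        : data_(std::move(data)), sigma_(sigma) {}

    double logLikelihood(const std::vector<double>& theta) const override {
        const double PI = 3.14159265358979323846;
        const double mu = theta.at(0);
        double s = -double(data_.size()) * std::log(sigma_ * std::sqrt(2.0 * PI));
        for (double d : data_) s -= 0.5 * (d - mu) * (d - mu) / (sigma_ * sigma_);
        return s;
    }

private:
    std::vector<double> data_;
    double sigma_;
};

int main() {
    GaussianMeanLikelihood like({0.9, 1.1, 1.3}, 0.5);
    std::cout << "log L(mu=1) = " << like.logLikelihood({1.0}) << std::endl;
    return 0;
}
```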

The source tree is available for download from http://www.astro.umass.edu/bie. The package includes Debian and Ubuntu package management scripts so that local .deb packages may be built. Users have had success building on other modern Linux distributions.

APPENDIX B: DEFINITION OF DISTRIBUTIONS

B1 Normal distribution

The normal (or Gaussian) distribution is the familiar bell-shaped density function that results from the central limit theorem:

N(θ; µ, σ²) = [1/(σ√(2π))] exp[−(θ − µ)²/(2σ²)],    (B1)

where the parameter µ is the mean and σ² is the variance. See any standard reference in probability and statistics for the multivariate generalisation.

B2 Weibull distribution

The Weibull distribution has a scale parameter λ and a shape parameter κ. It includes the exponential distribution for κ = 1 but becomes more peaked around λ as κ increases. The probability density function of a Weibull random variable θ is:

W(θ; λ, κ) = (κ/λ)(θ/λ)^{κ−1} exp[−(θ/λ)^κ]  for θ ≥ 0,  and 0 for θ < 0.    (B2)

B3 Beta distribution

The beta distribution is a two-parameter density defined on the unit interval. The probability density function of the beta distribution is:

P_B(x; α, β) = [Γ(α + β)/(Γ(α)Γ(β))] x^{α−1} (1 − x)^{β−1},    (B3)

where Γ(·) is the gamma function.

B4 Dirichlet distribution

The Dirichlet distribution is the multivariate r-dimensional generalisation of the beta distribution with parameters αi, i = 1, . . . , r. Thinking of each dimension as a separate type of event i, this distribution describes the probability of events where each type of event has been previously observed αi − 1 times. The Dirichlet distribution of order r ≥ 2 with parameters αi has the probability density function:

P_D(x1, . . . , xr; α1, . . . , αr) = [Γ(Σ_{i=1}^r αi) / Π_{i=1}^r Γ(αi)] Π_{i=1}^r x_i^{αi−1},    (B4)

where x1, . . . , x_{r−1} > 0 and x_r = 1 − Σ_{i=1}^{r−1} xi > 0.
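For reference, a Dirichlet variate as in equation (B4), which is used for the mixture-weight prior in §3.3.2, can be drawn by normalising independent Gamma(αi, 1) variates; the sketch below uses this standard construction.

```cpp
// Sketch of Dirichlet sampling via normalised Gamma(alpha_i, 1) variates.
#include <iostream>
#include <random>
#include <vector>

std::vector<double> sampleDirichlet(const std::vector<double>& alpha, std::mt19937& gen) {
    std::vector<double> x(alpha.size());
    double sum = 0.0;
    for (size_t i = 0; i < alpha.size(); ++i) {
        std::gamma_distribution<double> g(alpha[i], 1.0);
        x[i] = g(gen);
        sum += x[i];
    }
    for (double& v : x) v /= sum;   // components are positive and sum to one
    return x;
}

int main() {
    std::mt19937 gen(11);
    for (double w : sampleDirichlet({1.0, 1.0, 1.0}, gen)) std::cout << w << " ";
    std::cout << std::endl;
    return 0;
}
```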
