


Bayesian inference in high-dimensional models

    Sayantan Banerjee, Ismael Castillo and Subhashis Ghosal

Abstract Models with dimension often more than the available sample size are now commonly used in various applications. A sensible inference is possible using a lower-dimensional structure. In regression problems with a large number of predictors, the model is often assumed to be sparse, with only a few predictors active. Interdependence between a large number of variables is succinctly described by a graphical model, where variables are represented by nodes on a graph and an edge between two nodes is used to indicate their conditional dependence given other variables. Many procedures for making inferences in the high-dimensional setting, typically using penalty functions to induce sparsity in the solution obtained by minimizing a loss function, were developed. Bayesian methods have been proposed for such problems more recently, where the prior takes care of the sparsity structure. These methods have the natural ability to also automatically quantify the uncertainty of the inference through the posterior distribution. Theoretical studies of Bayesian procedures in high dimension have been carried out recently. Questions that arise are whether the posterior distribution contracts near the true value of the parameter at the minimax optimal rate, whether the correct lower-dimensional structure is discovered with high posterior probability, and if a credible region has adequate frequentist coverage. In this paper, we review the properties of Bayesian and related methods for several high-dimensional models such as the many normal means problem, linear regression, generalized linear models, and Gaussian and non-Gaussian graphical models. Effective computational approaches are also discussed.

Sayantan Banerjee
Operations Management and Quantitative Techniques Area, Indian Institute of Management, Rau-Pithampur Road, Indore, MP 453556, India. e-mail: [email protected]

Ismael Castillo
Laboratoire de Probabilités, Statistique et Modélisation, Sorbonne Université, Case Courrier 188, 4 Place Jussieu, 75252 Paris Cedex 05, France. e-mail: [email protected]

Subhashis Ghosal
Department of Statistics, North Carolina State University, Raleigh, NC 27695-8203, U.S.A. e-mail: [email protected]



    1 Introduction

Advances in technology have resulted in massive datasets collected from all aspects of modern life. Datasets appear from internet searches, mobile apps, social networking, cloud computing, wearable devices, as well as from more traditional sources such as bar-code scanning, satellite imaging, air traffic control, banking, finance, and genomics. Due to the complexity of such datasets, flexible models are needed involving many parameters, routinely exceeding the sample size. In such a situation, a sensible inference is possible only if there is a hidden lower-dimensional structure involving far fewer parameters. This will happen in a linear or a generalized linear regression model if the vector of regression coefficients mostly consists of zero entries. In this situation, sensible inference will be possible by forcing the estimated coefficient to be sparse through an automatic mechanism determined by the data. Another important problem is studying the interrelationship among a large class of variables. It is sensible to think that only a few pairs have some intrinsic relations between them when the effects of other variables are eliminated, that is, most pairs of variables are conditionally independent given other variables. The underlying sparsity is very conveniently described by a graph, where the variables are represented by the nodes of a graph, and an edge connecting a pair is present only if they are conditionally dependent given other variables. Hence the resulting model is called a graphical model. When the variables are jointly Gaussian, the absence of an edge is equivalent to having a zero entry in the precision (inverse of the covariance) matrix. Thus learning the structural relationship in such a Gaussian graphical model is possible by forcing the estimated precision matrix to have zero entries in most places. Other problems which effectively use a lower-dimensional structure include matrix completion problems (where many entries of a matrix are missing and it is assumed that the underlying true matrix has a sparse plus a low-rank structure), and stochastic block models (where the extent of interaction between two nodes is determined solely by their memberships in certain hidden blocks of nodes).

Numerous methods of estimating parameters in the high-dimensional setting have been proposed in the literature, most of which use the penalization approach. The idea is to add a suitable penalty term to the loss function to be optimized, so that the resulting solution is forced to be sparse. The most familiar method is the lasso (Tibshirani [96]): β̂ = argmin_β {∑_{i=1}^n (Y_i − ∑_{j=1}^p β_j X_{ij})² + λ ∑_{j=1}^p |β_j|} for the linear regression model Y_i = ∑_{j=1}^p β_j X_{ij} + ε_i, i = 1, . . . , n. The sharp corners of the contours of the ℓ_1-penalty function ∑_{j=1}^p |β_j| force sparsity in the resulting estimate, where the extent of the sparsity depends on the tuning parameter λ and the data. While there are many other important penalization procedures for this and related problems, we shall not explicitly refer to them except to introduce and compare with a Bayesian method, which is the primary object of interest in this review. An excellent source of information on the frequentist literature for high-dimensional models is Bühlmann and van de Geer [19].
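As a concrete illustration (a minimal sketch, not part of the original text), the lasso fit above can be computed with standard software; note that scikit-learn minimizes (1/(2n))‖Y − Xβ‖² + α‖β‖_1, so its α corresponds to λ/(2n) in the display above, and the tuning value below is only one common theoretical choice.

# Minimal sketch (not from the paper): lasso fit on simulated sparse data.
# scikit-learn's Lasso minimizes (1/(2n))||Y - X beta||^2 + alpha ||beta||_1,
# so alpha corresponds to lambda/(2n) in the objective displayed in the text.
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
n, p, s = 100, 200, 5                      # more predictors than observations
X = rng.standard_normal((n, p))
beta_true = np.zeros(p)
beta_true[:s] = 3.0                        # only s active coefficients
Y = X @ beta_true + rng.standard_normal(n)

lam = 2 * np.sqrt(2 * n * np.log(p))       # a common theoretical choice of lambda
fit = Lasso(alpha=lam / (2 * n)).fit(X, Y)
print("selected predictors:", np.flatnonzero(fit.coef_ != 0))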


Bayesian methods for high-dimensional models have seen a lot of recent interest. A hidden lower-dimensional structure such as sparsity can be easily incorporated in a prior distribution, for instance, by allowing a point-mass (or some analog of that) at zero. On top of providing a point estimate, a Bayesian method has the natural ability to assess the uncertainty in structure learning and provide credible sets with attached probabilities for uncertainty quantification. Good Bayesian procedures should have desirable frequentist properties, like a minimax optimal rate of contraction, consistency of variable selection (or, more generally, structure learning), and asymptotic frequentist coverage of credible sets. Investigating these properties is a primary research objective, which is particularly important in high-dimensional settings because identifying appropriate priors is more challenging. Continued research has shown that under suitable choices, Bayesian methods can asymptotically perform very well. However, a Bayesian method is typically also very computationally intensive, especially if a traditional Markov chain Monte Carlo (MCMC) procedure is used, since the MCMC iterations have to explore a large number of possible models. In recent years, effective computational strategies using continuous shrinkage priors, the expectation-maximization (EM) algorithm, variational approaches, Hamiltonian Monte Carlo, and Laplace approximations have been proposed. Our review will also address some aspects of computation in the Supplementary Materials part.

We shall cover several models of interest: many normal-means model; high-dimensional linear regression model; high-dimensional nonparametric regression; other regression models; high-dimensional classification; Gaussian graphical model; Ising and other non-Gaussian graphical models; nonparanormal graphical models; matrix models such as structured sparsity and stochastic block models. We shall address issues such as posterior contraction, variable or feature selection, distributional approximation, and coverage of credible sets. Effective computational techniques using continuous shrinkage priors and other innovative means will be discussed in the Supplementary Materials part, as well as modifications to the Bayesian approach using fractional posterior and PAC-Bayesian methods.

A generic statistical model indexed by a parameter θ is written as P = {P_θ^{(n)} : θ ∈ Θ}, with Θ a subset of a (typically, here, high-dimensional) metric space equipped with a distance d. Let Π stand for a prior distribution on Θ, and suppose one observes data Y^{(n)} ∼ P_θ^{(n)}. Let the true value of θ be denoted by θ_0, so that we study convergence under the distribution P_{θ_0}. We say that ε_n is a contraction rate (at θ_0, with respect to a metric d on Θ) if E_{θ_0} Π[d(θ, θ_0) ≤ M_n ε_n | Y^{(n)}] → 1 as n → ∞ for every sequence M_n → ∞; see Ghosal and van der Vaart [41]. Let N(0, Σ) stand for the centered Gaussian distribution with covariance Σ and Lap(λ) denote the Laplace distribution with parameter λ > 0. Let I_d be the identity matrix of dimension d ≥ 1, ‖·‖_p, p ≥ 1, be the ℓ_p-norm, and ‖·‖ stand for the Euclidean norm, i.e., the ‖·‖_2-norm.


    2 Priors for sparsity

    2.1 Spike-and-slab priors

As sparsity plays a very important role in high-dimensional modeling, priors inducing sparsity are of special interest. While an entry θ may well be zero, the possibility of a non-zero value, even a large one, cannot be ruled out. This can be thought of as two superimposed regimes, one corresponding to zero or very small values of θ and the other to possibly large values, and the prior should give a probabilistic mechanism to choose between the regimes. To address this, a spike-and-slab prior has been considered (Mitchell and Beauchamp [72], Ishwaran and Rao [48]):

π(θ) = (1 − w)φ_0(θ) + w φ_1(θ),   (1)

where φ_0 is a density highly concentrated at 0, φ_1 is a density (usually symmetric about 0) allowing intermediate and large values of θ, and w is a small parameter, thus inducing sparsity in the mixture. For instance, φ_0 may be the normal density with mean 0 and a small variance v_0, and φ_1 the normal density with mean zero and a relatively large variance v_1. The parameters in φ_0 and φ_1, as well as w, may be given further priors. Another choice, in which both φ_0 and φ_1 are Laplace densities, was proposed and called the spike-and-slab lasso (Ročková and George [91]). The primary motivation is to identify the posterior mode as a sparse vector, as in the lasso. Generally, the spike part of the prior induces shrinkage towards zero, which can be limited by using a heavier-tailed density φ_1 for the slab, such as a t-density, or at least one as heavy-tailed as the Laplace density. For the spike part, an extreme choice is the distribution degenerate at 0, which corresponds to exact sparsity; the resulting prior will be referred to as the hard-spike-and-slab prior, while the term soft-spike-and-slab prior will be used if the spike has a density. Johnson and Rossell [50, 51] argued in favor of non-local priors, which make the spike as separated as possible from the slab by choosing slab distributions that have very little mass close to 0.
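As a small illustration (not part of the original text, with arbitrary parameter values), with a N(0, v_0) spike, a N(0, v_1) slab, and a single observation Y ∼ N(θ, 1), the posterior probability that θ was drawn from the slab component of (1) is available in closed form; the sketch below evaluates it.

# Minimal sketch (not from the paper): posterior weight of the slab component
# under the soft-spike-and-slab prior (1) with normal spike N(0, v0) and
# normal slab N(0, v1), for a single observation Y ~ N(theta, 1).
import numpy as np
from scipy.stats import norm

def slab_weight(y, w=0.1, v0=0.01, v1=10.0):
    # Marginally Y ~ (1-w) N(0, 1+v0) + w N(0, 1+v1); Bayes' rule gives the
    # posterior probability that theta came from the slab component.
    m0 = norm.pdf(y, scale=np.sqrt(1 + v0))
    m1 = norm.pdf(y, scale=np.sqrt(1 + v1))
    return w * m1 / ((1 - w) * m0 + w * m1)

for y in [0.0, 1.0, 3.0]:
    print(y, round(slab_weight(y), 3))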

    2.2 Continuous shrinkage priors

Computation using a spike-and-slab prior involves a latent indicator of the mixture component. Replacing the indicator by a continuous variable leads to the so-called continuous shrinkage priors, typically obtained as scale mixtures of normals. Alternative terms like global-local priors or one-component priors are also used. An early example is the Laplace prior (Park and Casella [83], Hans [47]), which is an exponential scale mixture of normals, often called the Bayesian lasso because the corresponding posterior mode is interpreted as the lasso. However, the obvious drawback of these priors is that there is not sufficient concentration near the value 0 for the entire posterior to concentrate near 0 whenever a coefficient is 0, even though the posterior mode may be sparse or nearly sparse. The prior concentration near 0 should be high while still maintaining a thick tail, which can be achieved by letting the scale parameter have a density more sharply peaked at 0.


A popular continuous shrinkage prior meeting this requirement is given by the horseshoe prior (Carvalho et al. [23]), which is a half-Cauchy scale mixture of normals:

θ | λ ∼ N(0, λ²),   λ ∼ Cauchy_+(0, τ),   (2)

where Cauchy_+(0, τ) is the half-Cauchy distribution with scale τ. The corresponding marginal density of θ has a pole at 0 and Cauchy-like tails. A further prior may be put on τ, such as another half-Cauchy prior, leading to the Horseshoe+ prior (Bhadra et al. [15]).

One may consider different priors on the scale λ in (2) with a high concentration near 0. The scale parameter λ is entry-specific and is called the local shrinkage parameter. The scale parameter τ is common to all entries and is called the global shrinkage parameter. The local shrinkage parameter should have a heavy tail, while the global shrinkage parameter should have a high concentration at 0 (Polson and Scott [87]), respectively controlling the tail and the sparsity. Various choices of mixing distributions lead to many continuous shrinkage priors, such as the normal-inverse-Gaussian prior (Caron and Doucet [21]), the normal-gamma prior (Griffin and Brown [44]), and the generalized double Pareto prior (Armagan et al. [1]). Another possibility is the Dirichlet-Laplace prior (Bhattacharya et al. [17]):

θ_i | φ, τ ∼ Lap(φ_i τ),   φ = (φ_1, . . . , φ_p) ∼ Dir(a, . . . , a),   (3)

where choosing a ∈ (0, 1) leads to a pole at 0 for the marginal distribution of the θ_i's, enforcing (near-)sparsity, and τ is a parameter given a gamma distribution.
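To make the hierarchy in (2) concrete, the sketch below (not part of the original text, with an arbitrary value of τ) draws from the horseshoe prior and exhibits its characteristic behavior: most draws are very close to zero, with occasional very large ones.

# Minimal sketch (not from the paper): draws from the horseshoe prior (2).
# A half-Cauchy local scale is obtained as |Cauchy(0, tau)|.
import numpy as np

rng = np.random.default_rng(0)
tau = 0.1                                        # global shrinkage parameter (arbitrary here)
lam = np.abs(tau * rng.standard_cauchy(10000))   # local scales, half-Cauchy(0, tau)
theta = rng.normal(0.0, lam)                     # theta | lambda ~ N(0, lambda^2)

print("fraction with |theta| < 0.01:", np.mean(np.abs(theta) < 0.01))
print("largest |theta|:", np.abs(theta).max())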

    3 Normal sequence model

The simplest high-dimensional model is given by the normal sequence model: Y_i = θ_i + ε_i, i = 1, . . . , n, where the ε_i are i.i.d. N(0, 1) and the parameter set Θ for θ = (θ_1, . . . , θ_n) is ℝ^n. The posterior contraction rate will be obtained uniformly for the true θ_0 belonging to the nearly black class ℓ_0[s] = {θ ∈ ℝ^n : #{i : θ_i ≠ 0} ≤ s}, 0 ≤ s ≤ n, where # stands for the cardinality of a finite set. It is known that the asymptotic minimax risk (in terms of squared error loss) is 2s log(n/s).

    3.1 Recovery using hard-spike-and-slab priors

It is clear that if each θ_i is independently given a Lap(λ) prior, the resulting posterior mode will be the lasso, which, with the choice λ ≍ √(log n), converges to the true θ at the nearly optimal rate s log n, but the whole posterior has a suboptimal contraction property (Castillo and van der Vaart [30]). This is because, without sufficient prior concentration at 0, the posterior cannot contract sufficiently fast near the truth.


A remedy is to assign an additional point-mass at zero using a hard-spike-and-slab prior: for w ∈ [0, 1] and Γ a distribution on ℝ, Π_w = Π_{w,Γ} = ⊗_{i=1}^n {(1 − w)δ_0 + wΓ}. The weight parameter w in the 'oracle' situation (i.e., when s is known) can be taken to be s/n. Realistically, the choice w = c/n is possible with a constant c > 0, but this leads to the slightly suboptimal rate s log n. In this case, under the prior, the expected number of nonzero coefficients is of the order of a constant. To obtain an improved data fit, options performing better in practice include an empirical Bayes choice ŵ of w (George and Foster [37], Johnstone and Silverman [52]). One possibility is to use an ad hoc estimator of the number of non-zero coordinates of θ, for instance by keeping only coordinates above the expected noise level: ŵ = n^{-1} ∑_{i=1}^n 1{|Y_i| > √(2 log n)}. However, this choice may be too conservative, in that signals below the universal threshold √(2 log n) may not be detected. The marginal maximum likelihood empirical Bayes (MMLE) approach consists of integrating out θ and maximizing with respect to w: ŵ = argmax_w ∏_{i=1}^n ((1 − w)φ(Y_i) + w g(Y_i)), where g = γ ∗ φ is the convolution of the slab density γ with the standard normal density φ. The plug-in posterior Π_{ŵ}(· | Y_1, . . . , Y_n) was advocated and studied in George and Foster [37], Scott and Berger [93], Johnstone and Silverman [52], and Castillo and Mismer [24], among others, and is shown to possess the optimal contraction rate for a heavy enough tail of the slab.
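The MMLE ŵ is a one-dimensional optimization and is easy to compute numerically; the sketch below (not part of the original text) evaluates g = γ ∗ φ by quadrature for a Lap(1) slab and maximizes the marginal likelihood over w on simulated data.

# Minimal sketch (not from the paper): marginal maximum likelihood estimate of
# the mixing weight w for the hard-spike-and-slab prior, with a Lap(1) slab.
# The slab marginal g = gamma * phi is evaluated by quadrature, once per observation.
import numpy as np
from scipy.stats import norm, laplace
from scipy.integrate import quad
from scipy.optimize import minimize_scalar

rng = np.random.default_rng(0)
n, s = 500, 10
theta = np.zeros(n); theta[:s] = 5.0        # a nearly black mean vector
Y = theta + rng.standard_normal(n)

def g(y):                                   # Lap(1) slab convolved with N(0, 1)
    return quad(lambda x: laplace.pdf(x) * norm.pdf(y - x), -np.inf, np.inf)[0]

gY, phiY = np.array([g(y) for y in Y]), norm.pdf(Y)

def neg_log_marginal(w):
    return -np.sum(np.log((1 - w) * phiY + w * gY))

w_hat = minimize_scalar(neg_log_marginal, bounds=(1e-4, 0.5), method="bounded").x
print("MMLE of w:", w_hat, " (true proportion of signals:", s / n, ")")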

Castillo and van der Vaart [30] considered the hierarchical Bayes approach using a prior w ∼ Beta(1, n + 1). More generally, they also considered a subset-selection prior: if π_n is a prior on the set {0, . . . , n} and S_s is the collection of all subsets of {1, . . . , n} of size s, let Π be constructed as

s ∼ π_n,   S | s ∼ Unif(S_s),   θ | S ∼ ⊗_{i∈S} Γ ⊗ ⊗_{i∉S} δ_0.   (4)

The hard-spike-and-slab prior is a special case where π_n is the binomial distribution Bin(n, w). Let θ_0 stand for the true value of the vector θ and s_0 for the cardinality of the support of θ_0, i.e., of the set {i : θ_{0i} ≠ 0}. Through the prior π_n, it is possible to choose priors on the dimension that 'penalize' large dimensions more than the binomial and achieve the optimal rate s_0 log(n/s_0) uniformly over ℓ_0[s_0], for instance, using the complexity prior π_n(s) ∝ exp[−a s log(bn/s)]; see Castillo and van der Vaart [30]. Finally, deriving the optimal constant 2 in the minimax sense for posterior contraction is possible for some carefully designed priors (Castillo and Mismer [25]).
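A draw from the prior (4) is easy to simulate directly; the sketch below (not part of the original text, with a Lap(1) slab Γ and arbitrary constants a and b in the complexity prior) generates one such draw.

# Minimal sketch (not from the paper): one draw from the subset-selection prior (4)
# with complexity prior pi_n(s) proportional to exp[-a s log(b n / s)],
# a Lap(1) slab Gamma, and arbitrary constants a = 1, b = 2.
import numpy as np

rng = np.random.default_rng(1)
n, a, b = 100, 1.0, 2.0

s_grid = np.arange(0, n + 1)
log_pi = -a * s_grid * np.log(b * n / np.maximum(s_grid, 1))   # s = 0 gets weight 1
pi = np.exp(log_pi - log_pi.max()); pi /= pi.sum()

s = rng.choice(s_grid, p=pi)                 # dimension s ~ pi_n
S = rng.choice(n, size=s, replace=False)     # S | s uniform over subsets of size s
theta = np.zeros(n)
theta[S] = rng.laplace(size=s)               # theta_i ~ Gamma = Lap(1) on S, 0 elsewhere
print("drawn dimension:", s, "support:", np.sort(S))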

It may be mentioned that the convergence speeds of the entire posterior distribution and those of aspects of it, such as the posterior mean, median or mode, may not match. Like the posterior mode for the Laplace prior, the (empirical or hierarchical Bayes) posterior mean for the hard-spike-and-slab prior may also be located 'relatively far' from the bulk of the posterior mass in terms of the distance d_q(θ, θ′) = ∑_{i=1}^n |θ_i − θ′_i|^q, 0 < q < 1 (Johnstone and Silverman [52], Castillo and van der Vaart [30]). In contrast, the posterior distribution concentrates at the minimax rate for the d_q-loss for any q ∈ (0, 2] (Castillo and van der Vaart [30]). Further, the coordinate-wise median of the MMLE empirical Bayes posterior for the hard-spike-and-slab prior converges at the minimax rate (Johnstone and Silverman [52]), but the plug-in posterior Π_{ŵ}(· | Y_1, . . . , Y_n) converges at a suboptimal rate for certain sparse vectors (Castillo and Mismer [24]).


This is due to an overly large dispersion term in the plug-in posterior when ŵ is slightly above s/n. The problem is avoided in the hierarchical approach through the additional regularization by the beta prior on w.

    3.2 Uncertainty quantification (using normal slab)

If the slab distribution Γ is chosen normal, then the posterior is conjugate conditional on the selection, and hence allows certain explicit expressions, which are beneficial. However, due to the quickly decaying tails of normal distributions, this leads to over-shrinkage of large values, causing suboptimality or even inconsistency. The problem can be avoided by introducing a mean parameter for the slab distribution of each entry and estimating those by empirical Bayes, which is the same as plugging in the observation itself; see Martin and Walker [70], who used a fractional posterior, and Belitser and Nurushev [13], who considered the usual posterior distribution. Belitser and Nurushev [13] obtained the optimal contraction rate following an oracle approach to optimality, and also obtained credible regions with frequentist coverage and adaptive optimal size under an "excessive bias restriction" condition that controls bias by a multiple of variability at a parameter value. Restricting the parameter space in such a manner is essential, as honest coverage with an adaptive-size ball is impossible to obtain by any method, Bayesian or not, in view of impossibility results such as Baraud [8]. Castillo and Szabó [29] considered an empirical Bayes approach using an estimated value of the hyperparameter w in the prior for the sparsity level and provided similar coverage results for adaptive credible sets. They also showed that the excessive bias condition is necessary in a minimax sense.

    3.3 Alternative shrinkage priors

While subset selection priors are particularly appealing because of their natural model selection abilities, to address less strict sparsity (such as weak or strong ℓ_p classes) one may use a soft-spike-and-slab prior and get essentially the same posterior contraction rate, provided that the spike distribution is sufficiently concentrated; for example, the spike-and-slab lasso with spike distribution Lap(λ_0), λ_0 → ∞ with n, and slab distribution Lap(λ_1) for constant λ_1 (Ročková and George [91]). For the horseshoe prior, van der Pas et al. [102, 100] obtained explicit expressions for the posterior mean in terms of degenerate hypergeometric functions, and used them to show posterior contraction at nearly black vectors at the rate s_0 log n, under known and unknown sparsity regimes, respectively; see also Ghosh and Chakraborti [42] for similar results. The optimal posterior contraction rate for the Dirichlet-Laplace prior was obtained by Bhattacharya et al. [17] under a growth condition on the norm ‖θ_0‖. The optimal contraction rate for the spike-and-slab lasso in its hierarchical form was obtained in Ročková [89], up to a side condition on the signals, and by Castillo and Mismer [24] for the empirical Bayes form.


Van der Pas et al. [103] unified results for scale mixtures of normals with a known sparsity parameter, obtaining sufficient and almost necessary conditions for the optimal convergence. Naturally, to detect the sparsity structure using continuous shrinkage priors, a thresholding procedure identifying which entries are essentially zero is needed. Credible ℓ_2-balls for the parameter and marginal credible intervals with asserted frequentist coverage were obtained in van der Pas et al. [101].

    3.4 Multiple testing

The problem of identifying a sparse structure can also be thought of as a multiple testing problem, in which case measures like the false discovery rate (FDR) (Benjamini and Hochberg [14]) may be considered instead of the family-wise error rate usually adopted for Bayesian model selection. FDR control may fail for many priors. In Castillo and Roquain [27], a uniform FDR control is derived for procedures based on both ℓ-values Π(θ_i = 0 | Y_1, . . . , Y_n) and q-values Π(θ_i = 0 | Y_i ≥ y_i), when the prior distribution has suitably heavy slab tails. It is shown that thresholding the ℓ-values at a given level t leads to an FDR going to 0 uniformly at a logarithmic rate. The procedure thresholding the q-values at a level t, on the other hand, controls the FDR uniformly at a level that is a constant multiple of t, and exactly at t asymptotically when the signals are all above the threshold a√(2 log(n/s)), for any a > 1. Early results for continuous shrinkage priors, including the horseshoe, were obtained in Salomond [92]; see also van der Pas et al. [101] for a simulation study. In the Bayesian setting, a very appealing measure is provided by the Bayes risk for testing. With a hard-spike-and-(normal-)slab prior with variance v_1 and known w, the oracle Bayes rule for rejecting H_{0i}: θ_i = 0 can be shown to be the thresholding procedure |Y_i|² > (1 + v_1^{-1})[log(1 + v_1) + 2 log((1 − w)/w)]; see Datta and Ghosh [31] for details.
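The oracle threshold above is straightforward to evaluate; the sketch below (not part of the original text, with arbitrary values of v_1 and w) computes it and compares it with the universal threshold √(2 log n).

# Minimal sketch (not from the paper): the oracle Bayes rejection threshold for
# H_0i: theta_i = 0 under a hard-spike-and-N(0, v1)-slab prior with known w,
# i.e. reject when Y_i^2 > (1 + 1/v1) [log(1 + v1) + 2 log((1 - w)/w)].
import numpy as np

def oracle_threshold(v1, w):
    return (1 + 1 / v1) * (np.log(1 + v1) + 2 * np.log((1 - w) / w))

n = 1000
for w in [0.1, 0.01]:
    t = oracle_threshold(v1=4.0, w=w)
    print(f"w = {w}: reject if |Y_i| > {np.sqrt(t):.2f}"
          f"  (universal threshold sqrt(2 log n) = {np.sqrt(2 * np.log(n)):.2f})")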

    4 High-dimensional regression

A natural generalization of the normal sequence model is the situation where the mean θ_i depends on a (high-dimensional) covariate X_i associated with the observation Y_i, i = 1, . . . , n. The most popular statistical model in this setting is the normal linear regression model Y_i = β^⊤X_i + ε_i, i = 1, . . . , n, where X_i ∈ ℝ^p is a deterministic p-dimensional predictor, β ∈ ℝ^p is the vector of regression coefficients, and the ε_i are i.i.d. N(0, σ²) error variables, as before. When (p⁴ log p)/n → 0, posterior concentration and a Bernstein-von Mises theorem were obtained in Ghosal [39] without assuming any sparsity. We are mainly interested in the high-dimensional setting where p is very large (possibly even much larger than n), and β is sparse with only s of its coordinates non-zero, s ≪ n and n → ∞.


    4.1 Linear regression with hard-spike-and-Laplace slab

Let X_{·,j} be the jth column of X := ((X_{ij})), and consider the norm ‖X‖ = max_j ‖X_{·,j}‖. We consider the prior (4), with the joint density g_S of the non-zero coordinates given S being the product of |S| Laplace densities β ↦ (λ/2) exp(−λ|β|). We set λ = µ‖X‖, for p^{-1} ≤ µ ≤ 2√(log p), and put a prior π_p(s) ∝ c^{-s} p^{-as} on the number of non-zero entries of β. For p > n, clearly β cannot be uniquely recovered from Xβ (even in the noiseless case). However, if β is assumed to be sparse (with only s ≪ n components non-zero) and the sub-matrix of X corresponding to the active predictors has full rank s, it turns out that β can be recovered. Define the compatibility number of a model S ⊂ {1, . . . , p} by φ(S) := inf{‖Xβ‖ |S|^{1/2} / (‖X‖ ‖β_S‖_1) : ‖β_{S^c}‖_1 ≤ 7‖β_S‖_1, β_S ≠ 0}, where β_S = (β_i : i ∈ S), and the ℓ_p-compatibility number in vectors of dimension s by φ_p(s) := inf{‖Xβ‖ |S_β|^{1−p/2} / (‖X‖ ‖β‖_p) : 0 ≠ |S_β| ≤ s}, p = 1, 2, where S_β = {j : β_j ≠ 0} is the support of a sparse vector β. For β_0 the true vector of regression coefficients, S_0 = S_{β_0} and s_0 = |S_0|, we assume that, with p = 1 or 2 depending on the context, min(φ(S_0), φ_p(Cs_0)) ≥ d > 0, where C is a suitably large constant depending on S_0, a, µ. More information about compatibility numbers may be found in van de Geer and Bühlmann [99]. When the entries of X are sampled randomly, the compatibility numbers are often very well behaved with high probability. In the leading example where the X_{ij} are independently sampled from N(0, 1), the compatibility numbers above of models up to dimension a multiple of √(n/log p) are bounded away from zero (Cai and Jiang [20], van de Geer and Muro [98]).
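For a small design, the ℓ_2-compatibility number can be checked by brute force, since for a fixed support it reduces to the smallest singular value of the corresponding submatrix, scaled by ‖X‖; the sketch below (not part of the original text, for a toy Gaussian design) does exactly this.

# Minimal sketch (not from the paper): brute-force evaluation of the
# l2-compatibility number phi_2(s) = min over supports |S| <= s of the smallest
# singular value of X_S divided by ||X|| = max_j ||X_{.,j}||, for a small design.
import numpy as np
from itertools import combinations

rng = np.random.default_rng(0)
n, p, s = 50, 12, 3
X = rng.standard_normal((n, p))
X_norm = np.linalg.norm(X, axis=0).max()

phi2 = min(
    np.linalg.svd(X[:, list(S)], compute_uv=False).min() / X_norm
    for k in range(1, s + 1)
    for S in combinations(range(p), k)
)
print("phi_2(s) for s =", s, ":", round(phi2, 3))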

The following conclusions were derived in Castillo et al. [28]. First, a useful property is that the posterior of β sits on models of dimensionality at most a constant multiple of the true dimension s_0, thereby returning models at least of the same order of sparsity as the true one. Further, we can recover β_0 in terms of the ℓ_1-norm: when the ℓ_1-compatibility numbers are bounded away from 0, E_{β_0} Π(‖β − β_0‖_1 > M s_0 √(log p)/‖X‖ | Y_1, . . . , Y_n) → 0. The corresponding rate in terms of the Euclidean distance is √(s_0 log p)/‖X‖, assuming that the ℓ_2-compatibility numbers are bounded away from 0. The corresponding rate is √(log p)/‖X‖ with respect to the maximum loss, under a stronger condition called mutual coherence. The rate for prediction, i.e., the bound for ‖Xβ − Xβ_0‖, is of the order √(|S_0| log p) under a slightly adapted compatibility condition. The convergence results are uniform over the parameter space under the boundedness conditions on the compatibility numbers, and match those of celebrated estimators in the frequentist literature.

For sparse regression, variable selection, that is, the identification of the non-zero coefficients, is extremely important, because it allows simpler interpretation and a better understanding of relations. To address the issue, Castillo et al. [28] developed a technique based on a distributional approximation under relatively low choices of the parameter λ, known as the small-λ regime. This is similar to the normal approximation to the posterior distribution in the Bernstein-von Mises theorem, but the difference is that in this context the approximating distribution is a mixture of sparse normal distributions over different dimensions. The problem of variable selection can then be transferred to that for the approximate posterior distribution, and it can be shown that no proper superset of the correct model can be selected with appreciable posterior probability.


It is, however, possible to miss signals that are too small in magnitude. If all non-zero signals are assumed to be at least of the order √(|S_{β_0}| log p)/‖X‖, then none of these signals can be missed, because missing any of them introduces an error of the magnitude of the contraction rate. Thus selection consistency follows. Also, in this situation, the distributional approximation reduces to a single normal component with sparsity exactly as in the true coefficient. Ning et al. [79] considered a useful extension of the setting of Castillo et al. [28] by letting each response variable Y be d-dimensional, with d also increasing to infinity (at a sub-polynomial growth with respect to n), having a completely unknown d × d covariance matrix Σ, and with group sparsity in the regression. They put a prior on Σ using the Cholesky decomposition and used a hard-spike-and-slab prior with a multivariate Laplace slab on the groups of regression coefficients selected together, and obtained the squared posterior contraction rate

ε_n² = max{(s_0 log G)/n, (s_0 p_max log n)/n, (d² log n)/n},   (5)

where G stands for the total number of groups of predictors, s_0 for the number of active groups, and p_max for the maximum number of predictors in a group, provided that the regression coefficients and the covariance matrix are appropriately norm-bounded.

    4.2 Linear regression using other priors

Instead of the hard-spike-and-slab prior with a Laplace slab, we may use a normal slab, but with an undetermined mean selected by the empirical Bayes method, as in the sequence model. The advantage is that posterior conjugacy given the selected subset is retained, allowing explicit expressions. Belitser and Ghosal [11] followed the oracle approach of Belitser [10] of quantifying risk locally at every point to define the optimal concentration, and established results analogous to those of Castillo et al. [28]. Moreover, uniformly over the collection of all parameter values satisfying the excessive bias restriction condition, they showed that an appropriately inflated Bayesian credible ball of the optimal size has adequate frequentist coverage simultaneously for any sparsity level. Other references on Bayesian high-dimensional linear regression include Narisetty and He [78], who considered an alternative to the posterior measure, called the skinny Gibbs posterior, that avoids an important computational bottleneck, and established its contraction at the true coefficient at the optimal rate.

An alternative to using independent hard-spike-and-Laplace-slab priors is to use an elliptical Laplace prior on the selected coefficients through a set-selection prior in the spirit of (4), but adjusted for the normalizing constant appearing in the elliptical Laplace distribution. Gao et al. [36] considered this prior for a structured linear model Y = L_X β + ε, where L_X is a linear operator depending on X and ε is a vector of errors assumed to have only sub-Gaussian tails, and obtained minimax contraction rates.


Apart from linear regression, their setup includes stochastic block models, biclustering, linear regression with group sparsity, multi-task learning, dictionary learning, and others.

In spite of the attractive theoretical properties, posterior computation based on two-component priors is computationally intensive. Faster computation may be possible using continuous shrinkage priors (Bhattacharya et al. [16]). Under certain general conditions on the prior concentration near zero, the thickness of the tails, and additional conditions on the eigenvalues of the design matrix, Song and Liang [95] derived posterior contraction and variable selection properties. Their results cover a wide variety of continuous shrinkage priors such as the horseshoe, Dirichlet-Laplace, normal-gamma, and t-mixture priors.

    4.3 Regression beyond linear

Regression beyond the linear setting in the high-dimensional context is of substantial interest. One of the first papers on convergence properties of posterior distributions in the high-dimensional Bayesian setting is Jiang [49], who derived consistency of the posterior distribution in a generalized linear model with the dimension p possibly much larger than the sample size n, needing only log p = o(n^{1−ξ}) for some ξ > 0, but only in terms of the Hellinger distance on the underlying densities. Posterior concentration and a Bernstein-von Mises theorem for the parameter were obtained earlier by Ghosal [38] without assuming any sparsity structure, but under the restricted growth condition (p⁴ log p)/n → 0. Atchadé [4] considered a pseudo-posterior distribution in a general setting, assuming only a certain local expansion of the pseudo-likelihood ratio, and derived posterior contraction rates for hard-spike-and-slab priors by constructing certain test functions. Using Atchadé's test construction and the prior concentration bounds of Song and Liang [95], Wei and Ghosal [111] established posterior contraction for high-dimensional logistic regression using a variety of continuous shrinkage priors.

Bayesian high-dimensional regression in the nonparametric setting has been addressed for additive structures with random covariates. Yang and Tokdar [115] considered Gaussian process priors for each selected component function, aided by a variable selection prior on the set of active predictors, and showed that the minimax rate max{√((s_0 log p)/n), √(s_0) n^{−α/(2α+1)}} is obtained up to a logarithmic factor, where s_0 is the number of active predictors and α is the smoothness level of each component function. Using an orthogonal basis expansion technique, Belitser and Ghosal [11] extended their oracle technique from linear to additive nonparametric regression and obtained the exact minimax rate with a hard-spike-and-normal-slab prior, with the means of the slabs selected by the empirical Bayes technique. They also obtained coverage of adaptive-size balls of functions under an analog of the EBR condition for random covariates, called the ε-EBR condition.


Wei et al. [112], extending the work of Song and Liang [95], obtained the optimal contraction rate and consistency of variable selection for additive nonparametric regression using a B-spline basis expansion prior with a multivariate version of the Dirichlet-Laplace continuous shrinkage prior on the coefficients. The multivariate version is needed to maintain the group selection structure for the coefficient vector corresponding to the same component function. Shen and Ghosal [94] considered the problem of estimating the conditional density of a response variable Y ∈ [0, 1] given a predictor X ∈ [0, 1]^p with log p = O(n^α) for some α < 1, assuming that only s_0 predictors are active, where s_0 is unknown but does not grow. They put a prior on the conditional density p(y|x) through a tensor product of B-splines expansion. First, a subset selection prior with enough control on the effective dimension is used to select the active set, and then priors are put on the lengths of the spline bases. Finally, the coefficient arrays are given independent Dirichlet distributions of the appropriate dimensions. Shen and Ghosal [94] showed the double adaptation property: the posterior contraction rate adapts to both the underlying dimension of the active predictor set and the smoothness α of each component function, i.e., it is n^{−α/(2α+s_0+1)}. In fact, the conditional density can have anisotropic smoothness (i.e., different smoothness in different directions), in which case the harmonic mean α* replaces α. Norets and Pati [81] obtained analogous results for the conditional density of Y ∈ ℝ given a predictor X ∈ ℝ^p using Dirichlet process mixtures of normal priors.

    5 Learning structural relationship among many variables

    5.1 Estimating large covariance matrices

Understanding the dependence among a large collection of variables X = (X_i : i = 1, . . . , p) is an important problem to study. The simplest measure of dependence is given by the pairwise covariances between these variables, which leads to the problem of estimating a large covariance matrix. In the setting of a large collection of variables, it is natural to postulate that most of these variables are pairwise independent, or at least uncorrelated, meaning that most off-diagonal entries are zero, or at least close to zero. This introduces a useful sparsity structure that allows meaningful inference with relatively fewer observations. One particularly prominent structure is (approximate) banding, which means that, when the variables are arranged in some natural order, the pairwise covariances are zero (or decay quickly) if the lag between the corresponding two indexes is larger than some value. For example, an exact banding structure arises in a moving average (MA) process, while in an autoregressive (AR) or ARMA process, the covariances decay exponentially with the lag. Banding or tapering of the sample covariance is often used to estimate such a large covariance matrix. When the sparse structure does not have any special pattern, thresholding methods are often used, but positive definiteness may not be preserved. Another important low-dimensional structure is given by a sparse-plus-low-rank decomposition of a matrix, Σ = D + ΛΛ^⊤, where D is a scalar matrix and Λ is a "thin" matrix, that is, Λ is p × r with r ≪ p.


Moreover, the columns of Λ may themselves be sparse, allowing further dimension reduction. Such a structure arises in the structural linear model X_i = Λη_i + ε_i, where Λ is a p × r sparse factor loading matrix and the η_i ∼ N_r(0, I) are latent factors independent of the error sequence ε_i ∼ N_p(0, σ²I). In this setting, a Bayesian method was proposed by Pati et al. [85]. They considered independent hard-spike-and-slab priors with a normal slab on the entries of Λ and an inverse-gamma prior on σ², along with a Poisson-tailed prior on the rank r. They showed that if the number of non-zero entries in each column of Λ is bounded by s_0, and the true σ² and the number of latent factors are bounded, then the posterior contraction rate √((s_0 log n log p)/n) can be obtained.
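The sparse-plus-low-rank structure is easy to simulate; the sketch below (not part of the original text, with arbitrary sizes and sparsity level) generates data from the structural model X_i = Λη_i + ε_i and checks that the implied covariance is σ²I + ΛΛ^⊤.

# Minimal sketch (not from the paper): data from the sparse factor model
# X_i = Lambda eta_i + eps_i, so that Cov(X_i) = sigma^2 I + Lambda Lambda^T.
import numpy as np

rng = np.random.default_rng(0)
p, r, n, sigma2 = 50, 3, 200, 1.0

Lambda = rng.standard_normal((p, r))
Lambda *= rng.random((p, r)) < 0.2           # keep roughly 20% of loadings (sparse columns)

eta = rng.standard_normal((n, r))            # latent factors
eps = np.sqrt(sigma2) * rng.standard_normal((n, p))
X = eta @ Lambda.T + eps                     # n x p data matrix

Sigma = sigma2 * np.eye(p) + Lambda @ Lambda.T
print("rank of Lambda Lambda^T:", np.linalg.matrix_rank(Lambda @ Lambda.T))
print("max |sample cov - Sigma|:", np.abs(np.cov(X, rowvar=False) - Sigma).max())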

    5.2 Estimating large precision matrices and graphical models

A more intrinsic relation among a set of variables is described by the conditional dependence of a pair when the effects of the remaining variables are eliminated by conditioning on them. In a large collection of variables, most pairs may be assumed to be conditionally independent given the others. It is convenient to describe this structure using a graph, where each variable stands for a node and an edge connects a pair of nodes if and only if the two are conditionally dependent. Such models are therefore popularly called graphical models. An introduction to graphical models is given in the Supplementary Materials part. If X_i and X_j are conditionally independent given X_{−i,−j} := (X_k : k ≠ i, j), then it follows that ω_{ij} = 0, where ((ω_{ij})) = Σ^{-1} is the precision matrix of X, to be denoted by Ω. In a Gaussian graphical model (GGM), i.e., when X is distributed as jointly normal, X ∼ N_p(0, Ω^{-1}), the converse also holds, namely ω_{ij} = 0 implies that X_i and X_j are conditionally independent given X_{−i,−j}. Thus, in a Gaussian graphical model, the problem of learning the intrinsic dependence structure reduces to the problem of estimating the precision matrix Ω under sparsity (i.e., most off-diagonal entries of Ω are 0).
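The correspondence between zero off-diagonal entries of Ω and conditional independence can be seen through partial correlations, which equal −ω_{ij}/√(ω_{ii}ω_{jj}); the sketch below (not part of the original text, on a toy 3 × 3 precision matrix) makes this explicit.

# Minimal sketch (not from the paper): in a Gaussian graphical model the partial
# correlation of (X_i, X_j) given the rest equals -omega_ij / sqrt(omega_ii omega_jj),
# so zero off-diagonal entries of the precision matrix encode conditional independence.
import numpy as np

Omega = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])           # omega_13 = 0: X_1 indep. of X_3 given X_2
d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))              # the (1,3) entry is exactly 0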

    5.2.1 Banding and other special sparsity patterns

A special type of sparsity is given by an approximate banding structure of Ω. Note that this is different from banding of the covariance matrix Σ, as the inverse of a banded matrix is only approximately banded. Among familiar time series, an AR process has a banded precision matrix, while an MA process has an approximate banding structure. The graph corresponding to a banded precision matrix has edge set {(i, j) : |i − j| ≤ k, i ≠ j}, where k is the size of the band; this is always a decomposable graph, allowing the use of structurally rich priors in this context. In particular, a conjugate graphical Wishart (G-Wishart) prior (see the Supplementary Materials part for its definition and properties) can be put on Ω together with a choice of k, which may even be given a prior.


Banerjee and Ghosal [6] showed that, with this prior, the posterior contraction rate in terms of the spectral norm at an approximately banded true Ω with eigenvalues bounded away from 0 and infinity is max{k^{5/2}√((log p)/n), k^{3/2}γ(k)}, where γ(k) is the banding approximation rate, given by the total contribution of all elements outside the band. This, in particular, implies that for a k_0-banded true precision matrix with fixed k_0, the rate is the nearly parametric √((log p)/n). The rate calculation extends to general decomposable graphs, with k standing for the maximal cardinality of a clique, as shown by Xiang et al. [113], who used distributional results on the Cholesky parameter of decomposable G-Wishart matrices. Another proof of the same result using the original technique of Banerjee and Ghosal [6] was given by Banerjee [5].

Lee and Lee [58] considered a class of bandable precision matrices in the high-dimensional setup, using a modified Cholesky decomposition (MCD) approach Ω = (I_p − A)^⊤ D^{-1} (I_p − A), where D is a diagonal matrix and A is lower-triangular with zero diagonal entries. This results in the k-banded Cholesky prior, with the non-zero entries of A getting the improper uniform prior and the diagonal entries of D a truncated polynomial prior. They considered a decision-theoretic framework and showed convergence in terms of the posterior expected loss. Their rate is sharper, with k^{5/2} replaced by k^{3/2}, but the result is not exactly comparable as their class of true precision matrices is different. The two become comparable only for nearly bandable matrices with exponentially decreasing decay functions, in which case the rates are essentially equivalent. Lee and Lin [59] proposed a prior distribution tailored to estimating the bandwidth of large bandable precision matrices. They established strong model selection consistency for the bandwidth parameter along with consistency of the Bayes factors.

A natural fully Bayesian approach to graphical structure selection is to put a prior p(G) on the underlying graph G and then a G-Wishart prior p(Ω|G) on the precision matrix Ω given G. The joint posterior distribution of (Ω, G) is then given by p(Ω, G | X^{(n)}) ∝ p(X^{(n)} | Ω, G) p(Ω|G) p(G), where X^{(n)} stands for n i.i.d. observations on X. Taking a discrete uniform distribution over the space 𝒢 of decomposable graphs on p variables, we get the joint distribution of the data and the graph G, after integrating out Ω, as p(X^{(n)}, G) = ((2π)^{np/2} #𝒢)^{-1} I_G(δ + n, D + nS_n)/I_G(δ, D), where I_G is defined in (17). This immediately leads to an expression for the Bayes factor, on the basis of which model selection may be implemented. Computational techniques for graphical models are discussed in the Supplementary Materials part.

    5.2.2 Models without specific sparsity patterns

When the graph does not have an orderly pattern, the precision matrix may be estimated by allowing sparsity in any of the off-diagonal entries. The most commonly known non-Bayesian method is the graphical lasso (Friedman et al. [35]), which is obtained by maximizing the log-likelihood penalized by an ℓ_1-penalty on the off-diagonal elements. A Bayesian analog, called the Bayesian graphical lasso, was proposed by Wang [106] by imposing independent Laplace priors on the off-diagonal elements and exponential priors on the diagonal elements, subject to a positive definiteness restriction on Ω. As for the Bayesian lasso, a block Gibbs sampling method based on the representation of the Laplace distribution as a scale mixture of normals is used.


Clearly, the Bayesian graphical lasso is motivated by the desire to make the posterior mode the graphical lasso, but the main drawback is that the whole posterior is never supported on sparse precision matrices, and the posterior concentration is suboptimal, as in the case of the Bayesian lasso. The forceful restriction to positive definiteness also leads to an intractable normalizing constant in the prior, although Wang [106, 107] developed a clever computational trick, known as scaling-it-up, to avoid the computation of the normalizing constant in posterior sampling. More details of the computational algorithm are described in the Supplementary Materials part. To alleviate the problem of the lack of sparsity in the posterior, Wang [106] used a thresholding approach proposed in Carvalho et al. [23]. Li et al. [65] proposed using the horseshoe prior on the off-diagonal elements instead of the Laplace, leading to the 'graphical horseshoe' procedure. They also developed an analogous block Gibbs sampling scheme using a variable augmentation technique for half-Cauchy priors proposed in Makalic and Schmidt [69]. The resulting procedure seems to have better posterior concentration in terms of the Kullback-Leibler divergence and smaller bias for the non-zero elements. Other shrinkage priors, based on scale mixtures of uniform distributions, were explored in Wang and Pillai [110].
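For reference, the (non-Bayesian) graphical lasso mentioned at the start of this subsection is available in standard software; the sketch below (not part of the original text, with an arbitrary penalty level) recovers a banded precision matrix from simulated Gaussian data using scikit-learn's implementation, whereas the Bayesian analogs above put priors directly on Ω.

# Minimal sketch (not from the paper): the frequentist graphical lasso of
# Friedman et al., as implemented in scikit-learn, on data simulated from a
# sparse (tridiagonal) true precision matrix.
import numpy as np
from sklearn.covariance import GraphicalLasso

rng = np.random.default_rng(0)
p, n = 5, 500
Omega = np.eye(p) + np.diag(0.4 * np.ones(p - 1), 1) + np.diag(0.4 * np.ones(p - 1), -1)
Sigma = np.linalg.inv(Omega)                  # true covariance
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

fit = GraphicalLasso(alpha=0.1).fit(X)
print("estimated precision (rounded):")
print(np.round(fit.precision_, 2))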

The convergence properties of the posterior distribution of a sparse precision matrix with an arbitrary graphical structure were studied by Banerjee and Ghosal [7]. They considered a prior similar to the Bayesian graphical lasso, except that a large point-mass at 0 is added to the prior distribution of the off-diagonal entries and the total number of non-zero off-diagonal entries is restricted. They derived the posterior contraction rate in terms of the Frobenius norm as √(((p + s_0) log p)/n), where s_0 is the number of true non-zero off-diagonal entries of the precision matrix; this is the same as the convergence rate of the graphical lasso and is optimal in that class. The proof uses the general theory of posterior contraction (Ghosal et al. [40]) by estimating the prior concentration near the truth and the entropy of a suitable subset of precision matrices which receives most of the prior mass. The sparsity built into the prior helps control the effective dimension, and thus the prior concentration and the entropy. Control over the eigenvalues allows linking the Frobenius norm with the Hellinger distance on the normal densities. Their technique of proof extends to soft-spike-and-slab and continuous shrinkage priors, as long as the prior concentration at zero off-diagonal values is sufficiently high.

Niu et al. [80] addressed the problem of graph selection consistency under model misspecification. In the well-specified case, where the true graph is decomposable, strong graph selection consistency holds under certain assumptions on graph size and edges, using a G-Wishart prior. For the misspecified case, where the true graph is non-decomposable, they showed that the posterior distribution of the graph concentrates on the set of minimum triangulations of the true graph.


    5.3 High dimensional discriminant analysis

Consider a classification problem based on a high-dimensional predictor X = (X_1, . . . , X_p) ∼ N_p(µ, Ω^{-1}), where (µ, Ω) = (µ_1, Ω_1) for an observation from the first group and (µ, Ω) = (µ_2, Ω_2) when the observation comes from the second group. Linear and quadratic discriminant analyses are popular model-based classification methods. The oracle Bayes classifier, which uses the true values of µ_1, µ_2 and Ω_1, Ω_2, has the lowest possible misclassification error. Using a Bayesian procedure with priors on µ_1, µ_2 and Ω_1, Ω_2, the performance of the oracle can be nearly matched if the posterior distributions of µ_1, µ_2 and Ω_1, Ω_2 contract near the true values sufficiently fast. In the high-dimensional setting, this is possible only if there is some lower-dimensional structure like sparsity. Du and Ghosal [34] considered a prior on a precision matrix Ω based on a sparse modified Cholesky decomposition Ω = LDL^⊤, where L is sparse and lower-triangular with diagonal entries 1, and D is a diagonal matrix. The greatest benefit of using a Cholesky decomposition in the sparse setting is that the positive definiteness restriction on Ω is automatically maintained. However, a drawback is that the sparsity levels in the rows of Ω depend on the ordering of the coordinates, which may be somewhat arbitrary. To alleviate the problem, with a hard-spike-and-slab prior on the entries of L, the probability of a non-zero entry should decrease with the row index i proportionally to i^{-1/2}. It then follows that, for i and j roughly proportional and both large, the probabilities of a non-zero entry in the ith and jth rows of Ω are roughly equal. Du and Ghosal [34] used soft-spike-and-slab or horseshoe priors, for which an (approximately) stationary sparsity in the off-diagonal elements of Ω can also be maintained by making the slab probability decay like i^{-1/2} for the former, and the local scale parameter decay at this rate for the horseshoe prior. Du and Ghosal [34] showed that the misclassification rate of the Bayes procedure converges to that of the oracle Bayes classifier for a general class of shrinkage priors when p²(log p)/n → 0, provided that the number of non-zero off-diagonal entries in the true Ω is O(p). This is a substantial improvement over the requirement p⁴/n → 0 needed to justify this convergence without assuming sparsity of Ω.
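The row-dependent sparsity described above is easy to simulate; the sketch below (not part of the original text, with arbitrary constants) draws a sparse unit-lower-triangular L whose row-i entries are nonzero with probability proportional to i^{-1/2} and forms Ω = LDL^⊤, which is automatically positive definite.

# Minimal sketch (not from the paper): a draw of Omega = L D L^T with unit-diagonal
# sparse lower-triangular L whose row-i off-diagonal entries are nonzero with
# probability proportional to i^{-1/2}; positive definiteness is automatic.
import numpy as np

rng = np.random.default_rng(0)
p = 30
L = np.eye(p)
for i in range(1, p):
    include = rng.random(i) < min(1.0, 2.0 / np.sqrt(i + 1))   # decaying inclusion prob.
    L[i, :i] = include * rng.normal(0.0, 0.5, size=i)
D = np.diag(rng.gamma(2.0, 1.0, size=p))
Omega = L @ D @ L.T

print("smallest eigenvalue of Omega:", np.linalg.eigvalsh(Omega).min())
print("proportion of zero entries below the diagonal of L:",
      np.mean(L[np.tril_indices(p, -1)] == 0))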

    5.4 Exponential family with a graphical structure

The representation of various graphical models as exponential families facilitates the inferential problem of graph selection and structure learning. Suppose X = (X_1, . . . , X_p) is a p-dimensional random vector and G = (V, E) is an undirected graph. The exponential family graphical model (with respect to G) has a joint density (with respect to a fixed dominating measure µ, like the Lebesgue or the counting measure) of the form

p(X; θ) = exp{∑_{r∈V} θ_r B(X_r) + ∑_{(r,t)∈E} θ_{rt} B(X_r)B(X_t) + ∑_{r∈V} C(X_r) − A(θ)},   (6)


for sufficient statistics B(·), base measure C(·), and log-normalization constant A(θ). Note that for the choice B(X) = X/σ, C(X) = −X²/(2σ²), we get the Gaussian graphical model

p(X; θ) ∝ exp{∑_{r∈V} (θ_r/σ_r) X_r + ∑_{(r,t)∈E} (θ_{rt}/(σ_r σ_t)) X_r X_t − ∑_{r∈V} X_r²/(2σ_r²)};   (7)

here the θ_{rt} are the elements of the corresponding precision matrix.

An alternative formulation using a matrix form, called an exponential trace-class model, was recently proposed by Zhuang et al. [116]. The family of densities for X is indexed by a q × q matrix M, and is given by the expression f(X|M) = exp[−⟨M, T(X)⟩ + ξ(X) − γ(M)], where ⟨·,·⟩ stands for the trace inner product of matrices, T : ℝ^p → ℝ^{q×q}, where q may or may not be the same as p, and γ(M) is the normalizer for the corresponding exponential family, given by γ(M) = log ∫ exp[−⟨M, T(X)⟩ + ξ(X)] dµ(X). The Gaussian graphical model is included as the case q = p, T(X) = XX^⊤. In this case, M agrees with the precision matrix and is positive definite. The most interesting feature of an exponential trace-class model is that the conditional independence of X_i and X_j given the others is exactly characterized by M_{ij} = 0. In the high-dimensional setting, it is sensible to impose sparsity on the off-diagonal entries. In the Bayesian setting, this can be addressed by spike-and-slab or continuous shrinkage-type priors.

    5.4.1 Ising model

For Bernoulli random variables defined over the nodes of the graph G, the exponential family representation of the graphical model with the choice B(X) = X and the counting measure as the base measure gives the distribution

p(X; θ) = exp{∑_{r∈V} θ_r X_r + ∑_{(r,t)∈E} θ_{rt} X_r X_t − A(θ)},   (8)

which is popularly known as the Ising model. The conditional distribution of each node given the others is a logistic regression model. The model may also be represented as an exponential trace-class model. Graph selection can be achieved via the neighborhood-based variable selection method (Meinshausen and Bühlmann [71]) using the ℓ_1-penalty. In the high-dimensional setting, Barber and Drton [9] used the Bayesian Information Criterion (BIC) in the logistic neighborhood selection approach. A sparsity-based conditional MLE approach was proposed by Yang et al. [114]. In the Bayesian setting, continuous shrinkage priors or spike-and-slab priors may be used for inference on the graphical structure. Variational methods for inference on the parameters and evaluation of the log-partition function A(θ) were discussed in Wainwright and Jordan [105].
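The logistic form of the conditional distributions is immediate from (8); the sketch below (not part of the original text, for a toy three-node graph with arbitrary parameter values) computes P(X_r = 1 | X_{−r}) as a sigmoid of θ_r plus the weighted sum over neighbors, which is exactly what neighborhood-based selection exploits.

# Minimal sketch (not from the paper): in the Ising model (8) with X_r in {0, 1},
# the conditional probability of X_r = 1 given the other nodes is
# sigmoid(theta_r + sum_t theta_rt X_t), i.e. a logistic regression on the neighbors.
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

theta = np.array([0.2, -0.1, 0.0])                 # node parameters
Theta = np.array([[0.0, 1.0, 0.0],                 # symmetric edge parameters;
                  [1.0, 0.0, -0.8],                # a zero entry means "no edge",
                  [0.0, -0.8, 0.0]])               # and the diagonal is zero
x = np.array([1, 0, 1])                            # current configuration

for r in range(3):
    p_r = sigmoid(theta[r] + Theta[r] @ x)         # self-term vanishes (zero diagonal)
    print(f"P(X_{r} = 1 | rest) = {p_r:.3f}")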


    5.4.2 Other exponential family graphical models

Count data are often obtained from modern next-generation genomic studies. To study relations in count data, the variables can be modeled via a Poisson distribution on the nodes, giving the Poisson graphical model

p(X; θ) = exp{∑_{r∈V} (θ_r X_r − log(X_r!)) + ∑_{(r,t)∈E} θ_{rt} X_r X_t − A(θ)}.

Integrability forces θ_{rt} ≤ 0, so that only negative conditional relationships between variables are captured. An alternative formulation of a Poisson graphical model as an exponential trace-class model can accommodate positive interactions between variables; in this formulation the marginal distributions are Poisson although the conditional distributions are not (Zhuang et al. [116]). Other useful exponential families include the multinomial Ising model.

    5.5 Nonparanormal graphical model

A semiparametric extension of GGMs that can model continuous variables located at the nodes of a graph is given by the nonparanormal model (Liu et al. [66]), where it is assumed that the vector of variables X = (X_1, . . . , X_p) ∈ [0,1]^p reduces to a multivariate normal vector through p monotone increasing transformations: for some monotone functions f_1, . . . , f_p, f(X) := (f_1(X_1), . . . , f_p(X_p)) ∼ N_p(µ, Σ) for some µ ∈ ℝ^p and positive definite matrix Σ. The model is not identifiable, and one needs to fix the location and scale of the functions or of the distribution. Liu et al. [66] developed a two-step estimation process in which the functions are estimated first, using a truncated empirical distribution function through the relations f_j(x) = Φ^{-1}(F_j(x)), where F_j stands for the cumulative distribution function of X_j. Two different Bayesian approaches were considered by Mulgrave and Ghosal [77, 76, 75]: one based on imposing a prior on the underlying monotone transforms, in the first two papers, and one based on a rank likelihood, which eliminates the role of the transformations, in the third. In the first approach, a finite random series based on a B-spline basis expansion is used to construct a prior on the transformation. The advantage of using a B-spline basis is that, in order to maintain monotonicity, the prior on the coefficients only needs to be made increasing. A multivariate normal prior truncated to the cone of ordered values can be conveniently used. However, to ensure identifiability, the two constraints f(0) = 1/2 and f(3/4) − f(1/4) = 1 are imposed, which translate to linear constraints, and hence the prior remains multivariate normal before imposing the order restriction. Samples from the posterior distribution of the ordered multivariate normal coefficients can be efficiently obtained using exact Hamiltonian Monte Carlo (Pakman and Paninski [82]). In Mulgrave and Ghosal [76], a normal soft-spike-and-slab prior was put on the off-diagonal elements of Ω = Σ^{-1}, and the scaling-it-up technique (Wang [107]) was utilized to avoid the evaluation of the intractable normalizing constant arising from the positive definiteness restriction.
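The two-step frequentist estimate of Liu et al. is a useful point of reference; the sketch below (not part of the original text, with an assumed form of the truncation level) estimates the transformations f_j = Φ^{-1} ∘ F_j with a Winsorized empirical distribution function and then the correlation matrix of the transformed data, whereas the Bayesian approaches above put priors on the transformations or use the rank likelihood.

# Minimal sketch (not from the paper): a two-step nonparanormal estimate in the
# spirit of Liu et al., f_j(x) = Phi^{-1}(F_j(x)) with a Winsorized empirical cdf,
# followed by the sample correlation of the transformed data.
import numpy as np
from scipy.stats import norm, rankdata

rng = np.random.default_rng(0)
n, p = 300, 4
Z = rng.multivariate_normal(np.zeros(p), 0.5 * np.eye(p) + 0.5, size=n)
X = np.exp(Z)                                  # monotone transforms of a Gaussian vector

delta = 1.0 / (4 * n**0.25 * np.sqrt(np.pi * np.log(n)))   # an assumed truncation level
F = rankdata(X, axis=0) / (n + 1)              # empirical cdf evaluated at the data
F = np.clip(F, delta, 1 - delta)               # Winsorize to avoid infinite values
f_X = norm.ppf(F)                              # estimated transformed data

print("estimated latent correlation matrix:")
print(np.round(np.corrcoef(f_X, rowvar=False), 2))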

  • Bayesian inference in high-dimensional models 19

    constant arising from the positive definiteness restriction. They also proved a con-sistency result for the underlying transformation in terms of a pseudo-metric givenby the uniform distance on a compact subinterval of (0,1). The approach in Mul-grave and Ghosal [76] used a connection with the problem of regression of a com-ponent given the following ones, where these partial regression coefficients form thebasis of a Cholesky decomposition for the precision matrix. Sparsity in these coef-ficients was introduced through continuous shrinkage priors by increasing sparsitywith the index as in Subsection 5.3. The resulting procedure is considerably fasterthan the direct approach of Mulgrave and Ghosal [77]. Mulgrave and Ghosal [76]also considered the mean-field variational approach for even faster computation. InMulgrave and Ghosal [75], the posterior distribution is altered by replacing condi-tioning on the data by conditioning on the ranks, which are the maximal invariantsunder all monotone transformations. The benefit is that then the likelihood is freeof the transformations, eliminating the need to assign prior distributions on these,and may be considered as fixed, even though those are unknown. The absence of thenonparametric part of the parameter thus makes the procedure a lot more efficientand robust. The posterior can be computed by Gibbs sampling using a simple data-augmentation technique with the transformed variables. Also the arbitrary centeringand scaling means that only a scale-invariant of Ω is identifiable. Mulgrave andGhosal [75] also derived a consistency result in a fixed dimensional setting usingDoob’s posterior consistency theorem (see Ghosal and van der Vaart [41]).
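The frequentist two-step strategy of Liu et al. [66] described above can be sketched as follows; the graphical lasso in the second step and the rank-based version of the empirical distribution function are illustrative stand-ins (in particular, dividing the ranks by n + 1 replaces the Winsorized empirical estimator used in the paper).

```python
# Sketch: two-step nonparanormal estimation -- Gaussianize each margin by
# f_j = Phi^{-1}(F_j), then fit a sparse Gaussian graphical model to the transforms.
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLasso

def nonparanormal_transform(X):
    """Empirical-CDF Gaussianization; ranks / (n + 1) stand in for the Winsorized estimator."""
    n = X.shape[0]
    U = rankdata(X, axis=0) / (n + 1.0)        # strictly inside (0, 1)
    return norm.ppf(U)

rng = np.random.default_rng(1)
Z = rng.multivariate_normal(np.zeros(3), [[1, .5, 0], [.5, 1, 0], [0, 0, 1]], size=500)
X = np.column_stack([np.exp(Z[:, 0]), Z[:, 1] ** 3, Z[:, 2]])   # monotone distortions of a GGM

Omega_hat = GraphicalLasso(alpha=0.05).fit(nonparanormal_transform(X)).precision_
print(np.round(Omega_hat, 2))                  # entries (1,3) and (2,3) should be near zero
```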

    6 Estimating a long vector smoothly varying over a graph

Let us revisit the normal sequence model X_i = θ_i + ε_i, i = 1, . . . , n, where ε_i ∼ N(0, σ²) as in Section 2, but suppose that the sparsity assumption on the vector of means f = (θ_1, . . . , θ_n) is not appropriate. Instead, the values are assumed to 'smoothly vary over locations' in some appropriate sense. The simplest situation is that these values lie on a linear lattice, but more generally, the positions may stand for the nodes of a graph. Smooth variation of θ_i with respect to i over a graph can be mathematically quantified through the graph Laplacian L = D − A, where D is the diagonal matrix of node degrees and A stands for the adjacency matrix. More precisely, a vector f = (θ_1, . . . , θ_n) over a graph of size n with graph Laplacian L is said to belong to a Hölder ball of β-smoothness of radius Q if ⟨f, (I + (n^{2/r} L)^β) f⟩ ≤ Q², where r stands for the "dimension" of the graph, defined through the growth of the eigenvalues of the graph Laplacian L: the i-th eigenvalue grows like (i/n)^{2/r}. For lattice graphs, the dimension agrees with the physical dimension, but in general it can be a fractional number. In this setting, Kirichenko and van Zanten [56] showed that the minimax rate of recovery is n^{−β/(2β+r)}, which depends on the smoothness β as well as the dimension r of the graph. A Bayesian procedure for this problem was developed by Kirichenko and van Zanten [55] using a multivariate normal prior: f ∼ N_n(0, (n/c)^{(2α+r)/r} (L + n^{−2}I)^{−(2α+r)/r}), where c is given the standard exponential prior. Using the van der Vaart-van Zanten posterior contraction rate theory for Gaussian process priors (Ghosal and van der Vaart [41], Chapter 11), they showed that the posterior contracts at the optimal rate n^{−β/(2β+r)} up to a logarithmic factor, whenever the true smoothness satisfies β ≤ α + r/2. An analogous result also holds for binary regression. Kirichenko and van Zanten [55] also showed that this limited-range adaptation can be strengthened to full-range adaptation using an exponential-of-Laplacian covariance kernel.
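To make the smoothness functional above concrete, the following sketch evaluates ⟨f, (I + (n^{2/r}L)^β)f⟩ on a path graph for a slowly varying and a rapidly oscillating vector; the prior described above then uses a covariance built from a negative power of L + n^{−2}I. The path graph, β = 1 and the test vectors are illustrative choices, and the code handles only integer β for simplicity.

```python
# Sketch: the graph smoothness functional <f, (I + (n^{2/r} L)^beta) f> on a path graph.
import numpy as np

n, r, beta = 50, 1.0, 1                                 # path graph, so "dimension" r = 1
A = np.zeros((n, n))
for i in range(n - 1):
    A[i, i + 1] = A[i + 1, i] = 1.0                     # adjacency matrix of the path graph
L = np.diag(A.sum(axis=1)) - A                          # graph Laplacian L = D - A

def smoothness(f):
    M = np.eye(n) + np.linalg.matrix_power(n ** (2 / r) * L, beta)   # integer beta only
    return float(f @ M @ f)

t = np.arange(n) / n
f_smooth = np.sin(2 * np.pi * t)                        # slowly varying over the path
f_rough = np.cos(np.pi * np.arange(n))                  # alternating +1 / -1
print(round(smoothness(f_smooth), 1), "<<", round(smoothness(f_rough), 1))
```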

    7 Matrix models

A number of high-dimensional problems involve unknown matrices instead of vectors. Some examples include multiple linear regression with group sparsity, multi-task learning, matrix completion, the stochastic block model and biclustering.

    7.1 Generic results in structured linear models

Gao et al. [36] derived results on posterior dimensionality and contraction rates for many models simultaneously in a general 'signal plus noise' model in which the signal is a vector or a matrix having some specific structure. One example is multiple linear regression with group sparsity, where one observes Y = XB + W, with X an n × p matrix, B a p × m matrix, and the columns of B assumed to share a common support of size s. They obtained the minimax rate s(m + log(ep/s)) for the prediction problem of estimating XB in the Frobenius norm. This model is a special case of so-called multi-task learning problems, where the columns of B share some specific structure. For instance, instead of assuming a joint sparsity pattern among columns, one may assume that the columns of B can only be chosen from a given list of k possible columns, with k typically much smaller than m. The posterior rate pk + m log k was derived for prediction in Gao et al. [36] (it is optimal as soon as k < pm). Dictionary learning, on the other hand, assumes that the signal matrix of size n × d is θ = QZ, for Q an n × p dictionary matrix and Z a discrete matrix of size p × d with sparse columns. Gao et al. [36] derived the adaptive rate np + ds log(ep/s) for estimating θ, which is optimal if smaller than nd. Kim and Gao [54] proposed an EM-type algorithm to compute aspects of the posterior corresponding to the prior considered in Gao et al. [36], recovering as a special case the EMVS algorithm of Ročková and George [90] for linear regression. Belitser and Nurushev [12] also followed this general framework, considering a re-centered normal prior extending the approach in Belitser and Nurushev [13] and Belitser and Ghosal [11], and derived both local oracle posterior rates and optimal-size credible balls with guaranteed coverage.
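A convex, non-Bayesian analogue of the group-sparsity structure just described (columns of B sharing a common row support) is the multi-task lasso, sketched below; it is shown only to make the structure concrete and is not the prior-based procedure of Gao et al. [36]. The dimensions and penalty level are arbitrary.

```python
# Sketch: multiple regression Y = X B + W where the columns of B share a common
# row support; MultiTaskLasso is a convex analogue of this group-sparsity structure.
import numpy as np
from sklearn.linear_model import MultiTaskLasso

rng = np.random.default_rng(3)
n, p, m, s = 100, 50, 5, 4
X = rng.standard_normal((n, p))
B = np.zeros((p, m))
B[:s, :] = rng.standard_normal((s, m))                # first s rows active in every task
Y = X @ B + 0.5 * rng.standard_normal((n, m))

B_hat = MultiTaskLasso(alpha=0.2).fit(X, Y).coef_.T   # sklearn stores coef_ as (m, p)
support = np.where(np.any(np.abs(B_hat) > 1e-8, axis=1))[0]
print("estimated shared support:", support)
```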


    7.2 Stochastic block model and community detection

In a stochastic block model (SBM) with k groups, one observes Y = θ + W with θ_{ij} = Q_{z(i)z(j)}, θ ∈ [0,1]^{n×n}, for some k × k matrix Q of edge probabilities, a labeling map z ∈ {1, . . . , k}^n and W a centered Bernoulli noise. Gao et al. [36] treated the question of estimation of θ in the Frobenius norm within their general framework, getting the adaptive minimax rate k² + n log k with unknown k. A similar result and coverage of credible sets were derived by Belitser and Nurushev [12]. Pati and Bhattacharya [84] obtained the near-optimal rate k² log(n/k) + n log k for known k, using independent uniform priors on the coordinates of Q and Dirichlet prior probabilities for the label proportions. The biclustering model can be viewed as an asymmetric extension of the SBM, where θ is an n × m rectangular matrix and the rows and columns of Q have their own labelings, with k and l groups respectively. This model was handled in Gao et al. [36] and Belitser and Nurushev [12] using similar techniques as before, with an optimal adaptive posterior rate kl + n log k + m log l.

Another popular question is that of the recovery of the labels z(·) (possibly up to a permutation), also known as community detection. This is a more ambitious task compared to the estimation of θ, and it can only be guaranteed either asymptotically (Bickel et al. [18]), or non-asymptotically by imposing some separation between classes (Lei and Zhu [60]); see Castillo and Orbanz [26] for a discussion of uniformity issues. Van der Pas and van der Vaart [104] showed that the posterior mode corresponding to a beta prior on edge probabilities and Dirichlet prior probabilities for the label proportions asymptotically recovers the labels when the number of groups k is known, provided that the mean degree of the graph is at least of order log² n. Their approach relies on studying the Bayesian modularity, that is, the marginal likelihood of the class labels given the data, when the edge probabilities are integrated out.
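The collapsed marginal likelihood of a label assignment, the 'Bayesian modularity' mentioned above, can be written in closed form when the entries of Q receive independent Beta priors. The sketch below computes it for a toy SBM; the Beta(1,1) hyperparameters and the implicit uniform prior on labels (instead of the Dirichlet weights of the cited work) are simplifying assumptions.

```python
# Sketch: collapsed log marginal likelihood of a label vector z in an SBM when the
# block edge probabilities Q_ab get independent Beta(a0, b0) priors.
import numpy as np
from itertools import combinations_with_replacement
from scipy.special import betaln

def log_marginal_labels(Y, z, k, a0=1.0, b0=1.0):
    """Y: symmetric 0/1 adjacency matrix with zero diagonal; z: labels in {0, ..., k-1}."""
    logml = 0.0
    for a, b in combinations_with_replacement(range(k), 2):
        ia, ib = np.where(z == a)[0], np.where(z == b)[0]
        if a == b:
            pairs = len(ia) * (len(ia) - 1) // 2
            edges = Y[np.ix_(ia, ia)].sum() // 2
        else:
            pairs = len(ia) * len(ib)
            edges = Y[np.ix_(ia, ib)].sum()
        logml += betaln(edges + a0, pairs - edges + b0) - betaln(a0, b0)
    return logml

rng = np.random.default_rng(4)
n, k = 30, 2
z_true = np.repeat([0, 1], n // 2)
P = np.where(z_true[:, None] == z_true[None, :], 0.6, 0.1)  # within/between block probabilities
Y = np.triu((rng.random((n, n)) < P).astype(int), 1)
Y = Y + Y.T
print("true labels:    ", round(log_marginal_labels(Y, z_true, k), 2))
print("shuffled labels:", round(log_marginal_labels(Y, rng.permutation(z_true), k), 2))
```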

    7.3 Matrix completion problem

In the matrix completion model (also known as the "Netflix problem"), one observes n noisy entries of an unknown m × p matrix M, typically assumed to be of low rank r (or well-approximated by a low-rank matrix). The entries are sampled at random from a distribution on entry indices. The rate (m + p)r log(m ∧ p), minimax optimal up to the logarithmic term, is obtained by Mai and Alquier [68] in the PAC-Bayesian ('PAC' stands for Probably Approximately Correct) setting for a prior sitting close to small-rank matrices and a class of tempered posteriors (see the discussion section below for more on this topic).
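A small sketch of the observation model just described (and only of the model, not of the PAC-Bayesian estimator of Mai and Alquier [68]) is the following; all dimensions and the noise level are arbitrary.

```python
# Sketch: the matrix completion observation model -- a low-rank m x p matrix of which
# n randomly chosen entries are observed with additive Gaussian noise.
import numpy as np

rng = np.random.default_rng(5)
m, p, r, n_obs, sigma = 40, 30, 2, 300, 0.1
M = rng.standard_normal((m, r)) @ rng.standard_normal((r, p))     # rank-r signal matrix

idx = rng.choice(m * p, size=n_obs, replace=False)                # uniformly sampled entry indices
rows, cols = np.unravel_index(idx, (m, p))
y = M[rows, cols] + sigma * rng.standard_normal(n_obs)            # noisy observed entries
print(f"observed {n_obs} of {m * p} entries; y[0] = {y[0]:.3f} vs M = {M[rows[0], cols[0]]:.3f}")
```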


    8 Discussion and possible future research directions

A popular approach in machine learning is to use a pseudo- (or generalized) Bayes measure obtained by replacing the likelihood in the standard Bayes posterior distribution by a small power (also called the 'temperature') of it. This can attenuate the influence of the data, providing more robustness in misspecified settings. The log-likelihood may also be replaced by an empirical risk, depending on the loss function of interest. PAC-Bayesian inequalities refer to the theory delivering in-probability bounds for such pseudo-posterior measures. Seminal contributions to PAC-Bayes theory include those by McAllester, Catoni, and T. Zhang. Theory and algorithms for high-dimensional data were considered, among others, by Audibert, Alquier, Dalalyan and Tsybakov, Martin and Walker, and Grünwald and co-authors. We refer to the survey by Guedj [46] for a recent overview of the field. We note that although such methods are sometimes referred to as generalized Bayes, they often assume that the temperature parameter is strictly smaller than 1 (or going to 0 in some cases), which often separates them from results on original Bayes posteriors. Obtaining a unifying theory for different ranges of the temperature parameter, including the original Bayes posterior, is an interesting direction for future work.
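A toy conjugate example may help fix ideas about tempering: for normal observations with known variance and a normal prior, raising the likelihood to a power η keeps the posterior normal but multiplies the data precision by η. This generic illustration is not tied to any specific construction in the literature cited above.

```python
# Sketch: a tempered ("fractional") posterior in a toy conjugate model.
# Data: x_i ~ N(theta, sigma^2) with sigma known; prior theta ~ N(0, tau^2).
# Raising the likelihood to the power eta gives posterior precision eta*n/sigma^2 + 1/tau^2.
import numpy as np

def tempered_posterior(x, sigma, tau, eta):
    n = len(x)
    prec = eta * n / sigma**2 + 1.0 / tau**2
    mean = (eta * np.sum(x) / sigma**2) / prec
    return mean, 1.0 / prec                          # posterior mean and variance

rng = np.random.default_rng(6)
x = rng.normal(1.0, 1.0, size=50)
for eta in (1.0, 0.5, 0.1):                          # eta = 1 recovers ordinary Bayes
    m, v = tempered_posterior(x, sigma=1.0, tau=10.0, eta=eta)
    print(f"eta = {eta}: posterior mean {m:.3f}, variance {v:.4f}")
```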

    Appendix

    8.1 Undirected Graphs

An undirected graph G = (V, E) consists of a non-empty set of vertices or nodes V = {1, . . . , p} and a set of edges E ⊆ {(i, j) ∈ V × V : i < j}. Nodes which are connected by an edge are termed adjacent nodes. If all the nodes of a graph are adjacent to each other, we have a complete graph. A subgraph G′ = (V′, E′) of G, denoted by G′ ⊆ G, is such that V′ ⊆ V and E′ ⊆ E. A subset V′ ⊆ V induces the subgraph G_{V′} = (V′, (V′ × V′) ∩ E). If an induced subgraph G_{V′} is complete and is not contained in any other complete subgraph of G, that is, if G_{V′} is a maximal complete subgraph of G, then V′ ⊆ V is called a clique of G.

A sequence of adjacent nodes (v_0, . . . , v_k) in G, that is, with (v_{i−1}, v_i) ∈ E for i = 1, . . . , k, forms a path of length k. If v_0 = v_k, that is, if the end-points of the path are identical, then we have a k-cycle. A chord of a cycle is a pair of nodes in that cycle which are not consecutive nodes in the path, but are adjacent in G. For subgraphs G_1, G_2 and G_3 of G, if every path from a node v_1 ∈ G_1 to a node v_2 ∈ G_2 contains a node in G_3, then G_3 is said to separate G_1 and G_2 and is called a separator of G.

A major portion of our discussion will focus on decomposable graphs. A graph G is said to be decomposable if every cycle in G of length greater than or equal to four has a chord. Decomposability of a graph can also be characterized by the existence of a perfect ordering of the cliques. An ordering of the cliques (C_1, . . . , C_k) ∈ 𝒞 and separators (S_2, . . . , S_k) ∈ 𝒮 is perfect if it satisfies the running intersection property, that is, for every j ≥ 2 there exists an i < j such that S_j = H_{j−1} ∩ C_j ⊆ C_i, where H_{j−1} = C_1 ∪ · · · ∪ C_{j−1}. For more details on decomposability and perfect orderings of cliques, we refer the readers to Lauritzen [57].
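Decomposability (chordality) and cliques are easy to inspect computationally; the sketch below uses the networkx library on a four-cycle, which is not chordal until a chord is added.

```python
# Sketch: checking decomposability (chordality) and listing maximal cliques with networkx.
import networkx as nx

C4 = nx.cycle_graph(4)                            # nodes 0-1-2-3-0, no chord
print(nx.is_chordal(C4))                          # False: a 4-cycle without a chord

G = C4.copy()
G.add_edge(0, 2)                                  # add a chord
print(nx.is_chordal(G))                           # True: every cycle of length >= 4 has a chord
print(sorted(map(sorted, nx.find_cliques(G))))    # maximal cliques, e.g. [[0, 1, 2], [0, 2, 3]]
```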

    8.2 Graphical models

An undirected graph G equipped with a probability distribution P on variables indexed by the node set V is termed an undirected graphical model. We shall particularly focus on Gaussian data, leading to the concept of Gaussian graphical models. To define the same, we first discuss the Markov property, or conditional independence property, of a p-dimensional random variable. A p-dimensional random variable X = (X_1, . . . , X_p) is said to be Markov with respect to an undirected graph G if the components X_i and X_j are conditionally independent given the rest of the variables whenever (i, j) ∉ E. Thus, an undirected graph G with a Gaussian distribution on the components of the above random variable X corresponding to the nodes V = {1, . . . , p} is called a Gaussian graphical model (GGM). Without loss of generality, we can assume the mean of the Gaussian distribution to be zero. For a GGM, there exists a correspondence between the edge set E and the inverse covariance matrix (or precision matrix) Ω = Σ^{-1} of X, owing to the Markov property. To be precise, whenever (i, j) ∉ E, the (i, j)-th element of Ω, given by ω_{ij}, is exactly zero, and vice versa. This leads us to consider the cone of positive definite matrices

P_G = \{\Omega \in M_p^+ : \omega_{ij} = 0, \ (i,j) \notin E\}, \quad (9)

    which defines the parameter space of the precision matrix for a Gaussian distributiondefined over a graph G. For a decomposable graph G, the parameter space for thecovariance matrix Σ = ((σi j)) is defined by the set QG of partially positive definitematrices, which is a subset of IG, the set of incomplete matrices with missing entriesσi j whenever (i, j) /∈ E. Then,

Q_G = \{B \in I_G : B_{C_i} > 0, \ i = 1, \ldots, k\}. \quad (10)

Gröne et al. [45] showed that there is a bijection between the spaces P_G and Q_G. To be precise, for Ω ∈ P_G, we can define Q_G as the parameter space for the GGM where Σ = κ(Ω^{-1}), and κ : M_p^+ → I_G is the projection of M_p^+ into I_G. Thus, a GGM Markov with respect to a graph G is given by the family of distributions

N_G = \{N_p(0,\Sigma), \ \Sigma \in Q_G\} = \{N_p(0,\Omega^{-1}), \ \Omega \in P_G\}. \quad (11)
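The correspondence between missing edges and zeros of Ω in (9)-(11) can be checked numerically: partial correlations computed from Ω vanish exactly at the absent edge, while the corresponding entry of Σ = Ω^{-1} generally does not. The particular Ω below is an arbitrary positive definite example.

```python
# Sketch: zeros of the precision matrix correspond to missing edges (conditional
# independence); partial correlations are -omega_ij / sqrt(omega_ii * omega_jj).
import numpy as np

Omega = np.array([[2.0, 0.6, 0.0],
                  [0.6, 2.0, 0.5],
                  [0.0, 0.5, 2.0]])            # graph: edges 1-2 and 2-3 present, 1-3 absent

d = np.sqrt(np.diag(Omega))
partial_corr = -Omega / np.outer(d, d)
np.fill_diagonal(partial_corr, 1.0)
print(np.round(partial_corr, 3))               # the (1,3) entry is exactly 0

Sigma = np.linalg.inv(Omega)
print(np.round(Sigma, 3))                      # the (1,3) entry of Sigma is not 0 in general
```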


    8.3 Hyper Inverse-Wishart and G-Wishart priors

The inverse-Wishart distribution IW_p(δ, D) with degrees of freedom δ and a p-dimensional positive definite fixed scale matrix D is the conjugate prior for the covariance matrix Σ in the case of a complete graph. We denote Σ ∼ IW_p(δ, D), having density

p(\Sigma \mid \delta, D) = c(\delta, p)\{\det(\Sigma)\}^{-(\delta+2p)/2}\exp\{-\mathrm{tr}(\Sigma^{-1}D)/2\}, \quad (12)

where c(\delta, p) = \{\det(D)/2\}^{(\delta+p-1)/2}\,\Gamma_p^{-1}\{(\delta+p-1)/2\} is the normalizing constant and \Gamma_p(t) = \pi^{p(p-1)/4}\prod_{j=1}^{p}\Gamma(t-(j-1)/2) is the p-variate gamma function.

For decomposable graphs, recall that a perfect ordering of the cliques always exists. In that case, we can write the density as

p(x \mid \Sigma, G) \propto \frac{\prod_{C\in\mathcal{C}} p_C(x_C \mid \Sigma_C)}{\prod_{S\in\mathcal{S}} p_S(x_S \mid \Sigma_S)}, \quad (13)

which is the Markov ratio of the marginal Gaussian distributions corresponding to the cliques and separators of G. Dawid and Lauritzen [32] came up with a generalization of the inverse-Wishart, called the hyper inverse-Wishart, which is conjugate to the Gaussian distribution Markov with respect to the graph G. The form of the prior distribution depends on the clique-marginal covariance matrices Σ_C, such that Σ_C ∼ IW_{|C|}(δ, D_C), where D_C is the submatrix of the scale matrix D induced by the clique C ∈ 𝒞. The hyper inverse-Wishart is thus constructed on the parameter space Q_G with density given by

p_G(\Sigma \mid \delta, D) = \frac{\prod_{C\in\mathcal{C}} p(\Sigma_C \mid \delta, D_C)}{\prod_{S\in\mathcal{S}} p(\Sigma_S \mid \delta, D_S)}, \quad (14)

where p(·) refers to the density of the respective inverse-Wishart distribution.

For a complete graph, the inverse-Wishart prior on the covariance matrix induces a conjugate prior for the precision matrix Ω, namely, the Wishart distribution, with density

p(\Omega \mid \delta, D) = c(\delta, p)\{\det(\Omega)\}^{(\delta-2)/2}\exp\{-\mathrm{tr}(\Omega D)/2\}, \quad (15)

with the same normalizing constant as in the inverse-Wishart. Along similar lines, for a decomposable graphical model, the hyper inverse-Wishart prior induces a prior distribution on the precision matrix Ω with density

p_G(\Omega \mid \delta, D) = I_G(\delta, D)^{-1}\{\det(\Omega)\}^{(\delta-2)/2}\exp\{-\mathrm{tr}(D\Omega)/2\}, \quad (16)

where I_G(δ, D) is the normalizing constant given by

I_G(\delta, D) = \frac{\prod_{S\in\mathcal{S}} \{\det(D_S)\}^{(\delta+|S|-1)/2}\,\Gamma_{|S|}\big((\delta+|S|-1)/2\big)}{\prod_{C\in\mathcal{C}} \{\det(D_C)\}^{(\delta+|C|-1)/2}\,\Gamma_{|C|}\big((\delta+|C|-1)/2\big)}. \quad (17)


This distribution on Ω is called the G-Wishart distribution. It forms the Diaconis-Ylvisaker conjugate prior for the precision matrix of a Gaussian distribution Markov with respect to the decomposable graph G. However, unlike the hyper inverse-Wishart prior, the G-Wishart prior can be generalized to non-decomposable graphical models as well, except that there is no closed-form analytical expression for the normalizing constant I_G(δ, D).

Letac and Massam [63] introduced a more general class of conjugate priors, namely the W_{P_G}-Wishart class of priors for the precision matrix Ω. The W_{P_G}-Wishart distribution W_{P_G}(α, β, D) has three sets of parameters: α and β, which are suitable functions defined on the cliques and separators of the graph, and a scale matrix D. We note that the above class of prior distributions reduces to the Wishart distribution for a fully connected graph, and includes the G-Wishart distribution for decomposable graphs as a special case with suitable choices of α and β.

    8.4 Posterior computation in graphical models

    8.4.1 Exact expressions under conjugacy

Denoting by X the n × p data matrix corresponding to a random sample from a Gaussian distribution N_p(0, Ω^{-1}), conditional on a specific graph G, the posterior density p(Ω | X, G) of Ω is given by

I_G(\delta+n, D+S_n)^{-1}\{\det(\Omega)\}^{(\delta+n-2)/2}\exp[-\mathrm{tr}\{(D+S_n)\Omega\}/2], \quad (18)

where S_n = X^T X.

Carvalho et al. [22] proposed a sampler for the hyper inverse-Wishart distribution corresponding to a decomposable graph, based on the distributions of submatrices of the covariance matrix Σ. Samples taken from the hyper inverse-Wishart can be inverted to get samples from the corresponding G-Wishart distribution.
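For a complete graph the posterior update (18) can be sketched with standard software. The mapping used below, namely that the density in (15) with parameters (δ, D) is a standard Wishart with degrees of freedom δ + p − 1 and scale matrix D^{-1}, is my reading of the parameterization and should be treated as an assumption of the sketch.

```python
# Sketch: conjugate update (delta, D) -> (delta + n, D + S_n) for a complete graph,
# identifying (assumption) the Wishart density (15) with scipy's Wishart having
# df = delta + p - 1 and scale matrix D^{-1}.
import numpy as np
from scipy.stats import wishart

p, n, delta = 4, 200, 3.0
D = np.eye(p)

Omega_true = wishart.rvs(df=delta + p - 1, scale=np.linalg.inv(D), random_state=1)
rng = np.random.default_rng(7)
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega_true), size=n)
S_n = X.T @ X

# Posterior draws of Omega given the data (complete graph, so the G-Wishart is a Wishart)
post = wishart.rvs(df=delta + n + p - 1, scale=np.linalg.inv(D + S_n), size=1000, random_state=2)
print(np.round(post.mean(axis=0), 2))   # posterior mean, to be compared with Omega_true
print(np.round(Omega_true, 2))
```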

Rajaratnam et al. [88] provided a closed-form expression for the posterior mean E(Ω | S_n) of Ω with a W_{P_G}-Wishart prior as

-2\Bigg[\sum_{j=1}^{k}(\alpha_j - n/2)\Big(\big((D+\kappa(nS_n))_{C_j}\big)^{-1}\Big)^{0} - \sum_{j=2}^{k}(\beta_j - n/2)\Big(\big((D+\kappa(nS_n))_{S_j}\big)^{-1}\Big)^{0}\Bigg], \quad (19)

where for a p × p matrix A = ((a_{ij})), A_T^{-1} denotes the inverse (A_T)^{-1} of the submatrix A_T, and (A_T)^0 = ((a^*_{ij})) denotes the p × p matrix such that a^*_{ij} = a_{ij} for (i, j) ∈ T × T, and 0 otherwise. Here α_j and β_j are the j-th components of the parameters α and β respectively. They also obtained the expression for the Bayes estimator with respect to Stein's loss L(Ω̂, Ω) = tr(Ω̂ − Ω)² under the W_{P_G}-Wishart prior.


    8.4.2 MCMC sampling methods for G-Wishart prior

Madigan and York [67] proposed a Metropolis-Hastings-based approach to traverse the space of decomposable graphs 𝒢, while Giudici and Green [43] used a reversible jump MCMC sampler that also includes the precision matrix as one of the state variables.

When the graph is not necessarily decomposable, the G-Wishart prior is still conjugate, but the normalizing constant I_G(δ, D) does not have a simple expression. Only recently did Uhler et al. [97] derive an expression, but it is still extremely difficult to compute for use in inference. Lenkoski and Dobra [62] developed a Laplace approximation \hat{I}_G(\delta, D) = \exp\{l(\hat{\Omega})\}(2\pi)^{p/2}\det(H(\hat{\Omega}))^{-1/2}, where l(\Omega) = ((\delta-2)\log\det(\Omega) - \mathrm{tr}(D\Omega))/2, the matrix \hat{\Omega} is the mode of the G-Wishart density and H is the Hessian matrix associated with l, but it lacks accuracy unless the clique sizes are small.

Atay-Kayis and Massam [3] proposed a Monte Carlo based method to sample from the G-Wishart distribution as well as to compute the prior normalizing constant. They considered the Cholesky decompositions D^{-1} = Q^T Q and Ω = Φ^T Φ, and defined Ψ = ΦQ. Since ω_{ij} = 0 for (i, j) ∉ E, the elements ψ_{ij} of Ψ for (i, j) ∉ E are functions of ψ_{ij}, (i, j) ∈ E, and ψ_{ii}, i = 1, . . . , p. Thus the free elements appearing in the Cholesky decomposition of Ω are given by Ψ^E = (ψ_{ij}, (i, j) ∈ E; ψ_{ii}, i = 1, . . . , p). They showed that the free elements have density p(Ψ^E) ∝ f(Ψ^E) h(Ψ^E), where f(Ψ^E) = exp{−∑_{(i,j)∉E} ψ_{ij}²/2} is a function of the non-free elements of Ψ, which in turn can be uniquely expressed in terms of Ψ^E. Furthermore, h(Ψ^E) is the product of the densities of random variables ψ_{ii}² ∼ χ²_{δ+v_i}, i = 1, . . . , p, and ψ_{ij} ∼ N(0,1), (i, j) ∈ E, where v_i = |{j : j > i, (i, j) ∈ E}|. Generating samples of Ω then uses an acceptance-rejection method based on the above density of the free elements Ψ^E. The normalizing constant I_G(δ, D) can be expressed as the product of a known constant and the expected value of f(Ψ^E). Using samples from the distribution of Ψ^E, straightforward Monte Carlo computation gives I_G(δ, D). However, the Monte Carlo integration method is computationally expensive for large non-complete prime components of the graph, owing to a matrix completion step involving the non-free elements of Ψ.

To alleviate this problem, Wang and Carvalho [108] used a prime-component decomposition so that sampling is carried out individually in lower-dimensional subgraphs of G. However, owing to the dependence of the sampler on (δ, D, G), the acceptance rate of the MCMC procedure can be very low. Mitsakakis et al. [73] developed an independent Metropolis-Hastings algorithm for sampling from W_G(δ, D) based on the density of Ψ^E. This method, though it improves the acceptance rate over that of Wang and Carvalho [108], still suffers from low acceptance rates and slow mixing for large graphs. Along similar lines to Mitsakakis et al. [73], Dobra et al. [33] used a random walk Metropolis-Hastings sampler, where only one entry of Ψ^E is perturbed in a single step, compared to changing all the entries of Ψ^E as in the former approaches. Though this method results in increased efficiency of the sampler, it still suffers from the matrix completion problem for the non-free elements, thus requiring a time complexity of O(p⁴) for each Monte Carlo sample. This results in painfully slow computations for large graphs.

Jones et al. [53] used the method of Atay-Kayis and Massam [3] to compute prior and posterior normalizing constants of the G-Wishart distribution, and traversed the space of graphs using the shotgun stochastic search algorithm with uniform or Erdős–Rényi type priors on the space of graphs. Though the method performs well in low dimensions, it fails to scale up to high dimensions owing to the large search space.

Lenkoski and Dobra [62] and Mitsakakis et al. [73] proposed to use the Bayesian iterative proportional scaling algorithm developed by Piccioni [86] and Asci and Piccioni [2] to sample from the G-Wishart distribution. Their method requires enumeration of the maximal cliques of G, which is an NP-hard problem, and also requires the inversion of large matrices, which is computationally burdensome.

As discussed before, Giudici and Green [43] developed a reversible jump MCMC method to learn the graphical structure by sampling over the joint space of (Ω, G). Dobra et al. [33] also developed a reversible jump MCMC method over (Ω, G), thus avoiding the issue of searching over a large space of graphs. But their method cannot avoid the problems involving the computation of the prior normalizing constants and matrix completion. The crucial bottleneck in computing the acceptance probabilities of the MCMC-based procedures developed for joint exploration of (Ω, G) is the ratio of prior normalizing constants I_{G^{-e}}(δ, D)/I_G(δ, D), where G^{-e} is the graph obtained from G = (V, E) by deleting a single edge e ∈ E. For the choice D = I_p, Letac et al. [64] showed that the above ratio can be reasonably approximated by a constant times the ratio of two gamma functions as

I_{G^{-e}}(\delta, D)/I_G(\delta, D) \approx (2\sqrt{\pi})^{-1}\,\Gamma\Big(\frac{\delta+d}{2}\Big)\Big/\Gamma\Big(\frac{\delta+d+1}{2}\Big), \quad (20)

where d is the number of paths of length two between the two endpoints of the edge e. They showed that under certain conditions and graph configurations, the approximation performs reasonably well.
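The approximation (20) is straightforward to evaluate on the log scale; the sketch below does so for a few values of d with an arbitrary δ.

```python
# Sketch: evaluating the approximation (20) to I_{G_{-e}}(delta, I_p) / I_G(delta, I_p)
# for a few values of d (the number of two-step paths between the endpoints of e).
import numpy as np
from scipy.special import gammaln

def ratio_approx(delta, d):
    log_ratio = -np.log(2 * np.sqrt(np.pi)) + gammaln((delta + d) / 2) - gammaln((delta + d + 1) / 2)
    return np.exp(log_ratio)

for d in (0, 1, 2, 5):
    print(d, round(float(ratio_approx(3.0, d)), 4))
```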

To circumvent the problem of computing the normalizing constant I_G(δ, D) for arbitrary graphs, especially in the high-dimensional scenario, Wang and Li [109] proposed a double reversible jump MCMC algorithm using the partial analytic structure (PAS) of the G-Wishart distribution. Their proposed algorithm involves a Metropolis-Hastings step to move from G = (V, E) to G′ = (V, E′), where E′ differs from E in only one edge, with the acceptance probability of the Metropolis-Hastings step having an exact analytical expression which is easy to compute.

A completely new approach, called the birth-and-death MCMC (BDMCMC) method, was introduced by Mohammadi and Wit [74] for graphical model selection; it explores the space of graphs by the addition (birth) or deletion (death) of an edge in the graph. They construct a continuous-time birth-death Markov process on Ω, more specifically, birth and death events occur as independent Poisson processes, such that under suitable conditions the process has stationary distribution p(Ω, G | X^{(n)}). The birth and death rates are given by

\beta_e(\Omega) = \frac{P(G^{+e}, \Omega^{+e}\setminus(\omega_{ij},\omega_{jj}) \mid X^{(n)})}{P(G, \Omega\setminus\omega_{jj} \mid X^{(n)})}, \quad \text{for each } e \in \bar{E},

\delta_e(\Omega) = \frac{P(G^{-e}, \Omega^{-e}\setminus\omega_{jj} \mid X^{(n)})}{P(G, \Omega\setminus(\omega_{ij},\omega_{jj}) \mid X^{(n)})}, \quad \text{for each } e \in E; \quad (21)

here G^{+e} and G^{-e} are the graphs obtained from G after the addition or removal of an edge e. The direct sampler proposed in Lenkoski [61] is used to sample from the posterior distribution of Ω. The BDMCMC algorithm is easy to implement and scales well to high-dimensional problems, and it also has superior performance in structure learning compared with the Bayesian methods discussed earlier.

    8.4.3 MCMC sampling methods for the Bayesian graphical lasso and variants

Wang [106] developed a block Gibbs sampler to simulate from the posterior distribution of the Bayesian graphical lasso through a reparameterization in terms of the symmetric matrix ϒ = ((τ_{ij})) with zeros as diagonal entries and the latent scale parameters τ_{ij} as upper-diagonal entries. The matrices Ω, S and ϒ are then partitioned with respect to the last column and row as

\Omega = \begin{pmatrix} \Omega_{11} & \omega_{12} \\ \omega_{12}^T & \omega_{22} \end{pmatrix}, \quad S = \begin{pmatrix} S_{11} & s_{12} \\ s_{12}^T & s_{22} \end{pmatrix}, \quad \Upsilon = \begin{pmatrix} \Upsilon_{11} & \tau_{12} \\ \tau_{12}^T & 0 \end{pmatrix}. \quad (22)

With the change of variables (\omega_{12}, \omega_{22}) \to (\beta = \omega_{12}, \ \gamma = \omega_{22} - \omega_{12}^T\Omega_{11}^{-1}\omega_{12}), the conditional posterior distribution of (β, γ) given the rest is given by

p(\beta, \gamma \mid \Omega_{11}, \Upsilon, X^{(n)}, \lambda) \propto \gamma^{n/2}\exp\Big\{-\frac{s_{22}+\lambda}{2}\,\gamma\Big\} \times \exp\Big(-\frac{1}{2}\big[\beta^T\{D_\tau^{-1} + (s_{22}+\lambda)\Omega_{11}^{-1}\}\beta + 2 s_{12}^T\beta\big]\Big),

where D_τ = diag(τ_{12}), so that the conditional distributions of γ and β are an independent gamma and a normal respectively, whereas the inverses of the latent scale parameters τ_{ij} have independent inverse-Gaussian distributions. The resulting block Gibbs sampler iteratively samples one column of Ω at a time. It is interesting to note that the positive definiteness constraint is maintained throughout, owing to the fact that the Schur complement γ is always positive. The posterior samples hence obtained cannot be directly used for structure learning though, since the Bayesian graphical lasso prior puts zero mass on the event {ω_{ij} = 0}.
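The positivity claim above is a property of Schur complements and can be checked directly; the sketch below partitions an arbitrary positive definite Ω as in (22) and verifies that γ > 0 and that reassembling the last column keeps the matrix positive definite.

```python
# Sketch: the column partition (22) and the change of variables used by the block Gibbs
# sampler; gamma = omega_22 - omega_12' Omega_11^{-1} omega_12 is a Schur complement and is
# positive whenever Omega is positive definite, so the update stays inside M_p^+.
import numpy as np

rng = np.random.default_rng(8)
p = 5
A = rng.standard_normal((p, p))
Omega = A @ A.T + p * np.eye(p)                     # an arbitrary positive definite "current state"

Omega_11 = Omega[:-1, :-1]                          # partition w.r.t. the last row and column
omega_12 = Omega[:-1, -1]
omega_22 = Omega[-1, -1]

beta = omega_12
gamma = omega_22 - omega_12 @ np.linalg.solve(Omega_11, omega_12)
print("gamma =", round(float(gamma), 4))            # strictly positive

# Reassembling with any beta and any gamma > 0 yields a positive definite matrix again.
Omega_new = Omega.copy()
Omega_new[:-1, -1] = Omega_new[-1, :-1] = beta
Omega_new[-1, -1] = gamma + beta @ np.linalg.solve(Omega_11, beta)
print("min eigenvalue:", round(float(np.linalg.eigvalsh(Omega_new).min()), 4))
```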

As discussed in the previous sections, computation poses a great deal of challenge in Bayesian graphical models in high-dimensional situations. To deal with large dimensions and arbitrary graphs, Wang [107] developed a new technique called stochastic search structure learning for precision as well as covariance matrices in a graphical model setup, based on soft-spike-and-slab priors on the elements of the matrix. This work also addresses the issue of structure learning of graphs using a fully Bayesian model by specifying priors on binary variables Z = (z_{ij})_{i<j}, which are indicators for the edges of the underlying graph G. The hierarchical prior is then specified as

p(\Omega \mid Z, \theta) = C(Z, \nu_0, \nu_1, \lambda)^{-1}\prod_{i<j} N(\omega_{ij} \mid 0, \nu_{z_{ij}}^2)\prod_{i=1}^{p}\mathrm{Exp}(\omega_{ii} \mid \lambda/2),

p(Z \mid \theta) = C(\theta)^{-1} C(Z, \nu_0, \nu_1, \lambda)\prod_{i<j}\pi^{z_{ij}}(1-\pi)^{1-z_{ij}}, \quad (23)

where θ = (ν_0, ν_1, π, λ), N(a | 0, ν²) is the density of a normal distribution with mean zero and variance ν², C(θ) and C(Z, ν_0, ν_1, λ) are normalizing constants, and ν_{z_{ij}} is ν_0 or ν_1 according as z_{ij} is 0 or 1. The hyperparameter π ∈ (0,1) controls the prior distribution of the binary edge indicators in Z. The hierarchical prior specification above leads to the following prior on Ω:

p(\Omega) = C(\theta)^{-1}\prod_{i<j}\big\{(1-\pi)N(\omega_{ij} \mid 0, \nu_0^2) + \pi N(\omega_{ij} \mid 0, \nu_1^2)\big\}\prod_{i=1}^{p}\mathrm{Exp}(\omega_{ii} \mid \lambda/2)\,1(\Omega \in M_p^+). \quad (24)

A small ν_0 > 0 and a large value of ν_1 > 0 induce a soft-spike-and-slab prior on the elements of Ω. Being a two-component mixture of normals, the above prior facilitates graph structure learning via the latent binary indicators Z. The sampling procedure for generating samples from the posterior distribution p(Ω | X^{(n)}, Z) is identical to the block Gibbs sampler proposed in [106], via the introduction of the p-dimensional symmetric matrix V = ((ν²_{ij})) with zeros on the diagonal and ν²_{ij} = ν²_{z_{ij}}, i < j, as the upper-diagonal entries. The conditional posterior distributions of the binary indicator variables Z given the data and Ω are independent Bernoulli with success probabilities

p(z_{ij} = 1 \mid \Omega, X^{(n)}) = \frac{\pi N(\omega_{ij} \mid 0, \nu_1^2)}{\pi N(\omega_{ij} \mid 0, \nu_1^2) + (1-\pi)N(\omega_{ij} \mid 0, \nu_0^2)}. \quad (25)

Though the structure learning accuracy of the above method is comparable to that of other Bayesian structure learning methods, the block Gibbs sampler based computational approach makes it simple and fast, especially when scaling up to large dimensions.
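The edge-inclusion probability (25) is easy to evaluate; the sketch below shows how it moves from near 0 to near 1 as |ω_{ij}| grows, for illustrative values of the hyperparameters (π, ν_0, ν_1).

```python
# Sketch: the conditional edge-inclusion probability (25) under the
# soft-spike-and-slab prior, as a function of the current value of omega_ij.
import numpy as np
from scipy.stats import norm

def edge_prob(omega_ij, pi=0.5, v0=0.05, v1=1.0):
    slab = pi * norm.pdf(omega_ij, 0.0, v1)
    spike = (1.0 - pi) * norm.pdf(omega_ij, 0.0, v0)
    return slab / (slab + spike)

for w in (0.0, 0.05, 0.2, 1.0):
    print(f"omega_ij = {w:>4}: P(z_ij = 1 | Omega) = {edge_prob(w):.3f}")
```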

    8.4.4 Laplace approximation to compute posterior model probabilities

For graphical structure learning, in the absence of explicit expressions, one may compute the posterior probabilities of various models by reversible jump Markov chain Monte Carlo, which is computationally expensive. Banerjee and Ghosal [7] proposed to directly compute the marginal posterior probabilities of models using a Laplace approximation method. The main idea is to expand the log-likelihood around the posterior mode, which can be identified as the graphical lasso within that model and computed through efficient algorithms, and then integrate the resulting approximate likelihood function. The resulting computation is very fast, but a drawback is that the approach works only for 'regular models', where none of the components of the posterior mode is 0, because the objective function is non-smooth at any point having a coordinate equal to 0. The problem is partly alleviated by the fact that for every 'non-regular model', there is a regular model with higher posterior probability. Hence, at least for model selection, the approach can be restricted to the latter class.
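A rough sketch of the Laplace-approximation idea is given below: restrict Ω to the free elements of a candidate graph, locate the within-model mode numerically, and apply the Laplace formula with a finite-difference Hessian. The Laplace prior scale, the generic optimizer used in place of a dedicated graphical lasso solver, and the toy data are all simplifications; this is not the exact procedure of Banerjee and Ghosal [7].

```python
# Rough sketch: Laplace approximation to the log marginal likelihood of one graph.
# Parameterize Omega by its free elements (diagonal + edges), find the mode of the
# unnormalized log posterior h, then use
#   log m_hat = h(mode) + (d/2) log(2 pi) - (1/2) log det(-Hessian of h).
import numpy as np
from scipy.optimize import minimize

def make_omega(theta, edges, p):
    Omega = np.zeros((p, p))
    Omega[np.diag_indices(p)] = theta[:p]
    for idx, (i, j) in enumerate(edges):
        Omega[i, j] = Omega[j, i] = theta[p + idx]
    return Omega

def log_post(theta, S, n, edges, p, lam=1.0):
    Omega = make_omega(theta, edges, p)
    sign, logdet = np.linalg.slogdet(Omega)
    if sign <= 0:
        return -1e10                                  # outside the positive definite cone
    loglik = 0.5 * n * (logdet - np.sum(S * Omega))
    return loglik - lam * np.sum(np.abs(theta))       # Laplace prior on the free elements

def laplace_log_marginal(S, n, edges, p):
    d = p + len(edges)
    obj = lambda th: -log_post(th, S, n, edges, p)
    opt = minimize(obj, np.concatenate([np.ones(p), np.full(len(edges), 0.1)]), method="BFGS")
    eps, H = 1e-4, np.zeros((d, d))                   # central-difference Hessian of obj
    for a in range(d):
        for b in range(d):
            ea, eb = np.eye(d)[a] * eps, np.eye(d)[b] * eps
            H[a, b] = (obj(opt.x + ea + eb) - obj(opt.x + ea - eb)
                       - obj(opt.x - ea + eb) + obj(opt.x - ea - eb)) / (4 * eps ** 2)
    return -opt.fun + 0.5 * d * np.log(2 * np.pi) - 0.5 * np.linalg.slogdet(H)[1]

rng = np.random.default_rng(9)
p, n = 3, 200
Omega_true = np.array([[1.0, 0.4, 0.0], [0.4, 1.0, 0.4], [0.0, 0.4, 1.0]])
X = rng.multivariate_normal(np.zeros(p), np.linalg.inv(Omega_true), size=n)
S = X.T @ X / n
for edges in ([(0, 1), (1, 2)], [(0, 2)]):            # two candidate graphs
    print(edges, round(laplace_log_marginal(S, n, edges, p), 2))
```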

    References

1. A. Armagan, D. B. Dunson, and J. Lee. Generalized double Pareto shrinkage. Statistica Sinica, 23(1):119, 2013.

2. C. Asci and M. Piccioni. Functionally compatible local characteristics for the local specification of priors in graphical models. Scandinavian Journal of Statistics, 34(4):829–840, 2007.

3. A. Atay-Kayis and H. Massam. A Monte-Carlo method for computing the marginal likelihood in nondecomposable Gaussian graphical models. Biometrika, 92(2):317–335, 2005.

4. Y. A. Atchadé. On the contraction properties of some high-dimensional quasi-posterior distributions. The Annals of Statistics, 45(5):2248–2273, 2017.

    5. S. Banerjee. Posterior convergence rates for high-dimensional precision matrix