ices report 12-25 a metropolized adjusted newton algorithm ... · ples from target density ˇ(x),...

ICES REPORT 12-25

July 2012

A Metropolized Adjusted Newton Algorithm for MarkovChain Monte Carlo Simulations

by

Tan Bui-Thanh and Omar Ghattas

The Institute for Computational Engineering and SciencesThe University of Texas at AustinAustin, Texas 78712

Reference: Tan Bui-Thanh and Omar Ghattas, A Metropolized Adjusted Newton Algorithm for Markov ChainMonte Carlo Simulations, ICES REPORT 12-25, The Institute for Computational Engineering and Sciences, TheUniversity of Texas at Austin, July 2012.

A METROPOLIZED ADJUSTED NEWTON ALGORITHM FORMARKOV CHAIN MONTE CARLO SIMULATIONS

TAN BUI-THANH † AND OMAR GHATTAS ‡

Abstract. We propose a Metropolized adjusted Newton algorithm (MANA) for local Metropolis-Hastings Markov chain Monte Carlo (MCMC) simulations. The method can be considered as anEuler-Maruyama discretization of the Langevin diffusion on Riemann manifold with piecewise con-stant Hessian of the negative logarithm of the target density as the metric tensor. The MANAproposal consists of a deterministic and a stochastic parts. The former corresponds to a Newtonstep which attempts to drift the current state to region of higher probability, and hence potentiallyincreasing the acceptance probability. The latter is distributed by a Gaussian tailored by the lo-cal Hessian as the inverse covariance matrix. The proposal step is then corrected by the standardMetropolization to guarantee that the target density is the stationary distribution. We then studyasymptotic convergence and geometric ergodicity of MANA chains. At the heart of the paper is theoptimal scaling analysis, in which we show that, for inhomogeneous product target distribution atstationary, the MANA proposal variance scale like O

(n−1/3

)for the average acceptance rate to be

bounded away from zero, as the dimension n approaches infinity. As a result, a MANA chain exploresthe stationary distribution in O

(n1/3

)steps, regardless of the variance of the target density. The

optimal scaling behavior of MANA chains in transient phase is also discussed for Gaussian targetdensities, and an extension to inverse problem using the Bayesian paradigm is presented. Theoreticaloptimal scaling result is verified for two i.i.d. targets in high dimensions. We also compare the MANAapproach with other similar Hessian-aware methods on i.i.d. target, Bayesian logistic regression, andlog-Gaussian Cox process examples. Numerical results show that MANA outperforms the othersby providing Markov chains with small burn-in and small correlation length. Finally, we apply theMANA MCMC method to a Bayesian inverse thermal fin problem to predict the posterior mean andits uncertainty.

Key words. Markov chain Monte Carlo; Metropolis-Hastings; Optimal scaling; Hessian; Langevindiffusion; Riemannian manifold; Euler-Maruyama discretization.

AMS subject classifications. 65N30, 65N12, 65N15, 65N22.

1. Introduction. Perhaps the Metropolis-Hastings (MH) algorithm, first devel-oped by Metropolis et al. [23] and then generalized by Hastings [17], is the mostpopular Markov chain Monte Carlo method. Its popularity and attractiveness comefrom the easiness in implementation and minimal requirements on the target densityand the proposal density [29]. Indeed, a MH pseudo code for generating N sam-ples from target density π (x), x ∈ Rn, can be succinctly described in Algorithm 1.One of the simplest instances of Algorithm 1 is the random walk Metropolis-Hastings(RWMH) in which the proposal density q(xk, y ) is the isotropic Gaussian kernel

q (xk,y) =1√

(2π)nσ2n

exp

(− 1

2σ2‖y − xk‖2

),

and hence the proposal y is given by

y = xk + σN (0, In) , (1.1)

where In is the n × n identity matrix. The above proposal can be considered as theEuler-Maruyama discretization, with step size ∆t = σ2, of the following stochastic

‡Institute for Computational Engineering & Sciences, The University of Texas at Austin, Austin,TX 78712, USA.‡Institute for Computational Engineering & Sciences, Jackson School of Geosciences, and De-

partment of Mechanical Engineering, The University of Texas at Austin, Austin, TX 78712, USA.

1

2 T. Bui-Thanh and O. Ghattas

Algorithm 1 Metropolis-Hastings MCMC Algorithm

Choose initial x0

for k = 0, . . . , N − 1 do1. Draw a proposal y from the proposal density q(xk, y )2. Compute π(y), q(xk,y), and q(y,xk)

3. Compute the acceptance probability α(xk,y) = min

1, π(y)q(y,xk)π(xk)q(xk,y)

4. Accept and set xk+1 = y with probability α(xk,y). Otherwise, reject andset xk+1 = xk

end for

differential equation

dx(t) = dW (t),

with W (t) as the standard n-dimensional Brownian motion. The RWMH method issimple except for a small detail: how to choose the optimal time step σ2?

Choosing the time step, also known as the proposal variance, σ2 optimally isvital since it determines the mixing time which is the number of steps to explorethe stationary distribution. If σ2 is too small, it is most likely that all the proposalmoves are accepted, but the chain explores π (x) very slowly since the proposed jumpis small. On the other hand, if the proposal variance is large, it is most likely thatthe proposed move is in low probability regions, and hence rejected. This case alsoleads to slow mixing since the chain virtually does not move at all. As a result,the proposal variance should be in between these extremes, and this is known as theGoldilocks principle [37]. It turns out that the proposal variance for RWMH mustscale like σ2 = `2n−1, with some constant `, for the average acceptance rate to bebounded away from zero as the dimension n approaches infinity [15,30]. In this case,the optimal average acceptance rate can be shown to be 0.234.

Meanwhile, the Langevin dynamics governed by the following stochastic differen-tial equation

dx(t) =1

2∇ log (π (x)) dt+ dW (t) (1.2)

is well-known to admit π (x) as its stationary distribution. A natural idea analogousto RWMH is to discretize the Langevin equation using the Euler-Maruyama schemeto construct the proposal as

y = xk +σ2

2∇ log (π (xk)) + σN (0, In) . (1.3)

Although one can show that the diffusion process given by (1.2) converges to π (x)as its unique stationary measure [34], the discretization (1.3) may not inherit thisproperty as demonstrated in [35]. In order to avoid this undesired behavior, theEuler-Maruyama scheme (1.3) is typically equipped with the Metropolis mechanismoutlined in Steps 2, 3, and 4 of Algorithm 1 so that the Markov chain converges tothe desired distribution given by the target π (x). This gives rise to the so-calledMetropolis-adjusted Langevin algorithm (MALA) [35].

Compared to the random walk proposal (1.1), the Langevin proposal (1.3) has

the additional deterministic term σ2

2 ∇ log (π (xk)) (also known as the drift term). A

Analysis of MANA 3

careful look at this term reveals that it is the scaled negative gradient of f (xk) =

− log (π (xk)), and hence xk + σ2

2 ∇ log (π (xk)) is nothing more than a gradient de-scent step for minimizing f (x) in the optimization literature [26]. Consequently, thedrift term takes the current state xk to a point with smaller value of f (x), i.e. higherprobability density π (x), provided that σ2 is sufficiently small. This explains whyMALA chains in general mix better and explore the target faster than those gener-ated by RWMH. Indeed, the work in [31] shows that the optimal proposal variancescales like σ2 = `2n−1/3, and the optimal acceptance rate is approximately 0.574, asubstantial improvement over the RWMH algorithm.

The Langevin dynamics can be viewed as a particular instance of the followingstochastic differential equation

dx(t) = b (x) dt+ β (x) dW (t), (1.4)

where

bi (x) =1

2

n∑j=1

aij (x)∂

∂xjlog (π (x)) +

√δ (x)

n∑j=1

∂

∂xj

(aij (x)

√δ (x)

),

a (x) = β (x)βT (x), and δ (x) = det (a (x)). It can be shown that π (x) is the uniquestationary measure for (1.4) under mild conditions [34]. Equation (1.4) is also knownas Langevin diffusion on Riemann manifold with metric tensor a−1 (x) [16]. Clearly,(1.4) collapses to the standard Langevin dynamics (1.2) if the metric tensor a−1 isthe identity matrix.

Following the spirit of RWMH and MALA, one can discretize (1.4) as

y = xk +σ2

2a (xk)∇ log (π (xk)) + σβ (xk)N (0, In)

+ σ2√δ (xk)

n∑j=1

∂

∂xj

(aij (xk)

√δ (xk)

), (1.5)

and use this result as the proposal for the MH algorithm. However, each proposalstep is now complicated and expensive since (1.5) involves the derivative of the metrictensor in the last term. Note that this last term can be viewed as the changes inlocal curvature of the manifold. This observation immediately invites us to use theHessian ∇2f (x) as metric tensor, i.e., a−1 (x) = ∇2f (x). As a consequence, the drift

term σ2

2 a (xk)∇ log (π (xk)) becomes a scaled Newton step for minimizing f (x) [26].Similar to the above discussion, the drift term, assuming zero curvature changes,takes the current state xk to a point with smaller value of f (x), which is the same ashigher probability density π (x), provided that σ2 is sufficiently small and the Hessian∇2f (x) is positive definite. Since Newton method is in general more efficient thanthe gradient descent in finding smaller f (x) [26] we expect that MCMC chains usingproposal (1.5) mix better and explore the target faster than those using MALA.

In fact, using the Hessian to speed up MCMC simulations is not new. This canbe traced back to [38] under the name of smart Monte Carlo simulation, and itsconnection to the Newton method in optimization is succinctly presented in [11, 22].Since then, variants of the method have been rediscovered, including Hessian-basedMCMC [18,28] and Newton-like methods for MCMC [22,43,44]. On the other hand,it can be viewed as a preconditioned Langevin MCMC approach [40] or a manifold


Langevin MCMC [16]. These Hessian-aware approaches have been observed, in manycases, to outperform the other existing methods including MALA [22, 28]. Mostof them, however, use either fixed or random proposal variance σ2 without respectto the dimensions of the problem under consideration, especially large dimensions.Optimal scaling results for these Hessian-aware methods, similar to those of RWMHand MALA, are therefore desirable in order to use them efficiently in practice, butthere has been no attempt in this direction.

We take up the idea of using Hessian to speed up MCMC simulations again in thispaper. We shall show how the connection presented above for the proposal y withoptimization and with the Euler-Maruyama discretization of stochastic differentialequations leads us to a new method called Metropolis-adjusted Newton algorithm(MANA) in Section 2. After that, we discuss asymptotic convergence of MANAchains in Section 3, followed by their geometric ergodicity in Section 4. The optimalscaling analysis for the MANA method is carried out at length in Section 5. Inparticular, we aim to answer the question on how the MANA proposal variance scalesso that the asymptotic average acceptance rate is bounded away from zero, as thedimension approaches infinity. We also discuss the optimal scaling in transient phaseof MANA chain for Gaussian targets and an extension of optimal scaling analysisto Bayesian posterior measure with Gaussian priors. Next, we verify the theoreticalresults and compare MANA with the simplified manifold MALA of [16], the Hessian-based MCMC of [28], the stochastic Newton of [22], and a variant of MANA in Section6. As an application to uncertainty quantification, we apply the MANA approach toestimate the posterior mean of a Bayesian inverse thermal fin problem and discuss itsuncertainty. Finally, Section 7 concludes the paper with discussions on future studies.

2. A Metropolis-adjusted Newton algorithm (MANA). Motivated by thediscussion in Section 1, we would like to use the Hessian as the metric tensor, i.e.,

a−1 (x) = ∇2f (x) .

However, the last term of (1.5) is in general not zero. To avoid this expensive term, wemake the following assumption. We assume that the local curvature of the manifoldbetween xk and y is constant, and drop the last term. In other words, we discretizethe general Langevin diffusion (1.4) using an Euler-Maruyama scheme with piecewiseconstant metric tensor. If we do that, we end up with the new proposal

y = xk +σ2

2A∇ log (π (xk)) + σN (0,A) , (2.1)

where A is the inverse of the Hessian, i.e.,

A−1 = H = −∇2 log (π (xk)) , (2.2)

evaluated at xk and assumed to be fixed when we compute the probability q(xk,y)of moving from xk to y, and the the probability q(y,xk) of moving from y to xk:

q(xk,y) =

√detH√(2π)n

exp

− 1

2σ2

∥∥∥∥y − xk − σ2

2A∇ log (π (x))

∥∥∥∥2A−1

, (2.3)

q(y,xk) =

√detH√(2π)n

exp

− 1

2σ2

∥∥∥∥xk − y − σ2

2A∇ log (π (y))

∥∥∥∥2A−1

, (2.4)

Analysis of MANA 5

where we have defined the weighted norm ‖·‖A−1 =(·,A−1·

)for any symmetric

positive definite matrix A.The deterministic part of (2.1) is therefore a (half) Newton step at x if σ = 1.

For this reason, we shall call our method as Metropolis-adjusted Newton algorithm(MANA), and its complete description is given in Algorithm 2.

Algorithm 2 Metropolis-adjusted Newton Algorithm

Choose initial x0

for k = 0, . . . , N − 1 do1. Compute ∇ log (π (xk)) ,H(xk) = −∇2 log (π (xk))2. Draw sample y from the proposal density q(xk, · ) defined in equation (2.3)3. Compute π(y), q(xk,y) as in (2.3), and q(y,xk) as in (2.4)

4. Compute the acceptance probability α(xk,y) = min(

1, π(y)q(y,xk)π(xk)q(xk,y)

)5. Accept and set xk+1 = y with probability α(xk,y). Otherwise, reject andset xk+1 = xk

end for

3. Asymptotic convergence of MANA. In this section we discuss the asymp-totic convergence of the Markov chain generated by MANA. To begin, we recall theV-norm, see [35] for example, of any probability measure µ as

‖µ‖V = supf :|f |≤V

∫Rn

f µ (dx) ,

where V is some measurable function on (Rn,B (Rn)). Here, B (Rn) denotes the Borelσ−algebra of Rn. Note that if V = 1, the V−norm is the standard total variationnorm, which will be denoted as ‖·‖TV .

Now, denote PN (x, ·) as the nth-step transitional law of the Markov chain.The following standard result [24,33,42] for Markov chains generated by Metropolis-Hastings algorithm applied to MANA chain is straightforward.

Theorem 3.1. Assume that the proposal in (2.3) is continuous and positiveon Rn × Rn, and the target density π (x) is finite everywhere. Then the followingasymptotic results for MANA chain hold:

i) limN→∞∥∥PN (x, ·)− π (·)

∥∥TV

= 0, ∀x ∈ Rn.ii) If h (x) : Rn → R is integrable, then with probability 1

limN→∞

1

N

N∑i=1

h (xi) =

∫Rn

hπ (dx) .

Proof. The fact that a MANA chain is Markov is clear. Since A is constantduring the Metroplization, the detailed balance is satisfied. A simple modification ofthe standard arguments [24,33,42] ends the proof.

4. Geometric Ergodicity of Metropolis-adjusted Newton algorithm. Sec-tion 3 ensures that if a MANA chain is long enough, its distribution converges to thelaw of π. However, the convergence rate is unspecified. In fact, the convergence couldbe very slow. A popular concept to assess the performance of a MCMC method isgeometric ergodicity. A geometric ergodic Markov chain with stationary distributionπ satisfies∥∥PN (x, ·)− π (·)

∥∥TV≤M (x) ρN , ∀N ∈ N, π − a.e. x ∈ Rn,


with M (x) <∞ and ρ < 1.Instead of striking for total generality, which may not be possible [33, 36], we

restrict our ergodicity result for a class of super-exponential targets [1,19], which arepositive with continuous gradient such that

lim|x|→∞

x

‖x‖∇ log (π (x)) = −∞.

Note that this condition only concerns the tail behavior of the target π. In particular,it implies that the tail is necessary to be at least exponential decaying. We furtherassume that

lim|x|→∞

x

‖x‖∇ log (π (x))

‖∇ log (π (x))‖< 0. (4.1)

Here is a geometric ergodicity result for MANA chains whose proof can be adaptedfrom [1] in a straightforward manner.

Proposition 4.1. Assume that the stationary distribution π of a MANA chainis super-exponential satisfying (4.1). Let g (x) and spectrum of H (x) be bounded.Then there exists V ≥ 1 such that∥∥PN (x, ·)− π (·)

∥∥V≤ CV (x) ρN , ∀N ∈ N, ∀x ∈ Rn.

In particular, the MANA chain is geometric ergodicity.Note that geometric ergodicity is a qualitative result and one needs to be careful

about its meaning. In the worst case scenario, the product CV (x) could be large andthe chain may have unpredicted behavior [33].

Note that the conditions to obtain geometric ergodicity in Proposition 4.1 arenot unique. Indeed, other sufficient conditions are possible [20, 35] though we do notpursue them here.

5. Complexity and optimal scaling analysis for MANA in high dimen-sions. In this section, we discuss the complexity and optimal scaling of the MANAfor high dimensional problems, particularly when the dimension n approaches infin-ity. We shall address of the question how to choose the proposal variance σ2 in (2.1)optimally and the corresponding optimal acceptance rate. The target of interest herehas the following product form

π (x) = Πni=1

1

λiexp

(g

(xiλi

)), (5.1)

where g : R → R− = R \ (0,+∞), and limxi→±∞ g (xi/λi) = −∞ for i = 1, . . . , n.The variance of each component λ2i is allowed to depend on the dimension of theproblem. Note that no explicit dependence of λi on n is required in this paper whilea special form of λ2i is assumed in [3, 4] to make analytical computations and proofspossible. Following [3, 4, 31], we assume that all moments of f are finite and g isinfinitely differentiable with polynomially bounded derivatives.

As discussed above, the proposal variance σ2 should be neither small nor large fora MCMC chain to mix well. A question immediately arises is what optimal criteria issensible. A criteria that leads to explicit expressions for the optimal proposal varianceand optimal acceptance rate is to maximize the square-jump distance while makingthe acceptance probability bounded away from zero [3]. Other criteria have beendiscussed and studied in the literature, e.g., see [30–32] and the references therein.

Analysis of MANA 7

Following the program in [3], we first find the condition for the proposal variance σ2

of the MANA algorithm, under which the acceptance probability is bounded awayfrom zero when the dimension approaches infinity. Then we determine the maximalsquare-jump distance and finally deduce the corresponding optimal acceptance rate.

5.1. Asymptotic average acceptance rate. Let us define

h(j) (xi) =1

λjig(j) (xi/λi) , (5.2)

where the superscript (j) denotes the jth derivative. A simple computation gives

∇ log (π (x)) =[h(1) (x1) , . . . , h(1) (xn)

]T,

H (x) = −∇2 log (π (x)) = −diag[h(2) (x1) , . . . , h(2) (xn)

]T,

where the superscript T denotes matrix or vector tranposition and diag operator mapsa column vector to the corresponding diagonal matrix. The positiveness assumptionon H translates into the positiveness condition on each component, i.e., −h(2) (xi) > 0.We further assume that −g(2) (xi/λi) is bounded below away from zero and aboveaway from infinity uniformly in xi, namely, 0 < ` < −g(2) (xi/λi) < L < ∞. Conse-quently, the ith component of the proposal move in (2.1) can be written as

yi = xi −σ2

2

h(1) (xi)

h(2) (xi)+

σ√−h(2) (xi)

zi,

where zi, i = 1, . . . , n are independent and identically distributed (i.i.d.) standardnormal random variables. If h(j) = 0 our method simplifies to a preconditionedRWMH which we shall exclude in the following analysis. Throughout this section, weassume that our Markov chain is started at stationary, i.e., x0 ∼ π (x).

The proposal densities in (2.3) and (2.4) now read

q(x,y) =

√(−1)

nΠni=1h

(2) (xi)√(2π)n

exp

1

2σ2

n∑i=1

[yi − xi −

σ2

2

h(1) (xi)

h(2) (xi)

]2h(2) (xi)

,

q(y,x) =

√(−1)

nΠni=1h

(2) (xi)√(2π)n

exp

1

2σ2

n∑i=1

[xi − yi −

σ2

2

h(1) (yi)

h(2) (xi)

]2h(2) (xi)

.

In this case, the acceptance probability becomes

α (x,y) = 1 ∧ exp (T (σ)) ,

where operator ∧ takes minimum value of its left and right arguments, and T (σ) isdefined by

T (σ) =

n∑i=1

[h (yi)− h (xi)] +(xi − yi)

2

(h(1) (yi) + h(1) (xi)

)+

σ2

8h(2) (xi)

[(h(1) (yi)

)2−(h(1) (xi)

)2].

In the sequel, we need the following technical Lemma.Lemma 5.1. Suppose T is a real-valued random variable. There hold:


i) For any c > 0,

E [1 ∧ exp (T )] ≥ exp (−c)(

1− E [|T |]c

)ii) If T ∼ N

(µ, δ2

), then the following holds

E [1 ∧ exp (T )] = Φ (µ/δ) + exp(µ+ δ2/2

)Φ (−δ − µ/δ) ,

where Φ (·) is the cumulative distribution function of the standard normaldistribution.

Proof. See [4] for a proof.Next performing Taylor expansion for T (σ) around σ = 0 up to the sixth order

term gives

T (σ) =6∑j=0

σjn∑i=1

Ci,j︸︷︷︸Cj

+σ7n∑i=1

Ci,7 (σ∗i )︸︷︷︸C7

. (5.3)

Recall that our goad is to find conditions for the proposal variance σ2 so that theaverage acceptance probability E [α (x,y)] is bounded away from zero. To this end,we observe that

E [α (x,y)] = E [1 ∧ exp (T (σ))] ≥ exp (−c)[1− E [|T (σ)|]

c

], ∀c > 0,

where the last inequality is from Lemma 5.1. Consequently, our goal is fulfilled ifE [|T (σ)|] is bounded from above. From (5.3), it is sufficient to seek the dependenceof the proposal variance σ on the dimension n for which the expectation of the absolutevalue of each term in (5.3) is bounded from above as n approaches infinity.

Observe that Ci,j , i = 1, . . . , n, are i.i.d. with respect to i, and hence the expecta-tion of any power of Ci,j does not depend on i. For example, we simply write E [C·,j ]as the expectation of any Ci,j , for i = 1, . . . , n. Moreover, Ci,j and Ri do not dependexplicitly on λi, and hence on the dimension n, but only implicitly through g and itsderivatives. This is an important feature of MANA that makes it more efficient, asshall be shown, compared to MALA or RWMH. For example, Ci,3 is given by

Ci,3 = −3zih(1)(xi)h

(2)(xi) + z3h(3)(xi)

12(−h(2)(xi)

)3/2 = −3zig(1)(xi/λi)g

(2)(xi/λi) + z3g(3)(xi/λi)

12(−g(2)(xi/λi)

)3/2 .

It should be pointed out that the Taylor expansion (5.3) of T (σ) is almost identicalto that of the MALA method [3, 4]. However, the coefficients Ci,j corresponding to

(2.2) are those corresponding to the MALA scaled by a factor 1/(√−h(2) (xi)

)j. It

is this extra factor that makes the analytical computation difficult. In particular, itis not clear to us whether the assertions E [Ci,4] = 0 and 2E [Ci,6] + E

[C2i,3

]= 0 still

hold true in general. Therefore, let us study C3, C5, C6 and C7 first, then C4.Similar to the analyis of the MALA method [31], we observe that C0 = C1 =

C2 = 0, and E [C3] = E [C5] = 0 since they involve only odd orders of zi. Here Edenotes the expectation with respect to both x and z.

Analysis of MANA 9

We begin by bounding E [|Cj |] for j = 3, 5. Since |·|2 is convex, an application ofJensen’s inequality gives

E [|C3|] ≤ n1/2σj√E[C2·,j], j = 3, 5. (5.4)

Owing to the assumption that g has polynomially bounded derivatives and f hasfinite moments, E

[C2·,j]

is bounded uniformly in n. As a result, the right side of (5.4)

is bounded as n→∞ if σ2 = O(n−1/3

).

For C6, a simple application of the triangle inequality gives

E [|C6|] ≤ nσ6E [|C·,6|] ,

which is clearly bounded as n→∞ if σ2 = O(n−1/3

).

In order to bound C7 one realizes that Ci,7 is a ratio of product of polynomialsof derivatives of g (xi/λi). This, together with the assumption on g, implies that

|Ci,7| ≤ P1 (xi/λi)P2 (zi)P3 (σ∗i ) ,

where P1, P2 and P3 are positive polynomials. We have the following two observations.First, Exi/λi

[P1 (xi/λi)] and Ezi[P2 (zi)] are bounded independent of i due to the

finiteness of moments of f and the Gaussianity of zi, respectively. Second, P3 (σ∗i )is bounded above by some constant independent of n if σ → 0, since 0 ≤ σ∗i ≤ σ.Therefore, we can bound E [|C7|] as

E [|C7|] ≤ cnσ7Exi/λi[P1 (xi/λi)]Ezi [P2 (zi)]P3 (σ∗i ) ,

for some constant c independent of n and sufficiently large. It is readily to see thatσ2 = O

(n−1/3

)is sufficient for E [|C7|] to be bounded from above.

For C4, we have not yet been able to bound it when the proposal variance scalesas σ2 = O

(n−1/3

), so let us assume that C4 converges to `4K3 in L2 (π (x)⊗N (0, I))

as n→∞. By the triangle and Jensen’s inequalities one has

E [|C4|] ≤ E[∣∣C4 − `4K3

∣∣]+ E [|K3|] ≤√

E[(C4 − `4K3)

2]

+ E[∣∣`4K3

∣∣] ,from which it follows that E [|C4|] is bounded from above.

Remark 5.2. The above assumption on C4 is clearly satisfied if f is a Gaussian,for which K3 = 0. However, whether this holds for larger class of densities f is anopen question.

Combining all the estimates we conclude that E [|T (σ)|] is bounded from above ifσ2 = O

(n−1/3

). This motivates us to define σ2 = `2nε−1/3, where ` is independent of

n. We have showed that the average acceptance rate E [α (x,y)] is bounded away fromzero if ε = 0. Furthermore, the above analysis also shows that E [|T (σ)|] converges tozero, and hence E [α (x,y)] approaches 1, as n→∞ if ε < 0.

What remains to be investigated is the case ε ∈ (0, 1/3). If E [Ci,4] = 0 and2E [Ci,6] + E

[C2i,3

]= 0 hold, an argument similar to that in [3, 4] would yield zero

average acceptance rate as n → ∞. However, the appearance of(√−h(2) (xi)

)jin

the denominator of Ci,j renders analytical computation intractable, and we have notyet been able to obtain such a result. Instead of making further assumptions, we leavethe analysis of the case ε ∈ (0, 1/3) as a subject for future work, and carry on ourprogram with ε ≤ 0.


As argued above, ε < 0 leads to unity average acceptance rate, which correspondsto the case of small proposal variance σ2. That is, the Markov chain evolves veryslowly in exploring the stationary distribution, and hence is not of our interest. Fromnow on to the end of the paper, the proposal variance is specified as σ2 = `2n−1/3.

We have provided sufficient conditions for the average acceptance rate to bebounded away from zero, but we have not yet derived its explicit expression. This isour next goal. We first show that C3 converges weakly using Lindeberg-Feller centrallimit theorem [14]. To that end, we define

Xn,i = σ3Ci,3,

then it is obvious that Xn,i, i = 1, . . . , n, are independent random numbers andE [Xn,i] = 0. Next, simple applications of the integration by parts give

n∑i=1

E[X2n,i

]= σ6

n∑i=1

1

48Exi

[9

(h(1) (xi)

)2−h(2) (xi)

+ 5

(h(3) (xi)

)2(−h(2) (xi)

)3 − 6

]︸︷︷︸

K1

= `6K1,

where K1 is independent of xi, since xi, i = 1, . . . , n, are i.i.d. random variables.Furthermore, since E

[C2·,3]<∞, an application of Lebesgue dominated convergence

theorem yields, for any γ > 0,

limn→∞

n∑i=1

E[X2n,i1|Xn,i|>γ

]= limn→∞

`2E[C2·,31|C·,3|>γ`−2

√n]

= 0,

with 1· denoting the indicator function. At this point, we have demonstrated thatall the conditions of Lindeberg-Feller theorem are satisfied, whence

C3 =

n∑i=1

σ3Ci,3 =

n∑i=1

Xn,i ⇒ `3√K1,

in the sense of weak convergence (also known as convergence in distribution [14]) asn→∞.

Now, combing σ2 = `2n−1/3, E [C5] = 0, and straightforward algebra showsthat C5 converges in L2 (π (x)⊗N (0, I)) to zero. Similarly, one can show that C6

converges to `6K2, where K2 = E [C·,6], in L2 (π (x)⊗N (0, I)). The convergence ofC7 to zero in L2 (π (x)⊗N (0, I)) is clear as shown above.

Since the covergence in Lp (π (x)⊗N (0, I)), for any 1 ≤ p < ∞, implies cover-gence in distribution, we obtain the following weak convergence of T (σ)

T (σ)⇒ N(`6K2 + `4K3, `

6K1

).

Finally, applying the second assertion in Lemma 5.1 gives the following expression forthe asymptotic average acceptance rate

a (`) = limn→∞

E [α (x,y)] = Φ

(`3K2 + `K3√

K1

)+ exp

(`6 (K2 +K1/2) + `4K3

)Φ

(−`

3 (K1 +K2) + `K3√K1

). (5.5)

Analysis of MANA 11

5.2. Optimal scaling. Following [4] we use the maximum square-jump distanceS, also known as the first-order efficiency [31], as the criterion for efficiency of Markovchains. By definition, the square-jump distance for a component xi of the currentstate x is specified by

S = E[(xi − x∗i )

2],

where x∗ = y with probability α (x,y). We can rewrite S as

S = E[(xi − yi)2 α (x,y)

]= E

[(xi − yi)2 1 ∧ exp (T (σ))

].

Let us define T i (σ) to be equal to T (σ) subtracting the ith summands from Cj forj = 1, . . . , 7. We have

T (σ)− T i (σ) =

6∑j=3

σjCi,j (σ) + σ7Ci,7 (σ∗) ,

which clearly converges to zero in L2 (π (x)⊗N (0, I)) as n→∞. This result togetherwith the Lipchitz continuity of 1 ∧ exp (·) [30] and Cauchy-Schwarz inequality showsthat

limn→∞

n1/3E[(xi − yi)2

(1 ∧ exp (T (σ))− 1 ∧ exp

(T i (σ)

))]= 0. (5.6)

Next, we decompose n1/3S as follows

n1/3S = n1/3E[(xi − yi)2 1 ∧ exp

(T i (σ)

)]︸︷︷︸

S1

+ n1/3E[(xi − yi)2

(1 ∧ exp (T (σ))− 1 ∧ exp

(T i (σ)

))]︸︷︷︸

S2→0 by (5.6)

.

Then,

limn→∞

S1 = `2 limn→∞

Exi

[− λ2ig2 (xi/λi)

]E[1 ∧ exp

(T i (σ)

)]= `2 lim

n→∞Exi


]a (`) ,

due to (5.6) and the definition of a (`) in (5.5). Therefore, we arrive at

S = Exi


]`2a (`)n−1/3 + o

(n−1/3

). (5.7)

Let us now summarize the above analysis.Theorem 5.3. Let the proposal variance be σ2 = `2n−1/3 and assume that C4

converges to `4K3 in L2 (π (x)⊗N (0, I)), the asymptotic average acceptance ratea (`) is bounded away from zero and is given by (5.5). Furthermore, the square-jumpdistance S scales like (5.7) as n → ∞, and in particular S is maximized if `2a (`)attains its maximum.

Remark 5.4. Ideally, one first seeks `∗ to maximize `2a (`), then obtain theoptimal acceptance rate as a (`∗). Unlike the theories for RWMH [15,30] and MALA


[31], the dependence of the average acceptance rate (5.5) on four parameters `,K1,K2

and K3 prevents us from doing so. Our numerical results, however, suggest that ` beof order 1, as we shall show.

Clearly, when f is a Gaussian our MANA algorithm becomes a preconditionedMALA with the following preconditioned matrix

A = diag(λ21, . . . , λ

2n

). (5.8)

In this case, 2K2 + K1 = 0 holds and K3 = 0. Hence, the asymptotic averageacceptance rate is 0.574 and the efficiency is the same as that of the i.i.d. case (λi = 1).In other words, the inefficiency factor, see [20] for example, is unity even though thetarget (5.1) is heterogeneous. It is the introduction of the inverse of Hessian as thepreconditioned matrix that places different components on the same scale as if thetarget were an i.i.d. distribution. It should be further pointed out that, in practice,however, it would not be possible to determine the variances λ2i , for i = 1, . . . , n,exactly due to the complexity of the target distribution π (x). Alternatively, one canestimate them using a pilot MCMC run [6], but it could be costly. On the contrary,our choice of the preconditioner A in (2.2) explores the structure of the underlyingπ (x) and implicitly computes the hidden scalings λ2i exactly in an automatic manner.Of course, extra cost arises due to the computation of the Hessian-vector product,and we will show how to carry out this task efficiently.

One can remove the assumption on the convergence of C4 in L2 (π (x)⊗N (0, I))by taking smaller proposal variance, e.g., σ2 = `2n−1/2. In this case, one can showthat C3, C5, C6 and C7 converge to zero in L2 (π (x)⊗N (0, I)), while C4 convergesto `4E [C·,4] in L2 (π (x)⊗N (0, I)). As a result, T (σ) converges to N

(`4E [C·,4] , 0

).

This, together with the second assertion in Lemma 5.1, implies

a (`) =

1 if E [C·,4] > 0exp

(`4E [C·,4]

)otherwise

,

which is clearly bounded away from zero. Nevertheless, smaller proposal variancemeans less efficient algorithm as argued above.

5.3. Scaling limit in transient phase. We have assumed that the MCMCchain ideally starts at stationary and then studied the proposal variance (scaling) asthe dimension n approaches infinity. All the above results are therefore only valid forchains after the burn-in period. A question immediately arise is what the scaling limitis in the transient phase. An attempt to answer this question has been carried outin [12] for standard Gaussian target π (x), in which the authors show that the scalinglimit for RWMH is σ2 = O

(n−1

)while it is σ2 = O

(n−1/2

)for MALA. Consequently,

the MALA proposal variance must be smaller in the transient phase as compared tothat in the stationary phase; otherwise, the chain would take a long time to escapethe transient period.

The MANA method, when applied to standard Gaussian targets, is identical tothe MALA, and hence all the theoretical results for MALA in [12] are carried overfor MANA. In particular, as we shall demonstrate numerically, it is necessary to usesmall proposal variance σ2 = O

(n−1/2

)in the transient phase and then switch to

larger one σ2 = O(n−1/3

)in stationary for efficiency.

5.4. Scaling limit for non-product targets. Since the fundamental worksof Roberts et al. [30–32] on the optimal scalings for RWMH and MALA for producttargets, the analysis has been extended to various more general and practical targets

Analysis of MANA 13

including product targets with heterogeneity scales [2], non-product elliptical targets[39], non-product targets resulting from nonlinear regression problems [5], and non-product targets coming from nonlinear statistical inverse problems [3], to name afew.

Our interest is on Bayesian inversion problems in which the target density is theposterior obeying the following change of measure formula

dπ (x)

dπ0 (x)∝ exp (−Φ (x)) ,

where π0 (x) is the prior density having the product structure (5.1), and exp (−Φ (x))the likelihood. Following [3, 4], let us consider a variant of MANA for π (x) with thefollowing proposal

y = x+σ2

2A∇ log (π0 (x)) + σN (0,A) , (5.9)

with A being the Hessian of − log (π0 (x)), namely,

A = diag[λ21, . . . , λ

2n

].

Note that we have used the gradient and the Hessian of the logarithm of the priorinstead of the posterior to define the MANA. The following scaling limit result isimmediate.

Lemma 5.5. Let the proposal variance be σ2 = `2n−1/3, and Φ (x) be uniformlybounded from above, i.e.,

supx∈Rn

|Φ (x)| <∞.

Then, the asymptotic average acceptance rate a (`) is bounded away from zero.Proof. The above MANA variant coincides with a method considered in [3, 4],

and hence the proof follows.Note that the above result, in some sense, is the best result that we can ask

for using only the prior information and the boundedness of the Φ (x) (the misfitfunctional as we shall show). Using the Hessian and gradient of the logarithm of theposterior is desirable, but this is computationally expensive in the context of inverseproblems governed by partial differential equations (PDEs), yet whether the scalinglimit is improved is not clear [4]. On the other hand, the assumption on productstructure of the prior is not a limitation if one is working with Gaussian measureprior [41]. It is this prior product structure and the change of measure nature of theBayesian methodology that allows the scaling limit results on product targets to holdfor complicated Bayesian posterior densities.

6. Numerical experiments. We compare the simplied manifold MALA [16](abbreviated in this paper as mMALA), the Hessian-based Metropolis-Hasting (HMH)[28], the un-globalized1 stochastic Newton method (USN) [22], and the MANA methodfor several examples. Choosing a reliable comparison criteria for different MCMCmethods is not a trivial task. Indeed, one can use either of 13 convergence diagnosticsreviewed in [13], but none of them assure with certainty that the Markov chain is rep-resentative of the stationary distribution π (x). For simplicity, we run each algorithm

1One of the ongoing researches [21] is to globalize the stochastic Newton that used in [22].


until it converges (by inspection), and then follow [16] to use the effective sample size(ESS) as the comparison criteria.

For completeness, let us recall the proposal of all methods here. The mMALAproposal is given by

ymMALA = x+σ2

2A (x)∇ log (π (x)) + σN (0,A (x)) ,

where A (x) is the expected Fisher information matrix at x. The HMH proposal isdefined as

yHMH = x+ γH−1 (x)∇ log (π (x)) +N(0,H−1 (x)

),

where H (x) is the Hessian matrix of − log (π (x)) at x, and γ is the learning rate, auniform random number with values in [0, 1]. The USN algorithm, on the other hand,specifies the proposal as

yUSN = x+H−1 (x)∇ log (π (x)) +N(0,H−1 (x)

),

where H (x) is the Hessian matrix of − log (π (x)) at x. Finally, our MANA methoduses the following proposal

yMANA = x+σ2

2A∇ log (π (x)) + σN (0,A) ,

whereA is the inverse of the Hessian matrixH = −∇2 log (π (x)) at x. One of the keydifferences between MANA and the others is that A, though a function of the currentstate x, is freezed during the computation of the acceptance rate. That is, in theMANA approach we use the same Hessian matrix H (x) to compute the probabilityof going from x to y, namely q(x,y), and from y to x, namely q (y,x), whereaseither the Hessian H or the expected Fisher information matrix A is recomputed aty for q (y,x) in other methods. To further demonstrate the advantage of freezing theHessian in the MANA method, we consider its natural variant, to be called MANAm,in which the Hessian is recomputed at y for q (y,x). The MANAm uses the followingproposal

yMANAm = x+σ2

2H−1 (x)∇ log (π (x)) + σN

(0,H−1 (x)

), (6.1)

with H (x) = −∇2 log (π (x)). If one performs a similar optimal scaling analysisusing the same technique as in Section 5, he will find that the optimal scaling in thiscase is σ2 = `2n−1 since only C0 = 0 in (5.3). So freezing the Hessian during theMetropolized step in the MANA approach is both theoretically and computationallyadvantageous.

Note that we choose to compare the proposed MANA approach with mMALA,HMH, USN, and MANAm because they are similar in the sense that all ignore theterm involving the change of curvature in (1.5). One can also view this comparisonas a comparative study of five different local Hessian-aware preconditioned Langevinmethods.

6.1. Verifying the MANA optimal scaling for i.i.d. targets. Before com-paring MANA with other methods, let us examine the theoretical prediction on thestationary optimal scaling σ2 = `2n−1/3 for standard normal distributions as n in-creases. For standard normal distribution, the optimal scaling for MANA coincides

Analysis of MANA 15

with that of MALA, namely, σ2 = 1.652n−1/3 and the optimal acceptance rate isabout 0.574. Here, ` = 1.65 is the stationary value for standard Gaussian target com-puted from the MALA optimal scaling theory [31]. We consider two cases; Case I withproposal variance of σ2 = 1 and Case II with proposal variance of σ2 = 1.652n−1/3.For both cases and for all dimensions considered, we run the MANA MCMC ten timeseach with 5000 samples starting at stationary and present the average number of ac-cepted proposals over ten runs in Table 6.1. As can be observed, Case I with fixedproposal variance has the acceptance rate converging to zero as the dimension n in-creases, while Case II with the optimal scaling of σ2 = 1.652n−1/3 has the acceptancerate converging to 2887/5000 ≈ 0.574 as predicted by the theory.

Table 6.1MANA accepted proposals as the dimension n increases for standard normal distributions.

dimension n1 10 100 200 500 100000

CaseI 4614 3494 1075 397 21 0II 3361 2906 2896 2884 2863 2887

We next consider a non-Gaussian target, but still i.i.d., given by

g (xi) =

− 1

2x2i − 0.1 exp

(− 1

2x2i

)if xi ≥ 0

− 12x

2i if xi < 0

. (6.2)

For this example, we shall compare the acceptance rate of all Hessian-aware methodsdiscussed above. For MANA, we respect our theory by taking σ2 = 1.652n1/3, where` is again taken to be the stationary value computed for standard Gaussian target,namely ` = 1.65. Note that mMALA coincides with MANA for this example. ForMANAm, we also take σ2 = 1.652n1/3. For each method, we run the correspondingMCMC simulation 10 times each with 5000 samples (excluding 1000 “burn-ins”),and then compute the average number of accepted proposals over ten runs in Table6.2. As can be seen, the acceptance rate is bounded away from zero for MANAwhile it converges to zero as the dimension n increases for MANAm and USN. Infact, using MANA, the stationary constant for standard Gaussian target ` = 1.65 isquite conservative for target (6.2) since the acceptance rate increases as the dimensiongrows. It is surprising that the average acceptance rate for HMH seems to convergeto 0.37, and this begs for future investigation on the HMH method. Again, the onlydifference between MANA and MANAm is that MANA freezes the Hessian evaluatedat the current state x while MANAm re-computes the Hessian at the proposal y.The numerical results agree with the theory that freezing the Hessian during theMetropolized step is advantageous both theoretically and numerically.

6.2. Bayesian logistic regression. In this section, we study the Bayesian logis-tic regression example using five data sets—Pima Indian, Australian credit, Germanncredit, Heart, and Ripley—considered in Section 7 of [16]. The target density is givenby

π (x) ∝ exp

(fT t−

n∑i=1

log (1 + exp (fi))−1

2α2‖x‖2

),

where α = 100, f = Wx; W ∈ Rn×n and t ∈ Rn take different values for differentdata sets.


Table 6.2Accepted proposals for different Hessian-aware methods as the dimension n increases for non-

Gaussian target distribution (6.2).

dimension n100 500 1000 2000 10000 100000

Methods

MANA 2979 3237 3449 3752 4743 5000MANAm 1970 462 85 6 0 0

USN 1837 212 23 0 0 0HMH 2048 1889 1871 1864 1878 1850

For all data sets, we perform 10 runs, each of which is with 10000 MCMC itera-tions and with different random initialization. We then discard the first 5000 iterationsas the burn-ins, and compute the minimum, mean, and maximum ESS. Let us definethe effective time to generate a sample, timeeff , as the ratio of the minimum ESS tothe overall simulation time measured by Matlab’s tic-toc command.

We choose the scaling for mMALA and MANA as follows. For mMALA, wesimply take σ2 = 1 as suggested in [16]. Our asymptotic theoretical results in Section5 suggests σ2 = `2n−1/3 for high dimensional problems. For the Bayesian logisticregression problems considered here, the highest dimension is 24 while the smallestone is 2. We therefore choose ` such that σ2 = 1 to be fair with other methods, andthis amounts to taking `2 = 2.88 for n = 24 and `2 = 1.26 for n = 2.

The results are shown in Tables 6.3–6.7. As can be observed, the USN methodseems to least efficient for all data sets except for the Pima. Recall that USN samplingis exact if the target is Gaussian. On the one hand, this suggests that the target π (x)be close to Gaussian so that USN is the most efficient. On the other hand, the resultsimply that USN is not recommended for target that is far away from Gaussian. TheMANA method seems to be most efficient in general. In addition, MANA methodin general takes less time than other methods since we compute the Hessian and itsinverse only for accepted proposal y.

Table 6.3Comparison of MANA, mMALA, MANAm, USN, and HMH on the Bayesian logistic regression

example using Ripley data set (n = 2).

Method Time (s) ESS (min, mean, max) timeeff speed

MANA 2.1147 (372, 656, 904) 0.0057 4.21mMALA 2.057 (265, 386, 504) 0.0078 3.08MANAm 2.4154 (239, 361, 477) 0.0101 2.38HMH 2.325 (238, 318, 401) 0.0098 2.45USN 2.3986 (100, 180, 280) 0.024 1.0

In order to access the level of mixing and correlation of each Markov chain westudy the typical trace plot of the first component x1 of the random vector x andthe corresponding average autocorrelation function (ACF) over 10 runs for each ofthe method. The results for other components show similar behavior and henceomitted. We present the trace plot and average ACF for the German data—thehighest dimension case—in Figure 6.1. As can be seen, MANA chain seems to mixbetter than the others and its samples are least correlated. Chains generated bymMALA method are generally better than those of USN and HMH methods, whileHMH and USN chains seem to be comparable to each other. These observations are

Analysis of MANA 17


example using Pima data set (n = 7).




example using Heart data set (n = 13).




example using Australian data set (n = 14).




example using German data set (n = 24).



consistent with the above comparison using ESS.

6.3. Log-Gaussian Cox point process. Our next example is the log-GaussianCox point process studied in [25] and then used in [16] to compare various MCMCmethods. We use the same data and Matlab code for the simplified manifold MALAas provided in [16]. In particular we would like to make inference on the n = 4096


1000 2000 3000 4000 5000−1.5

−1

trace

MA

NA

0 5 10 15 20−1

0

1

Lag

Sample Autocorrelation Function (ACF)

1000 2000 3000 4000 5000−1.5

−1

mM

ALA

0 5 10 15 20−1

0

1

Lag

1000 2000 3000 4000 5000−1.5

−1

MA

NA

m

0 5 10 15 20−1

0

1

Lag

1000 2000 3000 4000 5000−1.5

−1

HM

H

0 5 10 15 20−1

0

1

Lag

1000 2000 3000 4000 5000−1.5

−1

SN

0 5 10 15 20−1

0

1

Lag

Fig. 6.1. Typical trace plot of the first random variable x1 and the average autocorrelationfunction using German data set (n = 24) for MANA, mMALA, MANAm, USN, and HMH methods.

latent variables x from the following target density

π (x) ∝ exp(tTx−m exp

(xT1

)− (x− µ1)

TΣ−1 (x− µ1)

),

where Σij = γ2 exp (−δ (i, i′, j, j′) /(64β)), δ (i, i′, j, j′) =√

(i− i′)2 + (j − j′)2, m =

1/4096, β = 1/33, γ2 = 1.91, µ = log(196)−γ2/2, and 1 = (1, . . . , 1)T

. For all Markovchains considered in this section, the starting state is x0 = µ1, unless otherwise stated.

We first attemp to run USN and HMH methods on π (x), but all the proposalsrejected. A natural idea borrowed from optimization literature is to truncate theNewton step if the target is locally very different from the Gaussian. Doing so wouldavoid the risk of being trapped in regions of low probability density due to full Newtonsteps [28]. Unfortunately, no matter how small the truncated Newton step is—in factup to machine zero step size is used—both USN and HMH reject all the proposals.This implies that straightforward application of optimization techniques to MCMCsimulations may not be successful. The reason is that the proposal consists of twoterms. The deterministic term H−1 (x)∇ log (π (x)) attempts to drift the proposalto region of high probability density, which is most likely the case if the target is lo-cally close to Gaussian. The stochastic term N

(0,H−1 (x)

)draws Gaussian random

variables described by the covariance matrix H−1 (x). Consequently, the proposal isrejected if either of the terms is unreasonable. The truncating strategy adjusts theformer to make it reasonable, but the latter needs to be adjusted as well if the targetis far from Gaussian. With this in mind and following the spirit of mMALA and

Analysis of MANA 19

MANA methods, we consider the following modification of the stochastic Newton(and hence HMH), abbrebriated as USNm,

yUSNm = x+ σ2H−1 (x)∇ log (π (x)) +N(0, σ2H−1 (x)

). (6.3)

This seems to be sensible since if one believes that the deterministic Newton stepshould be cut down due to non-Gaussianity of the target, the variance of the Gaus-sian term should be reduced too. The modification can be also viewed as the Euler-Maruyama discretization, with step size ∆t = σ2, of the following stochastic differen-tial equation

dx(t) = H−1 (x)∇ log (π (x)) dt+

√H−1 (x)dW (t),

for which it is not clear what the invariant density is even when H−1 (x) is a constantmatrix. On the other hand, it should be pointed out that USNm is almost identical toMANAm except for the unity factor in the drift term instead of 1

2 . Nevertheless, when

H−1 (x) is constant the stochastic differential equation corresponding to the MANAmapproach has π (x) as its invariant distribution; this seems to make MANAm chainsbetter than USN chains, as the following numerical results show.

Note that if one performs a similar optimal scaling analysis for the USN methodgiven by (6.3) using the same technique as in Section 5, he will find that the optimalscaling in this case is σ2 = `2n−1 since only C0 = 0 in (5.3). That is, USNm has asimilar asymptotic optimal scaling to that of MANAm, and hence both are generallyless efficient than MANA as the numerical results have confirmed.

We now attempt to run USNm using the proposal variance of 0.07, a default valuefor mMALA provided in the Matlab code of [16]. We observe that USNm rejects allthe proposal steps. On the other hand, if we use MANAm with σ2 = 0.07, the averageacceptance rate for each 50 MCMC steps is observed to be in between 0.46 and 0.8.This shows the importance of the factor 1

2 in the drift term of MANAm (6.1) asopposed to unity in the USNm approach (6.3). That is, convergence to the targetdensity on the differential level seems to facilitate the convergence on the discretelevel via MH algorithm.

In Figure 6.2 we present, in the first row from left to right, the true latent field,the true process, and the observed data to infer the latent field, respectively. Onthe third row we show the sample posterior mean, process, and variance, from left toright respectively, for the MANA method with σ2 = 0.07 (this is equivalent to taking`2 = 1.12), denoted as MANA1, with 5000 samples after discarding 1000 samples asburn-ins. We show similar results for MANAm and mMALA in the fourth and thefifth row. As can be seen, the posterior mean and process are almost identical forMANA1, MANAm, and mMALA. The variances are large at the top right corner sincethe data is very sparse there, and the MANAm result seems to have less variabilityelsewhere compared to others.

It turns out that proposal variance of 0.07 is quite small for the MANA methodsince the acceptance rate for each 50 MCMC steps is greater than 0.8 (in fact is isgreater than 0.9 most of the time). This suggests that we should use bigger proposalvariance to explore the stationary distribution faster. We choose σ2 = 0.12 to havethe average acceptance rate between 0.6 and 0.8, this corresponds to `2 = 1.9. Wedenote this particular combination as MANA2 and show the results in the secondrow of Figure 6.2. As can be observed, while the sample posterior mean and processare similar to other methods, there is less variability away from the top right corner


Tru

e la

ten

t fie

ldM

AN

A2

0.1

0.2

0.3

MA

NA

1

0.10.150.20.25

MA

NA

m

0.1

0.2

0.3

mM

AL

A

0.10.150.20.25

Fig. 6.2. Log-Gaussian Cox process example with MANA1 (σ2 = 0.07), MANA2 (σ2 = 0.12),MANAm (σ2 = 0.07), and mMALA (σ2 = 0.07). Here, the sample posterior means, processes, andvariances using 5000 MCMC samples after discarding 1000 samples as burn-ins.

compared to MANA1. Note that σ2 = 0.12 is too big for MANAm and mMALAmethods, and hence the resulting MCMC chains explore the invariant density veryslowly due to very small acceptance rate, as shown in Figure 6.3. In particular,the sample posterior, process, and variance of MANAm are completely wrong whilemMALA has large variance in patchy areas, and both are due to high correlations inthe samples.

Next, we consider the transient behavior of the scaling limit. Recall the result

Analysis of MANA 21M

AN

A2

0.1

0.2

0.3

MA

NA

m

0.005

0.01

0.015

0.02

0.025

mM

ALA

0.5

1

1.5

Fig. 6.3. Log-Gaussian Cox process example with MANA2 (σ2 = 0.12), MANAm (σ2 = 0.12),and mMALA (σ2 = 0.12). Here, the sample posterior means, processes, and variances using 5000MCMC samples after discarding 1000 samples as burn-ins.

in Section 5.3 that the scaling needs to be σ2 = O(n−1/2

)in the transient phase.

Theoretical results for standard Gaussian target in [12] show that the optimal scalingin stationary is σ2 = `24096−1/3 = 0.17 where `2 = 1.65, while it is σ2 = `24096−1/2 =0.03125 where `2 = 2 for transient phase. We use the following two starting states

1. x0 = µ1, and2. x0 = a, where a solves the equation

ti − exp (ai)− (ai − β) /γ2 = 0,

for which a is near the posterior mode.For these two starting states, the mMALA (which becomes MALA in this case) rejectsall the proposals using the stationary scaling σ2 = 0.17. We observe the same problemfor USN and MANAm. The MANA method, while rejecting all the proposals for thesecond starting state, converges to stationary after about 1400 iterations using the firststarting state as shown in Figure 6.4(a). As opposed to mMALA, USN, and MANAm,the MANA method is able to find the stationary region and then converges quickly,though staying put for a large number of initial iterations with a few unpredictablejumps (see [12] for a similar behavior when using MALA on a standard Gaussiantarget).

Now let us use the transient scaling σ2 = 0.03125, the USN method again rejectsall the samples while the others converge very fast to equilibrium after less than 100iterations using the first starting state. The second starting state is, however, moredifficult for all the methods. Numerical results show that both USN and MANAm alsofail to converge to the equilibrium though MANA and mMALA do converge. The


convergence of MANA and mMALA chains is shown in Figures 6.4(b) and 6.4(c),for which the trace of − log (π (x)) is plotted after 100 iterations. As can be seen,the MANA chain seems to get to the stationary faster. Both of the chains have theaverage acceptance rate about 0.96, which is expected since the transient scaling istoo conservative for the stationary phase.

0 1000 2000 3000 4000 50000

200

400

600

800

1000

1200

1400

1600trace

MA

NA

(a) MANA, σ2 = 0.17

500 1000 1500 2000 2500 3000 3500 4000 4500 5000−600

−500

−400

−300

−200

−100

0

100

200

300trace

MA

NA

(b) MANA, σ2 = 0.03125

500 1000 1500 2000 2500 3000 3500 4000 4500 5000−600

−500

−400

−300

−200

−100

0

100

200

300trace

mM

ALA

(c) MALA, σ2 = 0.03125

Fig. 6.4. The trace of − log (π (x)) for the Log-Gaussian Cox process example using a) MANAwith σ2 = 0.17 and the first starting state, b) MANA with σ2 = 0.03125 and the second startingstate, and c) mMALA with σ2 = 0.03125 and the second starting state.

Note that the severe converging problem in transient period with starting pointclose to the posterior mode is not surprising since it has already predicted by thetheory [12].

6.4. Inverse thermal fin problem. In this section, we consider a thermal finproblem adopted from [7, 27] with geometry given in Figure 6.5. The temperatureu distributed on the thermal fin Ω is governed by the following partial differentialequation

−∇ · (eγ∇u) = 0 in Ω, (6.4a)

−eγ∇u · n = Biu on ∂Ω \ ΓRoot, (6.4b)

−eγ∇u · n = −1 on ΓRoot, (6.4c)

where ΓRoot is the root of the thermal fin, Bi the Biot number, γ the logarithm ofthe conductivity, and n the unit normal vector.

−3 −0.5 0.5 3

0

0.75

1

1.75

2

2.75

3

3.75

4

γ1

γ2

γ3

γ4

γ0

↑Root

Fig. 6.5. A finite element mesh of thethermal fin and its dimensions.

We discretize (6.4) using the first orderstandard conforming finite element methodon the mesh in Figure 6.5 so that the dis-cretized equation has the form

A (γ)U = F, (6.5)

where U ∈ R1333 is the vector of nodal valuesof u, F ∈ R1333 the right-hand side result-ing from the boundary condition (6.4c), andA (γ) ∈ R1333×1333 the stiffness matrix. Sim-ilar to [27], we set γ = (γ0, γ1, γ2, γ3, γ4) ∈R5, where exp (γ0) is the thermal conductiv-ity of the base Ω0 = [−0.5, 0.5] × [0, 4] andexp (γi), i = 1, . . . , 4, the thermal conduc-tivities of the fins, respectively, as shown inFigure 6.5.

Analysis of MANA 23

The temperature is measured at 18 observation locations xobsi specified by redcircles in Figure 6.5 by the following additive noise model

uobsi = u(xobsi ) + ηi, , i = 1, . . . , 18,

with ηi ∼ N(0, 0.0012

)denoting the i.i.d. Gaussian noise. Thus, the likelihood model

can be written as

π(uobs|γ

)= N

(uobs, 0.0012I18

),

with I18 denoting the 18×18 identity matrix. To avoid the inverse crime, we synthesizethe observations uobs with

γ = (0, 0.25, 0.5, 0.75, 1) (6.6)

on a mesh that is twice finer than that given in Figure 6.5.We would like to solve the inverse problem of finding γ given uobs using the

Bayesian framework. To this end, we need to specify a prior distribution πprior (γ) forγ, and we choose

πprior (γ) ∝ exp

(−β

2γTγ

),

with β = 0.1 as the regularization constant.The Bayesian posterior solution is therefore given by

πpost(γ|uobs

)∝ exp

− 1

2× 0.0012

18∑i=1

[u(xobsi

)− uobsi

]2 − β

2γTγ

, (6.7)

where u(xobsi

)are nodal solutions extracted from the solution of (6.5). Using the

notation in Section 5.4, we have

Φ (γ) =1

2× 0.0012

18∑i=1

[u(xobsi

)− uobsi

]2.

Now, we attempt to apply the simplified MANA method (5.9) to the Bayesianposterior solution (6.7). Since this is a five-dimensional problem, it is not clearwhether the asymptotic optimal scaling result applies or not. The initial state γ0 =(0, 0, 0, 0, 0) is used. Let us take a conservative step σ2 = 2.23e − 06 so that theaverage acceptance rate is about 0.234, which is the best value for RWMH. We thenrun 105000 MCMC simulations (arbitrarily chosen) and discard the first 100000 runsas “burn-ins”. We denote this experiment as SimMANA1. On the first row of Figure6.6, we show the mean of γ, the trace plot of γ, and the autocorrelation function forγ5, respectively. As can be seen, the sample mean seems to be very accurate com-pared to the exact solution (6.6). However, it is not clear in the trace plot whetherthe MCMC chain has already converged. Moreover, the correlation length is large.Let us now take σ2 = 3.72e− 07 so that the average acceptance rate is about 0.573,which is the best value for the Langevin MCMC. We again discard the first 100000runs as “burn-ins” and denote this experiment as SimMANA2, for which we obtainsimilar results on the second row of Figure 6.6. As can be seen, the results do not seemto be better than those of SimpMANA1. Note that this does not conflict with the


discussion in Section 5.4. First, we have not yet been able to show the boundedness ofΦ (γ), and hence Lemma 5.5 may not be valid. Second, even when Lemma 5.5 holds,the result applies only for stationary phase, and it is not clear in Figure 6.6 when thechains get to stationary. This is a manifestation of the difficulty of inverse problemsand care must be taken when using MCMC.

1 2 3 4 5

0

0.25

0.5

0.75

1

Sim

MA

NA

1

Average of γi

1000 3000 5000

0

0.5

1

trace of γi

0 10 20−1

0

1

Lag

ACF of γ5

1 2 3 4 5

0

0.25

0.5

0.75

1

Sim

MA

NA

2

1000 3000 5000

0

0.5

1

0 10 20−1

0

1

Lag

1 2 3 4 5

0

0.25

0.5

0.75

1

MA

NA

1000 3000 5000

0

0.5

1

0 10 20−1

0

1

Lag

Fig. 6.6. Inverse thermal fin example with SimpMANA1 (σ2 = 2.23e − 06), SimpMANA2(σ2 = 3.7181e−07), and MANA (σ2 = 1.17). Here, we discard the first 100000 samples as burn-insfor both SimpMANA1 and SimpMANA2, but no burn-in for MANA.

Since this is a low dimensional problem with 5 parameters, it is feasible for us tocompute the Hessian of the posterior (6.7) exactly. Indeed, using an adjoint methodwe can compute the gradient ∇ log (πpost) by solving an adjoint of forward equation(6.4). On the other hand, the action of the Hessian ∇2 log (πpost) on a vector canbe computed efficiently by solving a pair of incremental forward and incrementaladjoint equations (see e.g. [8]). Therefore, computing the full Hessian amounts toacting ∇2 log (πpost) on the 5 × 5 identity matrix. We take `2 = 2, and hence σ2 =`2n−1/3 = 1.17. The sample mean, trace plots, and ACF of γ5 using a MANA chainwith 5000 samples with zero burn-in are shown on the third row of Figure 6.6. Theaverage acceptance rate in this case is about 0.74. As can be observed, the chainmixes very well and has correlation length about 4. This implies that using the actualgradient and Hessian of the posterior for the MANA method leads to MCMC chainswith faster convergence and excellent mixing.

We compute the sample variances and use them as uncertainty estimates asso-

Analysis of MANA 25

ciated with the sample mean and show them on the first row of Table 6.8. Theuncertainty for each thermal conductivity seems to be small and is about the sameorder as the noise standard deviation. This is not surprising since we have moreobservations (18) than the number of unknown parameters (5). This is most likelyan over-determined case, for which a linear theory in [41] shows that the posteriordistribution converges to the Dirac delta distribution concentrated at the maximuma posterior point in the limit of no noise. Let us confirm this numerically for ournonlinear inverse problem by additionally taking the noise standard deviation to be1.e − 4 and 1.e − 5. As can be observed in Table 6.8, the estimated uncertaintiesconverges to zero as the noise level diminishes.

Table 6.8Uncertainty estimates for the conductivity γ.

Thermal conductivityγ1 γ2 γ3 γ4 γ5

noise deviation1.e− 3 0.47e-3 1.12e-3 2.17e-3 1.55e-3 0.01e-31.e− 4 0.07e-4 0.12e-4 0.32e-4 0.17e-4 0.01e-41.e− 5 0.25e-5 0.04e-5 1.00e-5 0.17e-5 0.09e-5

7. Conclusions. Based on a connection between Euler-Maruyama discretiza-tion of the Langevin diffusion on Riemann manifold and gradient-based optimizationtechniques we have proposed a Metropolized adjusted Newton algorithm (MANA)algorithm for local Metropolis-Hastings Markov chain Monte Carlo simulation. TheMANA proposal consists of a deterministic and a stochastic parts. The former cor-responds to a Newton step which attempts to drift the current state to region ofhigher probability, and hence potentially increasing the acceptance probability. Thelatter is distributed by a Gaussian tailored by the local Hessian as the inverse covari-ance matrix. The proposal step is then corrected by the standard Metropolization toguarantee that the target density is the stationary distribution of the MANA chain.

The MANA method can be considered as an Euler-Maruyama discretization ofthe Langevin diffusion on Riemann manifold with piecewise constant Hessian of thenegative logarithm of the target density as the metric tensor. Its distinct featureis that the Hessian evaluated at a current state is freezed during the Metropoliza-tion. This facilitates not only theoretical analysis but also computation as we haveshown. We have provided a quite complete picture of the MANA method by study-ing its asymptotic convergence, geometric ergodicity, optimal scaling analysis in bothstationary and transient phases, and an extension to Bayesian inverse problems. Nu-merical results on various examples have confirmed our theoretical findings. We alsocompare the MANA method against other Hessian-aware approaches, and the numer-ical results show that the MANA method in general outperforms the others in givingMarkov chains with small burn-in and small correlation length. We have applied theMANA MCMC method to perform a simple uncertainty quantification on an inversethermal problem with five parameters.

Nevertheless, there are a few issues that need to be addressed in future studies.On the theoretical side, even though the numerical results suggest that the constant` in the optimal proposal scaling σ2 = `2n−1/3 be of order 1, we are not able toshow this rigorously except for i.i.d. Gaussian targets. Extension to Bayesian inverseproblems using the actual gradient and Hessian of the posterior density remains anopen question though numerical results support this idea. On the computational


side, computing the full Hessian is infeasible for large scale problems such as inverseproblems governed by PDEs, since it amounts to solving a pair incremental forwardand adjoint PDEs n times. Even so, the Hessian may not be always positive definite.In this case, one can replace the negative spectrum with a threshold as suggested in[22]. Alternatively, one can approximate the Hessian of the data misfit Φ (x) using itsGauss-Newton approximation, which is always semi-positive. Since the Gauss-NewtonHessian is typically a compact operator [9, 10], one can use low rank approximationtechniques to approximate it efficiently, and this is under investigation. Another wayto deal with the indefiniteness of the Hessian is to approximate it using a trust regionidea. Some work in this direction has been carried out [18,28], though a complete andeffective method still requires further study. Note that using any approximation tothe actual Hessian breaks down the theory and a new theory is necessary to establish(possibly new) optimal scaling for MANA chains.

REFERENCES

[1] Yves F. Atchade, An adaptive version for the Metropolis adjusted Langevin algorithm with atruncated drift, Methodology and Computing in Applied Probability, 8 (2006), pp. 235–254.

[2] Mylene Bedard and Jeffrey S. Rosenthal, Optimal scaling of Metropolis algorithms: head-ing toward general target distributions, The Canadian Journal of Statistics, 36 (2008),pp. 483–503.

[3] Alexandros Beskos, Gareth Roberts, and Andrew Stuart, Optimal scalings for localMetropolis-Hastings chains on nonproduct targets in high dimensions, The Annals of Ap-plied Probability, 19 (2009), pp. 863–898.

[4] A. Beskos and A.M. Stuart, MCMC methods for sampling function space, in Invited Lec-tures: Sixth International Congress on Industrial and Applied Mathematics, ICIAM 2007,R. Jeltsch and G. Wanner, eds., European Mathematical Society, 2009, pp. 337–364.

[5] Laird Arnault Breyer, Mauro Piccioni, and Sergio Scarlatti, Optimal scaling of MALAfor nonlinear regression, The Annals of Applied Probability, 14 (2004), pp. 1479–1505.

[6] Stephen P. Brooks, Markov chain Monte Carlo method and its application, Journal of theRoyal Statistical Society. Series D (The Statistician), 47 (1998), pp. 69–100.

[7] Tan Bui-Thanh, Model-Constrained Optimization Methods for Reduction of ParameterizedLarge-Scale Systems, PhD thesis, MIT, 2007.

[8] T. Bui-Thanh, C. Burstedde, O. Ghattas, J. Martin, G. Stadler, and L.C. Wilcox,Extreme-scale UQ for Bayesian inverse problems governed by PDEs, in Proceedings ofSC12, 2012. submitted (to Gordon Bell Prize track).

[9] Tan Bui-Thanh and Omar Ghattas, Analysis of the Hessian for inverse scattering problems.Part I: Inverse shape scattering of acoustic waves, Inverse Problems, 28 (2012), p. 055001.

[10] , Analysis of the Hessian for inverse scattering problems. Part II: Inverse medium scat-tering of acoustic waves, Inverse Problems, 28 (2012), p. 055002.

[11] Fergal P. Casey, Joshua J. Waterfall, Ryan N. Gutenkunst, Christopher R. My-ers, and James P. Sethna, Variational method for estimating the rate of convergence ofMarko-chain Monte Carlo algorithms, Phy. Rev. E., 78 (2008), p. 046704.

[12] Ole F. Christensen, Gareth O. Roberts, and Jeffrey S. Rosenthal, Scaling limits forthe transient phase of local Metropolis–Hastings algorithms, Journal of Royal StatisticalSociety. Series B (Statistical Methodology), 67 (2005), pp. 253–268.

[13] Mary Kathryn Cowles and Bradley P. Carlin, Markov chain Monte Carlo convergencediagnostics: A comparative review, Journal of the American Statistical Association, 91(1996), pp. 883–904.

[14] Rick Durret, Probability: theory and examples, Cambidge University Press, 2010.[15] A Gelman, G. O. Roberts, and W. R. Gilks, Efficient Metropolis jumping rules, in Bayesian

Statistics 5, J. M. Bernardo, J. O. Berger, A. P. Dawid, and A. F. M. Smith, eds., OxfordUniv Press, 1996, pp. 599–607.

[16] Mark Girolami and Ben Calderhead, Riemann manifold Langevin and Hamiltonian MonteCarlo methods, Journal of the Royal Statistical Society: Series B (Statistical Methodology),73 (2011), pp. 123–214.

[17] W. K. Hastings, Monte Carlo sampling methods using Markov chains and their applications,Biometrika, 57 (1970), pp. 97–109.

Analysis of MANA 27

[18] Edward Herbst, Gradient and Hessian-based MCMC for DSGE models, (2010). Unpublishedmanuscript.

[19] Soren F. Jarner and Ernst Hansen, Geometric ergodicity of Metropolis algorithms, Stochas-tic Processes and their Applications, 85 (2000), pp. 341–361.

[20] Tristan Marshall and Gareth O. Roberts, An adaptive approach to Langevin MCMC,Statistics and Computing, (2011), pp. 1–17.

[21] James Martin, Private communication, (2012).[22] James Martin, Lucas C. Wilcox, Carsten Burstedde, and Omar Ghattas, A stochastic

Newton MCMC method for large scale statistical inverse problems with application toseismic inversion, SIAM Journal on Scientific Computing, (2012). To appear.

[23] Nicholas Metropolis, Arianna W. Rosenbluth, Marshall N. Rosenbluth, Augusta H.Teller, and Edward Teller, Equation of state calculations by fast computing machines,The Journal of Chemical Physics, 21 (1953), pp. 1087–1092.

[24] Sean Meyn and Richard L. Tweedie, Markov chains and stochastic stability, Springer-Verlag,London, 1993.

[25] Jesper Moller, Anne Randi Syversveen, and Rasmus Plenge Waagepetersen, Log GaussCox processes, Scandinavian Journal of Statistics, 25 (1998), pp. 451–482.

[26] Jorge Nocedal and Stephen J. Wright, Numerical Optimization, Springer Verlag, Berlin,Heidelberg, New York, second ed., 2006.

[27] MIT OpenCourseWare, Numerical methods for partial differential equations, 2003.http://ocw.mit.edu/OcwWeb/Aeronauticsand-Astronautics/16-920JNumerical-Methods-for-Partial-Differential-EqationsSpring2003/CourseHome/.

[28] Yuan Qi and Thomas P. Minka, Hessian-based Markov chain Monte-Carlo algorithms, inFirst Cape Cod Workshop on Monte Carlo Methods, Cape Cod, MA, USA, September2002.

[29] Christian P. Robert and George Casella, Monte Carlo Statistical Methods (Springer Textsin Statistics), Springer-Verlag New York, Inc., Secaucus, NJ, USA, 2005.

[30] Gareth O. Roberts, A. Gelman, and W. R. Gilks, Weak convergence and optimal scaling ofrandom walk Metropolis algorithms, The Annals of Applied Probability, 7 (1997), pp. 110–120.

[31] Gareth O. Roberts and Jeffrey S. Rosenthal, Optimal scaling of discrete approximationsto Langevin diffusions, J. R. Statist. Soc. B, 60 (1997), pp. 255–268.

[32] Gareth O. Roberts and Jeffrey S. Rosenthal, Optimal scaling for various metropolis-hastings algorithms, Statistical Science, 16 (2001), pp. pp. 351–367.

[33] Gareth O. Roberts and Jeffrey S. Rosenthal, General state space Markov chains andMCMC algorithms, Probability Surveys, 1 (2004), pp. 20–71.

[34] Gareth O. Roberts and Osnat Stramer, Langevin diffussions and Metropolis-Hastings al-gorithms, Methodology and Computing in Applied Probability, 4 (2003), pp. 337–357.

[35] Gareth O. Roberts and Richard L. Tweedie, Exponential convergence of Langevin distri-butions and their discrete approximations, Bernoulli, 2 (1996), pp. 341–363.

[36] , Geometric convergence and central limit theorems for multidimensional Hastings andMetropolis algorithms, Biometrika, 83 (1996), pp. 95–110.

[37] Jeffrey Rosenthal, Optimal proposal distributions and adaptive MCMC, in Handbook ofMarkov chain Monte Carlo, Steve Brooks, Andrew Gelman, Galin L. Jones, and Xiao-LiMeng, eds., CRC Press, 2011, ch. 4, pp. 93–112.

[38] P. J. Rossky, J. D. Doll, and H. L. Friedman, Brownian dynamics as smart Monte Carlosimulation, J. Chem. Phys., 69 (1978), p. 4268.

[39] Chris Sherlock and Gareth O. Roberts, Optimal scaling of the random walk Metropolison elliptically symmetric unimodal targets, Bernoulli, 15 (2009), pp. 774–798.

[40] A.M. Stuart, P. Wiberg, and J. Voss, Conditional path sampling of SDEs and the LangevinMCMC method, Communications in Mathematical Sciences, 2 (2004), pp. 685–697.

[41] Andrew M. Stuart, Inverse problems: A Bayesian perspective, Acta Numerica, 19 (2010),pp. 451–559.

[42] Luke Tierney, Markov chains for exploring posterior distributions, The Annals of Statistics,22 (1994), pp. 1701–1762.

[43] Cornelia Vacar, Jean-Francois Giovannelli, and Yannick Berthoumieu, Langevin andHessian with Fisher approximation stochastic sampling for parameter estimation of struc-tured covariance, IEEE International Conference on Acoustics, Speech and Signal Process-ing (ICASSP), (2011), pp. 3964–3967.

[44] Yichuan Zhang and Charles Sutton, Quasi-Newton methods for Markov chain MonteCarlo, in Advances in Neural Information Processing Systems, J. Shawe-Taylor, R. S.Zemel, P. Bartlett, F. C. N. Pereira, and K. Q. Weinberger, eds., vol. 24, 2011.

ices report 12-25 a metropolized adjusted newton algorithm ... · ples from target density ˇ(x),...

Documents