
Langevin MCMC: Theory and Methods
Bayesian Computation Opening Workshop

A. Durmus1, N. Brosse2, E. Moulines2, M. Pereyra3, S. Sabanis4

1ENS Paris-Saclay

2Ecole Polytechnique

3Heriot-Watt University

4University of Edinburgh


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Introduction

Sampling from distributions over high-dimensional state spaces has recently attracted a lot of research effort in the computational statistics and machine learning communities.

Applications (non-exhaustive)

1. Bayesian inference for high-dimensional models,
2. Bayesian inverse problems (e.g., image restoration and deblurring),
3. Aggregation of estimators and experts,
4. Bayesian non-parametrics.

Most of the sampling techniques known so far do not scale to high dimension. Challenges are numerous in this area.


Bayesian stuff

A Bayesian model is specified by

(i) the likelihood of the observed data: D ∼ L(·|x),
(ii) a prior distribution π0 on the parameter space x ∈ X ⊂ Rd.

The inference is based on the posterior distribution:

π(x|D) = π0(x) L(D|x) / ∫ L(D|u) π0(u) du .

In most cases the normalizing constant is not tractable:

π(x |D) ∝ π0(x)L(D|x) .


An example: Bayesian analysis of logistic regression

Likelihood: binary regression set-up in which the binary observations (responses) {Ri}_{i=1}^n are conditionally independent Bernoulli random variables with success probabilities {F(x′Zi)}_{i=1}^n, where

1. Zi is a d-dimensional vector of known covariates,
2. x is a d-dimensional vector of unknown regression coefficients,
3. F is the link function; in logistic regression, F is the standard logistic cumulative distribution function:

F(t) = e^t / (1 + e^t)


An example: Bayesian analysis of logistic regression

Possible choices for the prior (tons of theory behind these, this is not black magic!)

Gaussian prior: π0(x) ∝ exp(−(1/2) x′Σ⁻¹x)

LASSO prior: π0(x) ∝ exp(−∑_{i=1}^d αi |xi|)

Many more sophisticated ones (spike-and-slab, horseshoe, ...) and other global-local shrinkage priors...

The posterior of x is, up to a proportionality constant, given by

π(x|(R,Z)) ∝ π0(x) ∏_{i=1}^n F(x′Zi)^{Ri} (1 − F(x′Zi))^{1−Ri}
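To connect this model with the gradient-based samplers discussed below, here is a minimal sketch (not from the slides) of the potential U(x) = −log π(x|R,Z) and its gradient for the logistic likelihood with a Gaussian prior; the arrays Z, R and the prior variance are placeholder inputs.

```python
import numpy as np

def sigmoid(t):
    # standard logistic cdf F(t) = e^t / (1 + e^t)
    return 1.0 / (1.0 + np.exp(-t))

def potential_and_grad(x, Z, R, sigma2_prior=10.0):
    """Negated log-posterior U(x) and its gradient for Bayesian logistic
    regression with a Gaussian prior N(0, sigma2_prior * I).
    Z: (n, d) covariates, R: (n,) binary responses, x: (d,) coefficients."""
    logits = Z @ x                                 # x' Z_i for each observation i
    # negated log-likelihood: sum_i [ log(1 + e^{logit_i}) - R_i * logit_i ]
    nll = np.sum(np.logaddexp(0.0, logits) - R * logits)
    prior = 0.5 * np.dot(x, x) / sigma2_prior      # negated log-prior, up to a constant
    grad = Z.T @ (sigmoid(logits) - R) + x / sigma2_prior
    return nll + prior, grad
```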


Bayesian statistics are out of the game ?

While optimization-based algorithms for the extremely popular Lasso and elastic net procedures can scale to dimensions in the hundreds of thousands, corresponding Bayesian methods that use Markov chain Monte Carlo (MCMC) for computation are limited to problems at least an order of magnitude smaller.

The promise of scaling MCMC to large datasets has thus far not been realized, owing to the expensive computations often involved in MCMC algorithms and slow mixing of the corresponding Markov chains.

(long term) Objective:

- Contribute to filling the gap between optimization and simulation.
- Develop sampling methods which can provably work in high dimension (more precise statements later).

For logistic regression with the horseshoe prior: ongoing work with James Johndrow.


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Framework

Denote by π a target density w.r.t. the Lebesgue measure on Rd, known up to a normalisation factor:

x ↦ e^{−U(x)} / ∫_{Rd} e^{−U(y)} dy ,

Assumption: U is L-smooth, i.e. continuously differentiable and there exists a constant L such that for all x, y ∈ Rd,

‖∇U(x) − ∇U(y)‖ ≤ L ‖x − y‖ .

Note: this condition can be removed by taming the gradient (Brosse, Durmus, M., Sabanis, 2018, to appear in SPA 2019).


(Overdamped) Langevin diffusion

Langevin SDE:

dYt = −∇U(Yt) dt + √2 dBt ,

where (Bt)t≥0 is a d-dimensional Brownian motion.

π(x) ∝ exp(−U(x)) is the unique invariant probability measure.

Notation: (Pt)t≥0 is the Markov semigroup associated with the Langevin diffusion, Pt f(x) = Ex[f(Yt)].


Discretized Langevin diffusion

Idea: Sample the diffusion paths, using the Euler-Maruyama (EM) scheme:

Xk+1 = Xk − γk+1 ∇U(Xk) + √(2γk+1) Zk+1 ,

where

- (Zk)k≥1 is an i.i.d. N(0, Id) sequence,
- (γk)k≥1 is a sequence of stepsizes, which can either be held constant or be chosen to decrease to 0 at a certain rate.

Closely related to the gradient descent algorithm.

This algorithm is referred to as the Unadjusted Langevin Algorithm (ULA) in Bayesian statistics or Langevin Monte Carlo (LMC) in machine learning.
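A minimal sketch of the ULA/LMC recursion with constant stepsize, written generically in terms of a user-supplied gradient ∇U (function and parameter names are illustrative, not from the slides):

```python
import numpy as np

def ula(grad_U, x0, gamma, n_iter, rng=None):
    """Unadjusted Langevin Algorithm with constant stepsize:
    X_{k+1} = X_k - gamma * grad U(X_k) + sqrt(2 * gamma) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        x = x - gamma * grad_U(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples

# example: approximately sample from N(0, I_5), i.e. U(x) = ||x||^2 / 2, grad U(x) = x
chain = ula(lambda x: x, x0=np.zeros(5), gamma=0.05, n_iter=10_000)
```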


Discretized Langevin diffusion: constant stepsize

When the stepsize is held constant, i.e. γk = γ, (Xk)k≥1 is a homogeneous Markov chain with Markov kernel Rγ.

Under appropriate conditions, Rγ is irreducible and positive recurrent, with a unique invariant distribution πγ which does not coincide with the target distribution π.

Questions:

- For a given precision ε > 0, how to choose the stepsize γ > 0 and the number of iterations n so that ‖δx Rγ^n − π‖TV ≤ ε?
- How to select the starting point x cleverly?
- How to quantify the distance between πγ and π?


Discretized Langevin diffusion: decreasing stepsize

When (γk)k≥1 is nonincreasing and non-constant, (Xk)k≥1 is an inhomogeneous Markov chain associated with the kernels (Rγk)k≥1.

Notation: Qγ^p is the composition of Markov kernels

Qγ^p = Rγ1 Rγ2 · · · Rγp .

With this notation, Ex[f(Xp)] = δx Qγ^p f.

Questions:

- Convergence: can the step sizes be selected so that ‖δx Qγ^p − π‖TV → 0?
- Optimal choice of simulation parameters: how to select the number of iterations to achieve ‖δx Qγ^p − π‖TV ≤ ε, starting from a given point x?
- Should we use fixed or decreasing step sizes?


Some (very incomplete) references: Early references

(i) Statistical physics: Parisi, 1981, Correlation functions and computer simulations, Nuclear Physics.

(ii) Bayesian statistics: Grenander and Miller (in the discussion of Besag, Representation of knowledge in complex systems, JRSS B). First theoretical results given by Roberts and Tweedie, 1996, Exponential convergence of Langevin distributions and their discrete approximations, Bernoulli; Stramer and Tweedie, Langevin-type models I. Diffusions with given stationary distributions and their discretizations, MCAP, 1999.

(iii) Most of these results are qualitative (e.g. conditions upon which the sampler is geometrically ergodic).


Some (very incomplete) references: Euler discretisation

(i) Talay, D. and Tubaro, L. Expansion of the global error for numerical schemes solving stochastic differential equations, Stochastic Anal. Appl., 483–509 (1991).

(ii) Lamberton, D.; Pages, G. Recursive computation of the invariant distribution of a diffusion: the case of a weakly mean reverting drift. Stoch. Dyn. 3 (2003), no. 4, 435–451.

(iii) Lemaire, V. Behavior of the Euler scheme with decreasing step in a degenerate situation. ESAIM Probab. Stat. 11 (2007), 236–247.

(iv) Lemaire, V.; Menozzi, S. On some non asymptotic bounds for the Euler scheme. Electron. J. Probab. 15 (2010), no. 53, 1645–1681.


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


SGLD Algorithm

Objective: posterior inference on large-scale datasets (many machine learning applications).

The target π is the posterior distribution in a Bayesian inference problem with prior density π0(x) and a large number N ≫ 1 of i.i.d. observations Di with likelihoods L(Di|x):

π(x) ∝ π0(x) ∏_{i=1}^N L(Di|x) .

The cost of one iteration is Nd, which is prohibitively large for massive datasets.

Notations:

- negated log-likelihood of a single observation: Ui(x) = −log L(Di|x) for i ∈ {1, . . . , N},
- negated log-prior: U0(x) = −log π0(x),
- full potential: U = ∑_{i=0}^N Ui.


SGLD Algorithm

Welling and Teh (2011) suggested replacing ∇U with an unbiased estimate

∇U0 + (N/p) ∑_{i∈S} ∇Ui ,

where S is a minibatch of {1, . . . , N}, of size p, drawn with replacement.

A single update of SGLD is then given by

Xk+1 = Xk − γ ( ∇U0(Xk) + (N/p) ∑_{i∈Sk+1} ∇Ui(Xk) ) + √(2γ) Zk+1 .

The idea of using only a fraction of the observations to compute an unbiased estimate of the gradient at each iteration comes from Stochastic Gradient Descent (SGD), a popular algorithm in machine learning to minimize the potential U.
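A sketch of one possible SGLD implementation along the lines of the update above; the observation indices are 0-based in the code, and grad_U0, grad_Ui are placeholder callables:

```python
import numpy as np

def sgld(grad_U0, grad_Ui, N, x0, gamma, p, n_iter, rng=None):
    """Stochastic Gradient Langevin Dynamics with a minibatch of size p
    drawn with replacement from the N observations:
    X_{k+1} = X_k - gamma * (grad U0(X_k) + (N/p) * sum_{i in S} grad Ui(X_k))
                  + sqrt(2 * gamma) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        batch = rng.integers(0, N, size=p)          # sampled with replacement
        g = grad_U0(x) + (N / p) * sum(grad_Ui(x, i) for i in batch)
        x = x - gamma * g + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples
```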


SGLD and control variates

SGLD scales to large datasets by using noisy gradients calculated on a mini-batch, i.e. a subset of the dataset.

However, the high variance inherent in these noisy gradients degrades performance and leads to slower mixing.

Idea: use variance reduction for the unbiased estimator of the gradient.

The Stochastic Variance Reduced Gradient Langevin Dynamics (SVR-GLD) (Dubey et al., 2016):

Xk+1 = Xk − γ ( ∇U0(Xk) − g0,k + (N/p) ∑_{i∈Sk+1} (∇Ui(Xk) − gi,k) ) + √(2γ) Zk+1 ,

where the control variates gi,k are updated every m iterations, i.e.

gi,k = gi,k−1 if k ≠ 0 (mod m), and gi,k = ∇Ui(Xk) otherwise.
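For illustration, a sketch of a variance-reduced gradient estimate in the spirit of SVRG-type control variates. The sketch adds back the full data-gradient evaluated at the anchor point so that the estimate is unbiased for ∇U(Xk); this is an assumption about the intended construction rather than a transcription of the display above, and all names are placeholders.

```python
import numpy as np

def svrg_gradient(x, grad_U0, grad_Ui, anchor, sum_grad_anchor, batch, N):
    """Variance-reduced estimate of grad U(x):
        grad U0(x) + sum_{i=1}^N grad Ui(anchor)
                   + (N/|batch|) * sum_{i in batch} (grad Ui(x) - grad Ui(anchor)),
    which is unbiased for grad U(x). 'anchor' and the precomputed
    'sum_grad_anchor' = sum_{i=1}^N grad Ui(anchor) are refreshed every m iterations."""
    p = len(batch)
    corr = sum(grad_Ui(x, i) - grad_Ui(anchor, i) for i in batch)
    return grad_U0(x) + sum_grad_anchor + (N / p) * corr
```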


Some (very incomplete) references: SGLD, SVR-GLD

(i) Welling, M. and Teh, Y. W., Bayesian learning via stochastic gradient Langevin dynamics, ICML, 2011.

(ii) Vollmer, S.; Zygalakis, K.; Teh, Y. W., Exploration of the (non-)asymptotic bias and variance of stochastic gradient Langevin dynamics, J. Mach. Learn. Res. 17 (2016).

(iii) Teh, Y. W.; Thiery, A.; Vollmer, S., Consistency and fluctuations for stochastic gradient Langevin dynamics, J. Mach. Learn. Res. 17 (2016).

(iv) Dubey, A.; Reddi, S.; Williamson, S.; Poczos, B.; Smola, A.; Xing, E., Variance reduction in stochastic gradient Langevin dynamics, Advances in Neural Information Processing Systems.

(v) Baker, J.; Fearnhead, P.; Fox, E. B.; Nemeth, C., Control variates for stochastic gradient MCMC (2017).

(vi) Dalalyan, A., Further and stronger analogy between sampling and optimization: Langevin Monte Carlo and gradient descent, Proc. of the 2017 Conference on Learning Theory, volume 65.


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Strongly convex potential

Assumption: U is L-smooth and m-strongly convex

‖∇U(x)−∇U(y)‖ ≤ L ‖x − y‖

〈∇U(x)−∇U(y), x − y〉 ≥ m ‖x − y‖2.

Outline of the proof

1. Control in Wasserstein distance of the laws of the Langevin diffusion and its discretized version.

2. Relating Wasserstein distance result to total variation.

Key technique: (Synchronous and Reflection) coupling !


Wasserstein distance

Definition 1

Let ξ, ξ′ be two probability measures on (Rd, B(Rd)). The Wasserstein distance of order 2 is given by

W2²(ξ, ξ′) = inf_{ζ∈C(ξ,ξ′)} ∫_{Rd×Rd} ‖x − x′‖² ζ(dx dx′) = inf_{ζ∈C(ξ,ξ′)} E[‖X − X′‖²] ,

where (X, X′) ∼ ζ and C(ξ, ξ′) denotes the set of couplings of ξ and ξ′.
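As a quick numerical illustration (not from the slides) of this definition, one can use the fact that in one dimension the optimal coupling matches sorted samples (quantile coupling); the closed-form value for two univariate Gaussians serves as a check:

```python
import numpy as np

rng = np.random.default_rng(0)
# W2(N(m1, s1^2), N(m2, s2^2)) = sqrt((m1 - m2)^2 + (s1 - s2)^2) in one dimension
m1, s1, m2, s2 = 0.0, 1.0, 2.0, 3.0
x = rng.normal(m1, s1, size=100_000)
y = rng.normal(m2, s2, size=100_000)
# in 1-D the optimal coupling pairs sorted samples
w2_mc = np.sqrt(np.mean((np.sort(x) - np.sort(y)) ** 2))
w2_exact = np.sqrt((m1 - m2) ** 2 + (s1 - s2) ** 2)
print(w2_mc, w2_exact)   # both close to sqrt(8) ≈ 2.83
```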


Wasserstein distance convergence

Theorem 2

Assume that U is L-smooth and m-strongly convex.

Then, for all x, y ∈ Rd and t ≥ 0,

W2(δxPt, δyPt) ≤ e^{−mt} ‖x − y‖ .

For all x ∈ Rd and t ≥ 0,

W2(δxPt, π) ≤ e^{−mt} ( ‖x − x⋆‖ + √(d/m) ) ,

where x⋆ = arg min_{Rd} U.

The contraction depends only on the strong convexity constant.


Synchronous Coupling

dYt = −∇U(Yt) dt + √2 dBt ,
dỸt = −∇U(Ỹt) dt + √2 dBt , where (Y0, Ỹ0) = (x, y).

This SDE has a unique strong solution (Yt, Ỹt)t≥0. Since

d(Yt − Ỹt) = −(∇U(Yt) − ∇U(Ỹt)) dt ,

the product rule for semimartingales implies

d‖Yt − Ỹt‖² = −2 ⟨∇U(Yt) − ∇U(Ỹt), Yt − Ỹt⟩ dt .


Synchronous Coupling

‖Yt − Ỹt‖² = ‖Y0 − Ỹ0‖² − 2 ∫_0^t ⟨∇U(Ys) − ∇U(Ỹs), Ys − Ỹs⟩ ds .

Since U is strongly convex, ⟨∇U(y) − ∇U(y′), y − y′⟩ ≥ m ‖y − y′‖², which implies

‖Yt − Ỹt‖² ≤ ‖Y0 − Ỹ0‖² − 2m ∫_0^t ‖Ys − Ỹs‖² ds .

Grönwall's inequality:

‖Yt − Ỹt‖² ≤ ‖Y0 − Ỹ0‖² e^{−2mt}


Contraction property of the discretization

Theorem 3

Assume that U is L-smooth and m-strongly convex. Then,

(i) Let (γk)k≥1 be a nonincreasing sequence with γ1 ≤ 2/(m + L). For all x, y ∈ Rd and ℓ ≥ n ≥ 1,

W2(δx Qγ^{n,ℓ}, δy Qγ^{n,ℓ}) ≤ ( ∏_{k=n}^{ℓ} (1 − κγk) )^{1/2} ‖x − y‖ ,

where κ = 2mL/(m + L).

(ii) For any γ ∈ (0, 2/(m + L)), for all x ∈ Rd and n ≥ 1,

W2(δx Rγ^n, πγ) ≤ (1 − κγ)^{n/2} ( ‖x − x⋆‖ + √(2κ⁻¹d) ) .


A coupling proof (I)

Objective: compute a bound for W2(δx Qγ^n, π).

Since πPt = π for all t ≥ 0, it suffices to get bounds on the Wasserstein distance

W2( δx Qγ^n , π PΓn ) ,

where
- Γn = ∑_{k=1}^n γk,
- δx Qγ^n is the law of the discretized diffusion,
- π PΓn = π, where (Pt)t≥0 is the semigroup of the diffusion.

Idea! Use a synchronous coupling between the diffusion and the interpolation of the Euler discretization.


A coupling proof (II)

For all n ≥ 0 and t ∈ [Γn, Γn+1), define

Yt = YΓn − ∫_{Γn}^t ∇U(Ys) ds + √2 (Bt − BΓn) ,
Ȳt = ȲΓn − ∫_{Γn}^t ∇U(ȲΓn) ds + √2 (Bt − BΓn) ,

with Y0 ∼ π and Ȳ0 = x. For all n ≥ 0,

W2²( δx Qγ^n , π ) ≤ E[ ‖YΓn − ȲΓn‖² ] .


Explicit bound in Wasserstein distance

Theorem 4

Assume that U is m-strongly convex and L-smooth. Let (γk)k≥1 be a nonincreasing sequence with γ1 ≤ 1/(m + L). Then

W2²(δx Qγ^n, π) ≤ u_n^{(1)}(γ) ( ‖x − x⋆‖² + d/m ) + u_n^{(2)}(γ) ,

where u_n^{(1)}(γ) = 2 ∏_{k=1}^n (1 − κγk) with κ = mL/(m + L), and

u_n^{(2)}(γ) = 2 (dL²/m) ∑_{i=1}^n [ γi² c(m, L, γi) ∏_{k=i+1}^n (1 − κγk) ] .

This bound can be sharpened if U is three times continuously differentiable and there exists L̃ such that for all x, y ∈ Rd, ‖∇²U(x) − ∇²U(y)‖ ≤ L̃ ‖x − y‖.


Results

Fixed step size: for any ε > 0, one may choose γ so that

W2(δx⋆ Rγ^p, π) ≤ ε in p = O(d ε⁻¹) iterations,

where x⋆ is the unique maximum of π.

Decreasing step size: with γk = γ1 k^{−α}, α ∈ (0, 1),

W2(δx⋆ Qγ^n, π) = O(d n^{−α}) .

Our results are tight (check with U(x) = (1/2)‖x‖2).
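As a quick sanity check in the Gaussian case mentioned above (a sketch, not taken from the slides): for U(x) = (1/2)‖x‖², ∇U(x) = x and the constant-stepsize ULA recursion is a Gaussian autoregression,

Xk+1 = (1 − γ) Xk + √(2γ) Zk+1 ,

whose invariant law is πγ = N(0, (1 − γ/2)⁻¹ Id) for γ ∈ (0, 2). Since π = N(0, Id),

W2(πγ, π) = √d ( (1 − γ/2)^{−1/2} − 1 ) ≈ (√d) γ/4 for small γ,

so the asymptotic bias of ULA in W2 grows like √d γ, which illustrates why the stepsize must shrink with both the dimension and the target accuracy.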


From the Wasserstein distance to the TV

Theorem 5

Assume that U is strongly convex.

(i) For all x, y ∈ Rd,

‖Pt(x, ·) − Pt(y, ·)‖TV ≤ 1 − 2Φ( −‖x − y‖ / √( (4/m)(e^{2mt} − 1) ) ) ,

where Φ is the standard Gaussian cumulative distribution function.

(ii) For any µ, ν and t > 0,

‖µPt − νPt‖TV ≤ 2^{1/2} W1(µ, ν) / ( π^{1/2} √( (4/m)(e^{2mt} − 1) ) ) .

Use reflection coupling (Lindvall and Rogers, 1986)


Hints of Proof I

dXt = −∇U(Xt) dt + √2 dB_t^d ,
dYt = −∇U(Yt) dt + √2 (Id − 2 et et′) dB_t^d , where et = e(Xt − Yt),

with X0 = x, Y0 = y, e(z) = z/‖z‖ for z ≠ 0 and e(0) = 0. Define the coupling time Tc = inf{s ≥ 0 : Xs = Ys}. By construction Xt = Yt for t ≥ Tc.

B̃_t^d = ∫_0^t (Id − 2 es es′) dB_s^d

is a d-dimensional Brownian motion, therefore (Xt)t≥0 and (Yt)t≥0 are weak solutions of Langevin diffusions started at x and y, respectively. Then by Lindvall's inequality, for all t > 0 we have

‖Pt(x, ·) − Pt(y, ·)‖TV ≤ P(Xt ≠ Yt) .


Hints of Proof II

For t < Tc (before the coupling time),

d(Xt − Yt) = −(∇U(Xt) − ∇U(Yt)) dt + 2√2 et dB¹_t .

Using Itô's formula,

‖Xt − Yt‖ = ‖x − y‖ − ∫_0^t ⟨∇U(Xs) − ∇U(Ys), es⟩ ds + 2√2 B¹_t
          ≤ ‖x − y‖ − m ∫_0^t ‖Xs − Ys‖ ds + 2√2 B¹_t ,

and Gronwall's inequality implies

‖Xt − Yt‖ ≤ e^{−mt} ‖x − y‖ + 2√2 B¹_t − 2√2 m ∫_0^t B¹_s e^{−m(t−s)} ds .


Hint of Proof III

Therefore, by integration by parts, ‖Xt − Yt‖ ≤ Ut, where (Ut)t∈(0,Tc) is the one-dimensional Ornstein-Uhlenbeck process defined by

Ut = e^{−mt} ‖x − y‖ + 2√2 e^{−mt} ∫_0^t e^{ms} dB¹_s .

Therefore, for all x, y ∈ Rd and t ≥ 0, we get

P(Tc > t) ≤ P( min_{0≤s≤t} Us > 0 ) .

Finally the proof follows from the tail of the hitting time of the (one-dimensional) OU process (see Borodin and Salminen, 2002).


From the Wasserstein distance to the TV (II)

‖Pt(x, ·) − Pt(y, ·)‖TV ≤ ‖x − y‖ / √( (2π/m)(e^{2mt} − 1) )

Consequences:

1. (Pt)t≥0 converges exponentially fast to π in total variation, at rate e^{−mt}.

2. For every measurable f : Rd → R with sup|f| ≤ 1, the function x ↦ Pt f(x) is Lipschitz with Lipschitz constant smaller than 1/√( (2π/m)(e^{2mt} − 1) ).


Explicit bound in total variation

Theorem 6

Assume U is L-smooth and strongly convex. Let (γk)k≥1 be a nonincreasing sequence with γ1 ≤ 1/(m + L).

(Optional assumption) U ∈ C³(Rd) and there exists L̃ such that for all x, y ∈ Rd: ‖∇²U(x) − ∇²U(y)‖ ≤ L̃ ‖x − y‖.

Then there exist sequences {u_n^{(1)}(γ), n ∈ N} and {u_n^{(2)}(γ), n ∈ N} such that for all x ∈ Rd and n ≥ 1,

‖δx Qγ^n − π‖TV ≤ u_n^{(1)}(γ) ( ‖x − x⋆‖² + d/m ) + u_n^{(2)}(γ) .


Constant step sizes

For any ε > 0, the minimal number of iterations to achieve ‖δx Rγ^p − π‖TV ≤ ε is

p = O( d log(d) ε⁻¹ |log(ε)| ) .

For a given stepsize γ, letting p → +∞, we get:

‖πγ − π‖TV ≤ C γ |log(γ)| .


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Convergence of the Euler discretization

Assume one of the following conditions:

There exist α > 1, ρ > 0 and Mρ ≥ 0 such that for all y ∈ Rd with ‖y‖ ≥ Mρ:

⟨∇U(y), y⟩ ≥ ρ ‖y‖^α .

U is convex.

Results

If lim_{k→+∞} γk = 0 and ∑_k γk = +∞, then

lim_{p→+∞} ‖δx Qγ^p − π‖TV = 0 .

‖πγ − π‖TV ≤ C √γ


Target precision ε: the convex case

Setting: U is convex. Constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV: ‖δx Rγ^p − π‖TV ≤ ε.

      d         ε                    L
γ     O(d⁻³)    O(ε² / log(ε⁻¹))     O(L⁻²)
p     O(d⁵)     O(ε⁻² log²(ε⁻¹))     O(L²)

Table: recall that in the strongly convex case the iteration count scales as d log(d).


Strongly convex outside a ball potential

U is convex everywhere and strongly convex outside a ball, i.e. there exist R ≥ 0 and m > 0 such that for all x, y ∈ Rd with ‖x − y‖ ≥ R,

⟨∇U(x) − ∇U(y), x − y⟩ ≥ m ‖x − y‖² .

Eberle (2015) established that the convergence in the Wasserstein distance does not depend on the dimension.

Durmus, Moulines (2016) established that the convergence of the semigroup to π in TV does not depend on the dimension but only on R, yielding bounds which scale nicely with the dimension.


Dependence on the dimension

Setting: U is convex and strongly convex outside a ball. Constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV: ‖δx Rγ^p − π‖TV ≤ ε.

      d              ε                    L        m         R
γ     O(d⁻¹)         O(ε² / log(ε⁻¹))     O(L⁻²)   O(m)      O(R⁻⁴)
p     O(d log(d))    O(ε⁻² log²(ε⁻¹))     O(L²)    O(m⁻²)    O(R⁸)

Table: Of course, there is a price to pay in the dependence on R.


Target precision ε: the convex case

Setting: U is convex. Constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV: ‖δx⋆ Rγ^p − π‖TV ≤ ε, with starting point x⋆ ∈ arg min_{Rd} U.

      d         ε                    L
γ     O(d⁻³)    O(ε² / log(ε⁻¹))     O(L⁻²)
p     O(d⁵)     O(ε⁻² log²(ε⁻¹))     O(L²)


Strongly convex outside a ball potential

U is convex everywhere and strongly convex outside a ball, i.e. there exist R ≥ 0 and m > 0 such that for all x, y ∈ Rd with ‖x − y‖ ≥ R,

⟨∇U(x) − ∇U(y), x − y⟩ ≥ m ‖x − y‖² .

Eberle (2015) established that the convergence in the Wasserstein distance does not depend on the dimension.

Durmus, Moulines (2016) established that the convergence of the semigroup to π in TV does not depend on the dimension but only on R, yielding new bounds which scale nicely with the dimension.


Dependence on the dimension

Setting: U is convex and strongly convex outside a ball. Constant stepsize.

Optimal stepsize γ and number of iterations p to achieve ε-accuracy in TV: ‖δx⋆ Rγ^p − π‖TV ≤ ε, with starting point x⋆ ∈ arg min_{Rd} U.

      d              ε                    L        m         R
γ     O(d⁻¹)         O(ε² / log(ε⁻¹))     O(L⁻²)   O(m)      O(R⁻⁴)
p     O(d log(d))    O(ε⁻² log²(ε⁻¹))     O(L²)    O(m⁻²)    O(R⁸)


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Non-smooth potential

Problem: what if U is convex but not continuously differentiable? More precisely, assume that

π ∝ e^{−U} , U = f + g ,

where f is convex and smooth and g is convex but not smooth.

Applications:

(i) LASSO, g(x) = ∑_{i=1}^d |xi|; fused-LASSO models, g(x) = ∑_{i=1}^d |xi − xi−1|.

(ii) Constrained domains, g(x) = ιK(x) with K = {x ∈ Rd : ∑_{i=1}^d |xi| ≤ κ}, where

ιK(x) := +∞ if x ∉ K, 0 if x ∈ K.


Non-smooth potential

Idea: apply the EM discretization (Durmus, M., Pereyra, 2018) to an appropriately regularized version of g such that

(i) the convexity of U is preserved,
(ii) the regularized U is continuously differentiable and gradient Lipschitz,
(iii) the resulting approximation is close to π (e.g. in total variation norm).


Moreau-Yosida regularization

Assume that g : Rd → (−∞,+∞] is a l.s.c convex function and let λ > 0.

The λ-Moreau-Yosida envelope g^λ : Rd → R is defined for all x ∈ Rd by

g^λ(x) = inf_{y∈Rd} { g(y) + (2λ)⁻¹ ‖x − y‖² } ≤ g(x) .

For every x ∈ Rd, the infimum is achieved at a unique point, prox_g^λ(x), which is characterized by the inclusion

x − prox_g^λ(x) ∈ λ ∂g(prox_g^λ(x)) .

The Moreau-Yosida envelope is a regularized version of g, which approximates g from below.


Properties of proximal operators

As λ ↓ 0, g^λ converges pointwise to g, i.e. for all x ∈ Rd,

g^λ(x) ↑ g(x) as λ ↓ 0 .

The function g^λ is convex and continuously differentiable, with

∇g^λ(x) = λ⁻¹ ( x − prox_g^λ(x) ) .

The proximal operator is a monotone operator: for all x, y ∈ Rd,

⟨prox_g^λ(x) − prox_g^λ(y), x − y⟩ ≥ 0 ,

which implies that the Moreau-Yosida envelope is gradient Lipschitz with constant λ⁻¹: for all x, y ∈ Rd,

‖∇g^λ(x) − ∇g^λ(y)‖ ≤ λ⁻¹ ‖x − y‖ .
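A small numerical illustration (not from the slides) of these identities for g(x) = |x| applied coordinatewise, whose proximal operator is soft-thresholding:

```python
import numpy as np

def prox_abs(x, lam):
    # proximal operator of g(x) = |x| (coordinatewise): soft-thresholding
    return np.sign(x) * np.maximum(np.abs(x) - lam, 0.0)

def grad_moreau_env(x, lam):
    # gradient of the Moreau-Yosida envelope: (x - prox(x)) / lambda
    return (x - prox_abs(x, lam)) / lam

x = np.linspace(-3, 3, 7)
print(grad_moreau_env(x, lam=0.5))
# equals sign(x) where |x| >= lambda and x / lambda where |x| < lambda,
# i.e. a smoothed (Huber-like) version of the subgradient of |x|
```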


Moreau-Yosida regularization

If g is not differentiable but the proximal operator associated with g is available, its λ-Moreau-Yosida envelope g^λ can be considered.

This leads to the approximation of the potential, Uλ : Rd → R, defined for all x ∈ Rd by

Uλ(x) = f(x) + g^λ(x) .

Question: does it make sense to use Uλ for targeting π ∝ e^{−U}?


Assumptions

H1

π ∝ e−U , U = f + g

f : Rd → R and g : Rd → (−∞,+∞] are convex

f is continuously differentiable and gradient Lipschitz with Lipschitz constant Lf, i.e. for all x, y ∈ Rd,

‖∇f(x) − ∇f(y)‖ ≤ Lf ‖x − y‖ .

g is lower semi-continuous and ∫_{Rd} e^{−g(y)} dy ∈ (0, +∞).


Properties of proximal operators and consequences

The function g^λ is convex and continuously differentiable.

g^λ is gradient Lipschitz: for all x, y ∈ Rd, ‖∇g^λ(x) − ∇g^λ(y)‖ ≤ λ⁻¹ ‖x − y‖.

Consequence: the function Uλ is convex, continuously differentiable and gradient Lipschitz.


Properties of proximal operators and consequences

Question: can Uλ be used instead of U in ULA (or any MCMC algorithm that uses the gradient of U)?

Uλ defines a regularized distribution πλ:

πλ ∝ e^{−Uλ} , Uλ(x) = f(x) + g^λ(x) .

A minimal requirement is that πλ ∝ e^{−Uλ} is a probability density function!

Theorem 7 (Durmus, Moulines, Pereyra, SIAM J. Imaging Sci., 2018)

Under H1, for all λ > 0, 0 < ∫_{Rd} e^{−Uλ(y)} dy < +∞.


Some approximation results

Uλ defines a regularized version πλ of π:

πλ ∝ e^{−Uλ} , Uλ(x) = f(x) + g^λ(x) .

Question: we know that lim_{λ↓0} Uλ(x) = U(x). Is this sufficient to guarantee that πλ is a sensible approximation of π?

Theorem 8 (Durmus, Moulines, Pereyra, SIAM J. Imaging Sci., 2018)

Assume H1.

1. Then, lim_{λ→0} ‖πλ − π‖TV = 0.

2. Assume in addition that g is Lipschitz. Then for all λ > 0,

‖πλ − π‖TV ≤ λ ‖g‖²_Lip .


Moreau-Yosida approximations

Figure: True densities (solid blue) and their Moreau-Yosida approximations (dashed red), for p(x) ∝ exp(−|x|), p(x) ∝ exp(−x⁴) and p(x) ∝ 1_{[−0.5,0.5]}(x).


The MYULA algorithm

Main idea: target πλ ∝ e^{−Uλ} instead of π ∝ e^{−U} using ULA.

Reasons:

- πλ is a "good" approximation of π provided that the regularization parameter λ is small enough,
- Uλ is continuously differentiable, gradient Lipschitz and convex.

Given a regularization parameter λ > 0 and a stepsize γ > 0, ULA applied to πλ yields

X^M_{k+1} = X^M_k − γ ( ∇f(X^M_k) + λ⁻¹ ( X^M_k − prox_g^λ(X^M_k) ) ) + √(2γ) Zk+1 ,

where (Zk)_{k∈N*} is a sequence of i.i.d. d-dimensional standard Gaussian random variables.
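A minimal sketch of the MYULA recursion above, written in terms of user-supplied ∇f and prox_g^λ (function names and signatures are illustrative):

```python
import numpy as np

def myula(grad_f, prox_g, x0, gamma, lam, n_iter, rng=None):
    """MYULA: ULA applied to U^lambda = f + g^lambda, i.e.
    X_{k+1} = X_k - gamma * (grad f(X_k) + (X_k - prox_g^lambda(X_k)) / lambda)
                  + sqrt(2 * gamma) * Z_{k+1}."""
    rng = np.random.default_rng() if rng is None else rng
    x = np.array(x0, dtype=float)
    samples = np.empty((n_iter, x.size))
    for k in range(n_iter):
        grad_U_lam = grad_f(x) + (x - prox_g(x, lam)) / lam
        x = x - gamma * grad_U_lam + np.sqrt(2.0 * gamma) * rng.standard_normal(x.size)
        samples[k] = x
    return samples
```

Combined with the soft-thresholding prox sketched earlier, this targets a smoothed version of an ℓ1-penalized (Laplace-type) posterior.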


Microscopy dataset

Figure: Microscopy dataset, field of size 4µm × 4µm containing 100 molecules: (a) original observation, (b) MAP estimate.

Objective: recover a high-resolution image x ∈ Rn from a blurred and noisy observation y = Hx + w, where H is a circulant blurring matrix and w ∼ N(0, σ²In). This inverse problem is ill-conditioned, but this can be mitigated by exploiting prior knowledge.


Microscopy dataset

The goal is to recover the image x ∈ Rn from an incomplete and noisy set of Fourier measurements y = AFx + w, where F is the discrete Fourier transform operator, A is a tomographic sampling mask, and w ∼ N(0, σ²In).

This inverse problem is ill-posed, resulting in significant uncertainty about the true value of x.

Idea: use a LASSO (or horseshoe) prior to promote sparse images (nearly-black object). The resulting posterior is

π(x) ∝ exp[ −( ‖y − AFx‖²/(2σ²) + β‖x‖₁ ) ] ,

with fixed hyper-parameters σ > 0 and β > 0.

This density is log-concave and MAP estimation can be performed efficiently by proximal convex optimisation.


HPD

Point estimators such as xMAP deliver accurate results but do not provide information about the posterior uncertainty of x or ϕ(x), where ϕ is a function of interest.

This is formalised in the Bayesian decision theory framework by computing highest posterior density (HPD) regions.

Problem: compute posterior credibility sets for ϕ(x).


Comparison with PMALA

Figure: Microscopy experiment: (a) HPD region thresholds ηα for MYULA (2×10⁶ iterations, λ = 1, γ = 0.6) and PMALA (2×10⁷ iterations), (b) relative approximation error of MYULA.


Sparse image deblurring

Figure: Live-cell microscopy data (Zhu2012); panels: observation y, xMAP, uncertainty estimates. Uncertainty analysis (±78nm × ±125nm). Computing time: 4 minutes, M = 10⁵ iterations, estimation error 0.2%.


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Densities with convex support

π has bounded support, log π = −∞ outside some domain: Supp(π) = K, where K ⊂ Rd is a convex body:

π ∝ e^{−U} , U = f + ιK , ιK(x) = +∞ if x ∉ K, 0 if x ∈ K ,

where f is smooth.

The Moreau-Yosida envelope of ιK is given, for the regularization parameter λ > 0, by

ι_K^λ(x) = inf_{y∈Rd} ( ιK(y) + (2λ)⁻¹ ‖x − y‖² ) = (2λ)⁻¹ ‖x − projK(x)‖² .
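The gradient of this envelope is λ⁻¹(x − projK(x)), so a constrained target only requires a projection. A small sketch for a Euclidean-ball constraint (the ball is an illustrative choice, not from the slides):

```python
import numpy as np

def proj_ball(x, radius=1.0):
    # Euclidean projection onto the ball {x : ||x|| <= radius}
    nrm = np.linalg.norm(x)
    return x if nrm <= radius else (radius / nrm) * x

def grad_iota_lambda(x, lam, radius=1.0):
    # gradient of the Moreau-Yosida envelope of the ball indicator:
    # (1 / lambda) * (x - proj_K(x)); vanishes inside K, pushes back towards K outside
    return (x - proj_ball(x, radius)) / lam
```

Since λ⁻¹(x − projK(x)) is exactly the prox-based term with prox_{ιK}^λ = projK, this can be plugged directly into the MYULA sketch above.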


Previous works

Previous work for the Metropolis algorithm and the hit-and-run: Applegate, Kannan, Dyer, Frieze, Polson, Simonovits, Lovasz, Vempala...

Our approach is more in the spirit of Bubeck, Eldan, and Lehec (2015), based on Projected Langevin Monte Carlo.


Main results - Assumptions

H2

f is convex, continuously differentiable on Rd and gradient Lipschitz with Lipschitz constant Lf, i.e. for all x, y ∈ Rd, ‖∇f(x) − ∇f(y)‖ ≤ Lf ‖x − y‖.

H3

There exist r ,R > 0, r ≤ R, such that B(0, r) ⊂ K ⊂ B(0,R).


Main results - Statement

Theorem 9 (Brosse, Durmus, Moulines, Pereyra, 2017)

Assume H2 and H3. For all ε > 0 and x ∈ Rd, there exist (explicit) λ > 0 and γ > 0 such that

‖δx R_{γ,λ}^n − π‖TV ≤ ε for n = Ω(d⁵) ,

where Rγ,λ is the Markov kernel associated with (X_k^λ)_{k≥0}.

Similar bounds hold for the Wasserstein distance.


Comparison with existing results

Lovasz (2007) shows that the complexity of the RWM and hit-and-run algorithms is of order d⁴. However, this result requires that K is well-rounded (which is difficult to check).

Bubeck et al. (2015) study the complexity of the Projected Langevin Monte Carlo algorithm (PLMC):

Xk+1 = projK( Xk − γ∇f(Xk) + √(2γ) Zk+1 ) .

Note that, contrary to MYULA, the iterates of PLMC always belong to K. Under similar assumptions, Bubeck et al. (2015) get explicit bounds in total variation for PLMC:

                                        d → +∞    ε → 0      R → +∞    r → 0
Bubeck et al. 2015, π uniform on K      O(d⁷)     O(ε⁻⁸)     O(R⁶)     O(r⁻⁶)
Bubeck et al. 2015, π log-concave       O(d¹²)    O(ε⁻¹²)    O(R¹⁸)    O(r⁻¹⁸)
MYULA                                   O(d⁵)     O(ε⁻⁶)     O(R⁴)     O(r⁻⁴)

Table: Complexity to achieve ‖δx⋆ R_{γ,λ}^n − π‖TV ≤ ε.


Application to regression with ℓ1 constraints

1. For all s > 0, consider the density πs ∝ e^{−Us}, where

Us(x) = ‖Y − Zx‖² / (2σ²) + ιKs(x) , Ks = {x ∈ Rd : ‖x‖₁ ≤ s} .

2. Dual problem of LASSO regression in optimization.

3. We compute, for all i ∈ {1, . . . , d}, the median of xi for different values of s on the diabetes data set (n = 442, d = 10).

4. Compute the LASSO regularization paths.


Application to regression with ℓ1 constraints

Figure: LASSO regularization paths (coefficients against the shrinking factor) for (a) MYULA and (b) Wall HMC (Neal, 2010).


Outline

1 Motivation and setting

2 The ULA algorithm

3 The Stochastic Gradient Langevin dynamics (SGLD)

4 Strongly log-concave distribution

5 Convex and Super-exponential densities

6 Non-smooth potentials

7 Logconcave densities with constrained domains

8 Normalizing constants of log-concave densities


Normalizing constants

Let U : Rd → R. We aim at estimating Z = ∫_{Rd} e^{−U(x)} dx < +∞.

Z is the normalizing constant of the probability density π associated with the potential U.

Many applications in Bayesian inference (Bayes factors) and statistical physics (free energy).

In Bayesian inference, models can be compared via Bayes factors, which are ratios of two normalizing constants.

Many methods... but few theoretical guarantees.

Assumption: U is a continuously differentiable convex function with min U = 0.


Multistage sampling

Idea: decompose the original problem into a sequence of problems which are easier to solve.

Multistage sampling method (Gelman, 1998):

Z / Z0 = ∏_{i=0}^{M−1} Zi+1 / Zi ,

where

1. M ∈ N* is the number of stages,
2. Z0 is the initial normalizing constant (should be easy to compute),
3. Zi+1/Zi are ratios of normalizing constants (that should also be easy to estimate).


A Gaussian annealing algorithm

M ∈ N* is the number of stages.

Let {σi²}_{i=0}^M be an increasing sequence of positive numbers and set σM² = +∞.

Consider the sequence of functions {Ui}_{i=0}^M defined for all i ∈ {0, . . . , M} and x ∈ Rd by

Ui(x) = ‖x‖² / (2σi²) + U(x) ,

with the convention 1/∞ = 0.

Note that UM = U, since σM = +∞.

If σ0 is small enough, then U0(x) ≈ ‖x‖² / (2σ0²).


A Gaussian annealing algorithm

Define the sequence of probability densities {πi}_{i=0}^M for i ∈ {0, . . . , M} and x ∈ Rd by

πi(x) = Zi⁻¹ e^{−Ui(x)} , Zi = ∫_{Rd} e^{−Ui(y)} dy .

This defines (Zi)_{i=1}^M in the decomposition

Z / Z0 = ∏_{i=0}^{M−1} Zi+1 / Zi .

For i ∈ {0, . . . , M − 1}, we get

Zi+1 / Zi = ∫_{Rd} gi(x) πi(x) dx = πi(gi) ,

where gi : Rd → R+ is defined for any x ∈ Rd by

gi(x) = exp( ai ‖x‖² ) , ai = (1/2) ( 1/σi² − 1/σ²_{i+1} ) .


Multistage methods

Multistage sampling type algorithms are widely used and known under different names: multistage sampling (Valleau, 1972), (extended) bridge sampling (Gelman, 1998), annealed importance sampling (AIS) (Neal, 2001), thermodynamic integration (Girolami, 2016), power posterior (Friel, 2012).

For the stability and accuracy of the method, the choice of the parameters is crucial and is known to be difficult.

Indeed, the issue has been pointed out in several articles under the names of tuning tempered transitions (Friel, 2012), temperature placement (Friel, 2014), annealing sequence, temperature ladder (Girolami, 2016), effects of grid size...

In Brosse, Durmus, Moulines (2018), we explicitly determine the sequence {σi²}_{i=0}^{M−1}.


Multistage Langevin

Compute, for all i ∈ {1, . . . , M − 1},

Zi+1 / Zi = ∫_{Rd} gi(x) πi(x) dx = πi(gi) .

The quantity πi(gi) is estimated by the Unadjusted Langevin Algorithm (ULA) targeting πi.

For all i ∈ {1, . . . , M}, consider

X_{i,k+1} = X_{i,k} − γi ∇Ui(X_{i,k}) + √(2γi) Z_{i,k+1} , X_{i,0} = 0 .

For i ∈ {0, . . . , M − 1}, consider the following estimator of Zi+1/Zi:

π̂i(gi) = (1/ni) ∑_{k=Ni+1}^{Ni+ni} gi(X_{i,k}) ,

where ni ≥ 1 is the sample size and Ni ≥ 0 the burn-in period.


ULA algorithm

We want to compute, for all i ∈ {1, . . . , M − 1},

Zi+1 / Zi = ∫_{Rd} gi(x) πi(x) dx = πi(gi) .

For i ∈ {0, . . . , M − 1}, consider the following estimator of Zi+1/Zi:

π̂i(gi) = (1/ni) ∑_{k=Ni+1}^{Ni+ni} gi(X_{i,k}) ,

where ni ≥ 1 is the sample size and Ni ≥ 0 the burn-in period.

Ẑ is the following estimator of Z:

Ẑ = (2πσ0²)^{d/2} (1 + σ0²m)^{−d/2} ∏_{i=0}^{M−1} π̂i(gi) ,
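A compact sketch of the whole pipeline under simplifying assumptions: Z0 is approximated by the plain Gaussian normalizer (2πσ0²)^{d/2} (the slides use an extra (1 + σ0²m)^{−d/2} correction), a single stepsize is shared across stages, and all names and parameters are illustrative.

```python
import numpy as np

def log_Z_estimate(grad_U, d, sigmas, gamma, n_samples, burn_in, rng=None):
    """Annealed estimator: log Z ≈ log Z0_hat + sum_i log pi_i(g_i)_hat, where
    pi_i(g_i) is estimated by ULA targeting U_i(x) = ||x||^2 / (2 sigma_i^2) + U(x).
    'sigmas' lists sigma_0, ..., sigma_{M-1}; sigma_M = +inf is implicit."""
    rng = np.random.default_rng() if rng is None else rng
    sig2 = np.array(sigmas, dtype=float) ** 2
    inv = np.append(1.0 / sig2, 0.0)             # convention 1 / sigma_M^2 = 0
    log_Z = 0.5 * d * np.log(2.0 * np.pi * sig2[0])   # crude approximation of log Z0
    for i in range(len(sig2)):                   # stages i = 0, ..., M-1
        a_i = 0.5 * (inv[i] - inv[i + 1])
        grad_Ui = lambda x, s=inv[i]: s * x + grad_U(x)
        x = np.zeros(d)
        log_g = []
        for k in range(burn_in + n_samples):     # ULA chain targeting pi_i
            x = x - gamma * grad_Ui(x) + np.sqrt(2.0 * gamma) * rng.standard_normal(d)
            if k >= burn_in:
                log_g.append(a_i * np.dot(x, x))   # log g_i(x) = a_i ||x||^2
        lg = np.array(log_g)
        shift = lg.max()                         # stable log-mean-exp
        log_Z += shift + np.log(np.mean(np.exp(lg - shift)))
    return log_Z
```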


Theoretical analysis

Denote by S the set of simulation parameters,

S = { M, {σi²}_{i=0}^{M−1}, {γi}_{i=0}^{M−1}, {ni}_{i=0}^{M−1}, {Ni}_{i=0}^{M−1} } .

Ẑ is the following estimator of Z:

Ẑ = (2πσ0²)^{d/2} (1 + σ0²m)^{−d/2} ∏_{i=0}^{M−1} π̂i(gi) .

Cost of the algorithm: cost = ∑_{i=0}^{M−1} (Ni + ni).

Theorem 10 (Brosse, Durmus, Moulines, 2018)

Let µ, ε ∈ (0, 1). There exists an explicit choice of the simulation parameters S such that the estimator Ẑ satisfies

P( |Ẑ/Z − 1| > ε ) ≤ µ .

Moreover, the cost of the algorithm is polynomial in the dimension d, ε⁻¹ and µ⁻¹.


Numerical experiments

Figure: Boxplots of the logarithm of the normalizing constants of a multivariate Gaussian distribution in dimension d ∈ {10, 25, 50}.


Numerical experiments II

Figure: Boxplot of the log evidence for a mixture of 4 Gaussian distributions in dimension 2.


Our published work

1. Durmus, A.; Moulines, E. Quantitative bounds of convergence for geometrically ergodic Markov chain in the Wasserstein distance with application to the Metropolis adjusted Langevin algorithm. Stat. Comput., 2015.

2. Durmus, A.; Moulines, E. Non-asymptotic convergence analysis for the Unadjusted Langevin Algorithm. Ann. Appl. Probab., 2017.

3. Durmus, A.; Moulines, E. High-dimensional Bayesian inference via the Unadjusted Langevin Algorithm. Major revision in Bernoulli, 2018 [first version submitted 3 years ago!].

4. Durmus, A.; Simsekli, U.; Moulines, E.; Badeau, R. Stochastic Gradient Richardson-Romberg Markov Chain Monte Carlo. NIPS, 2016.

5. Brosse, N.; Durmus, A.; Moulines, E.; Pereyra, M. Sampling from a log-concave distribution with compact support with proximal Langevin Monte Carlo. COLT, 2017.

6. Durmus, A.; Moulines, E.; Pereyra, M. Efficient Bayesian computation by proximal Markov chain Monte Carlo: when Langevin meets Moreau. SIAM J. Imaging Sciences, 2018.


Figure: A beach reading or a nice Christmas present for a loved one... "Markov Chains" by Randal Douc, Eric Moulines, Pierre Priouret and Philippe Soulier, Springer Series in Operations Research and Financial Engineering, ISBN 978-3-319-97703-4. This new book covers the classical theory of Markov chains on general state-spaces as well as many recent developments. The theoretical results are illustrated by simple examples, many of which are taken from Markov Chain Monte Carlo methods.