
A tour of transport methods for Bayesian computation

Youssef Marzouk, joint work with Ricardo Baptista, Daniele Bigoni, Matthew Parno, & Alessio Spantini

Department of Aeronautics and Astronautics
Center for Computational Engineering
Statistics and Data Science Center
Massachusetts Institute of Technology

http://uqgroup.mit.edu

Support from DOE ASCR, NSF, DARPA

3 December 2018

Bayesian inference in large-scale models

Observations y, parameters x:

  π_pos(x) := π(x | y) ∝ π(y | x) π_pr(x)    (Bayes' rule)

- Goal: characterize the posterior distribution (with density π_pos)
- This is a challenging task since:
  - x ∈ R^n is typically high-dimensional (e.g., a discretized function)
  - π_pos is non-Gaussian
  - evaluations of the likelihood (and hence of π_pos) may be expensive
  - π_pos can be evaluated only up to a normalizing constant

Sequential Bayesian inference

- State estimation (e.g., filtering and smoothing) or joint state and parameter estimation, in a Bayesian setting
- Need recursive, online algorithms for characterizing the posterior distribution

Computational challenges

- Extract information from the posterior (means, covariances, event probabilities, predictions) by evaluating posterior expectations:

  E_{π_pos}[h(x)] = ∫ h(x) π_pos(x) dx

- Key strategies for making this computationally tractable:
  1. Approximations of the forward model (e.g., polynomial approximations, Gaussian process emulators, reduced-order models, multi-fidelity approaches)
  2. Efficient and structure-exploiting sampling schemes
- This talk: relate these strategies to notions of coupling and transport!

Deterministic couplings of probability measures

[Figure: the exact map T pushes the reference density η to the target π, and its inverse S = T⁻¹ pulls π back to η; an inexact map T̃ captures only part of the structure of the target. The accompanying thesis excerpt notes that the choice of transport cost shapes the optimal map: for a Gaussian target, a quadratic cost selects the eigenvalue square root of the covariance, while a limiting weighted cost selects the Cholesky factor and, in general, the Knothe–Rosenblatt (triangular) rearrangement.]

Core idea
- Choose a reference distribution η (e.g., standard Gaussian)
- Seek a transport map T : R^n → R^n such that T♯η = π
- Equivalently, find S = T⁻¹ such that S♯π = η
- Enables exact (independent, unweighted) sampling (see the sketch below)
- Satisfying these conditions only approximately may still be useful!
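
To make the exact-sampling bullet concrete, here is a minimal sketch in the simplest possible setting, a Gaussian target in two dimensions, where the transport map is just a shift plus a Cholesky factor. For the non-Gaussian targets in the rest of the talk, T would instead be a nonlinear map constructed as described below; the target mean and covariance here are illustrative values only.

```python
import numpy as np

# Minimal sketch of sampling by pushing reference draws through a map.
# Here the target is a toy 2-D Gaussian, so the exact map T is linear
# (a shift plus a Cholesky factor); for a general non-Gaussian target,
# T would be the (approximate) transport map discussed in later slides.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)

def T(x):
    """Push a standard-Gaussian sample x ~ eta to the target pi = N(mu, Sigma)."""
    return mu + L @ x

x_ref = rng.standard_normal((1000, 2))          # independent samples from eta
x_post = np.array([T(x) for x in x_ref])        # independent, unweighted samples from pi

print(x_post.mean(axis=0), np.cov(x_post.T))    # should be close to mu and Sigma
```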

Topics for this talk

Three vignettes on transport in Bayesian computation:
1. Variational Bayesian inference
2. Accelerating Markov chain Monte Carlo
3. Nonlinear ensemble filtering and map estimation

Choice of transport map

A useful building block is the Knothe–Rosenblatt rearrangement:

  T(x) = [ T^1(x_1),  T^2(x_1, x_2),  ...,  T^n(x_1, x_2, ..., x_n) ]ᵀ

- Exists and is unique (up to ordering) under mild conditions on η, π
- Jacobian determinant is easy to evaluate
- "Exposes" marginals, enables conditional sampling...
- Numerical approximations can employ a monotone parameterization guaranteeing ∂T^k/∂x_k > 0; for example (see the sketch below),

  T^k(x_1, ..., x_k) = a_k(x_1, ..., x_{k-1}) + ∫₀^{x_k} exp( b_k(x_1, ..., x_{k-1}, w) ) dw
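
A minimal sketch of that monotone parameterization for a single component T^k, assuming, purely for illustration, that a_k and b_k are affine functions of their arguments; the one-dimensional integral is evaluated by Gauss–Legendre quadrature. All coefficient names here are placeholders, not values from the talk.

```python
import numpy as np

# Minimal sketch of one component of a monotone triangular map,
#   T^k(x_1..x_k) = a_k(x_1..x_{k-1}) + int_0^{x_k} exp(b_k(x_1..x_{k-1}, w)) dw,
# with a_k and b_k taken to be simple affine functions for illustration.

def a_k(x_prev, coeffs_a):
    """Non-monotone part: affine in the preceding variables."""
    return coeffs_a[0] + np.dot(coeffs_a[1:], x_prev)

def b_k(x_prev, w, coeffs_b):
    """Exponent of the integrand; any form works since exp(.) > 0."""
    return coeffs_b[0] + np.dot(coeffs_b[1:-1], x_prev) + coeffs_b[-1] * w

def T_k(x, coeffs_a, coeffs_b, n_quad=50):
    """Evaluate T^k(x_1,...,x_k) using Gauss-Legendre quadrature on [0, x_k]."""
    x_prev, x_last = x[:-1], x[-1]
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    w = 0.5 * x_last * (nodes + 1.0)               # map [-1, 1] -> [0, x_k]
    integrand = np.exp(b_k(x_prev, w, coeffs_b))
    integral = 0.5 * x_last * np.dot(weights, integrand)
    return a_k(x_prev, coeffs_a) + integral

# dT^k/dx_k = exp(b_k(x_prev, x_k)) > 0, so T^k is increasing in its last argument.
x = np.array([0.3, -1.2, 0.7])
print(T_k(x, coeffs_a=np.array([0.1, 0.2, -0.3]),
          coeffs_b=np.array([0.0, 0.1, 0.1, -0.2])))
```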

Variational inference

Variational characterization of the direct map T [Moselhy & M 2012]:

  min_{T ∈ T△} D_KL( T♯η ‖ π ) = min_{T ∈ T△} D_KL( η ‖ T⁻¹♯π )

- T△ is the set of monotone lower triangular maps; it contains the Knothe–Rosenblatt rearrangement
- The expectation is with respect to the reference measure η; compute it via, e.g., Monte Carlo or sparse quadrature
- Uses unnormalized evaluations of π and its gradients; no MCMC or importance sampling

Simple example

  min_T E_η[ −log π ∘ T − Σ_k log ∂T^k/∂x_k ]

- Parameterized map T ∈ T△^h ⊂ T△
- Optimize over the coefficients of the parameterization, using gradient-based optimization (see the sketch below)
- In this example, the posterior lies in the tail of the reference
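
A minimal one-dimensional sketch of this optimization, assuming a linear monotone map T(x) = a + exp(c)·x, a fixed Monte Carlo sample from the reference, and an unnormalized Gaussian stand-in for the target density. None of these choices come from the slides; they only illustrate the recipe "parameterize the map, then minimize the sample-average KL objective over its coefficients."

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of  min_T E_eta[ -log pi(T(x)) - sum_k log dT^k/dx_k ]
# in one dimension, with T(x) = a + exp(c) * x and a fixed sample from
# the reference eta = N(0, 1).  The target below is an illustrative
# unnormalized Gaussian, standing in for an unnormalized posterior.
rng = np.random.default_rng(1)
x_ref = rng.standard_normal(2000)               # samples from eta

def log_pi_bar(z):
    return -0.5 * (z - 3.0) ** 2 / 0.25         # unnormalized N(3, 0.5^2)

def objective(params):
    a, c = params
    Tx = a + np.exp(c) * x_ref                  # map applied to reference samples
    log_jac = c                                 # log dT/dx = c (constant here)
    return np.mean(-log_pi_bar(Tx) - log_jac)

res = minimize(objective, x0=np.zeros(2))       # BFGS with numerical gradients
a_opt, c_opt = res.x
print("shift ~", a_opt, "scale ~", np.exp(c_opt))   # expect roughly 3 and 0.5
```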

Useful features

- Move samples; don't just reweigh them
- Independent and cheap samples: x_i ∼ η  ⟹  T(x_i) ∼ T♯η ≈ π
- Clear convergence criterion, even with an unnormalized target density (see the sketch below):

  D_KL( T♯η ‖ π ) ≈ ½ Var_η[ log( η / T⁻¹♯π̄ ) ]

- Can either accept the bias or reduce it by:
  - increasing the complexity of the map T ∈ T△^h
  - sampling the pullback T⁻¹♯π using MCMC or importance sampling
- Many recent constructions also employ transport for variational inference (Stein variational gradient descent [Liu & Wang 2016], normalizing flows [Rezende & Mohamed 2015]) or for sampling (Gibbs flows [Heng et al. 2015], the particle flow filter [Reich 2011], implicit sampling [Chorin et al. 2009–2015])
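
A minimal sketch of that variance diagnostic: draw reference samples, evaluate the log of the reference density minus the log of the pulled-back unnormalized target, and take half the sample variance. The one-dimensional map and target below are illustrative placeholders; the diagnostic is unchanged when π̄ is unnormalized, since additive constants drop out of the variance.

```python
import numpy as np

# Minimal sketch of the diagnostic
#   D_KL(T#eta || pi)  ~  (1/2) Var_eta[ log eta(x) - log (T^{-1}#pi_bar)(x) ],
# where the pullback density is (T^{-1}#pi_bar)(x) = pi_bar(T(x)) |dT/dx|.
rng = np.random.default_rng(2)
x = rng.standard_normal(5000)                        # x ~ eta = N(0, 1)

def log_eta(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_pi_bar(z):
    return -0.5 * (z - 3.0) ** 2 / 0.25              # unnormalized N(3, 0.5^2)

def T(x):                                            # candidate (slightly wrong) map
    return 2.8 + 0.6 * x

log_pullback = log_pi_bar(T(x)) + np.log(0.6)        # log pi_bar(T(x)) + log |dT/dx|
kl_estimate = 0.5 * np.var(log_eta(x) - log_pullback)
print("approximate KL divergence:", kl_estimate)     # tends to 0 as T#eta -> pi
```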

Low-dimensional structure

- Key challenge: maps in high dimensions
- Major bottleneck: representation of the map, e.g., the cardinality of the map basis
- How can the construction/representation of high-dimensional transports be made tractable?
- Main idea: exploit the Markov structure of the target distribution
- This leads to various low-dimensional properties of transport maps [Spantini, Bigoni, & M JMLR 2018]:
  1. Decomposability
  2. Sparsity
  3. Low rank

Markov random fields

- Let Z_1, ..., Z_n be random variables with joint density π > 0
- Consider an undirected graph G = (V, E) with one vertex per variable, where

  (i, j) ∉ E  iff  Z_i ⊥⊥ Z_j | Z_{V∖{i,j}}

- G encodes conditional independence (an I-map for π)

[Figure: example undirected graph over the variables, with highlighted vertex sets A, B, and a separating set S.]

Decomposable transport maps

- Definition: a decomposable transport is a map T = T_1 ∘ ⋯ ∘ T_k that factorizes as the composition of finitely many maps of low effective dimension that are triangular (up to a permutation), e.g.,

  T_1(x) = [ A_1(x_1, x_2, x_3),  B_1(x_2, x_3),  C_1(x_3),  x_4,  x_5,  x_6 ]ᵀ
  T_2(x) = [ x_1,  A_2(x_2, x_3, x_4, x_5),  B_2(x_3, x_4, x_5),  C_2(x_4, x_5),  D_2(x_5),  x_6 ]ᵀ
  T_3(x) = [ x_1,  x_2,  x_3,  A_3(x_4),  B_3(x_4, x_5),  C_3(x_4, x_5, x_6) ]ᵀ

  with T = T_1 ∘ T_2 ∘ T_3

- Theorem [Spantini et al. 2018]: decomposable graphical models for π lead to decomposable direct maps T, provided that η(x) = ∏_i η(x_i)

Transport maps and graphical models

Key message:
- Enforce decomposable structure in the approximation space T△, i.e., when solving min_{T ∈ T△} D_KL( T♯η ‖ π )
- A general tool for modeling and computation with non-Gaussian Markov random fields
- In many situations, the elements of the composition T = T_1 ∘ T_2 ∘ ⋯ ∘ T_k can be constructed sequentially

Application to state-space models

- Nonlinear/non-Gaussian state-space model with static parameters Θ:
  - transition density π_{Z_k | Z_{k−1}, Θ}
  - observation density (likelihood) π_{Y_k | Z_k}

[Figure: hidden Markov model graph with states Z_0, Z_1, ..., Z_N and observations Y_0, Y_1, ..., Y_N.]

- Interested in recursively updating the full Bayesian solution,

  π_{Z_{0:k}, Θ | y_{0:k}}  →  π_{Z_{0:k+1}, Θ | y_{0:k+1}}

  (smoothing + sequential parameter inference)

Example: stochastic volatility model

- Stochastic volatility model: latent log-volatilities take the form of an AR(1) process for t = 1, ..., N:

  Z_{t+1} = μ + φ (Z_t − μ) + η_t,   η_t ∼ N(0, 1),   Z_1 ∼ N(0, 1/(1 − φ²))

- Observe the mean return for holding an asset at time t (simulated in the sketch below):

  Y_t = ε_t exp(0.5 Z_t),   ε_t ∼ N(0, 1),   t = 1, ..., N

- Markov structure of π, the posterior of (μ, φ, Z_{1:N}) given Y_{1:N}:

[Figure: chain graph over Z_1, ..., Z_N, with the static parameters μ and φ connected to every state.]
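
A minimal sketch that simulates this model forward, just to fix the notation; the parameter values are illustrative and are not the ones inferred later in the talk.

```python
import numpy as np

# Simulate the stochastic volatility model on the slide:
#   Z_{t+1} = mu + phi (Z_t - mu) + eta_t,   eta_t ~ N(0, 1),
#   Y_t     = eps_t * exp(0.5 * Z_t),        eps_t ~ N(0, 1),
# with Z_1 drawn from the AR(1) stationary distribution N(0, 1/(1 - phi^2)).
rng = np.random.default_rng(3)

def simulate_sv(mu, phi, N):
    Z = np.empty(N)
    Z[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - phi**2)))
    for t in range(N - 1):
        Z[t + 1] = mu + phi * (Z[t] - mu) + rng.standard_normal()
    Y = rng.standard_normal(N) * np.exp(0.5 * Z)     # observed mean returns
    return Z, Y

Z, Y = simulate_sv(mu=-1.0, phi=0.95, N=1000)        # illustrative parameter values
print(Z[:5], Y[:5])
```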

Stochastic volatility model: building the decomposition recursively

- Build the decomposition T = T_0 ∘ T_1 ∘ ⋯ ∘ T_{N−2} recursively; at each stage, the slide shows the Markov structure of the pullback of π through the maps computed so far
- Start with the identity map and find a good first decomposition of the graph G
- Compute an (essentially) 4-D map T_0 and pull back π; the underlying approximation is of (μ, φ, Z_1) | Y_1
- Find a new decomposition of the pullback's graph, compute an (essentially) 4-D map T_1, and pull back π again; the underlying approximation is of (μ, φ, Z_{1:2}) | Y_{1:2}
- Continue the recursion until no edges are left: after T_0, ..., T_{k−1} the underlying approximation is of (μ, φ, Z_{1:k}) | Y_{1:k}, and the full composition approximates (μ, φ, Z_{1:N}) | Y_{1:N}
- Each map T_k is essentially 4-D, regardless of N

The decomposable map

  T_0(x) = [ P_0(x_θ),  A_0(x_θ, x_0, x_1),  B_0(x_θ, x_1),  x_2,  x_3,  x_4,  ...,  x_N ]ᵀ
  T_1(x) = [ P_1(x_θ),  x_0,  A_1(x_θ, x_1, x_2),  B_1(x_θ, x_2),  x_3,  x_4,  ...,  x_N ]ᵀ
  T_2(x) = [ P_2(x_θ),  x_0,  x_1,  A_2(x_θ, x_2, x_3),  B_2(x_θ, x_3),  x_4,  ...,  x_N ]ᵀ

  with T = T_0 ∘ T_1 ∘ T_2 ∘ ⋯

- (P_0 ∘ ⋯ ∘ P_k)♯ η_Θ = π_{Θ | Y_{0:k+1}}   (parameter inference)

Stochastic volatility example

- Infer the log-volatility of the pound/dollar exchange rate, starting on 1 October 1981
- [Figure: filtering (blue) versus smoothing (red) marginals.]

Smoothing marginals

- Just re-evaluate the 4-D maps backwards in time
- Comparison with a "reference" MCMC solution with 10^5 ESS (in red)

Static parameter φ

- Sequential parameter inference
- Comparison with a "reference" MCMC solution (batch algorithm)

Static parameter μ

- Slow accumulation of error over time (sequential algorithm)
- Acceptance rate of 75% for a Metropolis independence sampler with the transport proposal

Long-time smoothing (25 years)

[Figure: smoothed log-volatility over 25 years, with 9/11, the Lehman Brothers bankruptcy, and the Brexit referendum marked.]

- Python code available at http://transportmaps.mit.edu

Vignette #2: Transport + MCMC

- In general, the variational approach yields an approximation T ∈ T△^h of the exact transport map
- Can we still achieve exact posterior sampling?
- Are very cheap or crude approximations still useful?

Key idea: combine map construction with Markov chain Monte Carlo (MCMC)
- Posterior sampling + convex optimization
- The transport map "preconditions" MCMC sampling; posterior samples enable map construction
- Can be understood in the framework of adaptive MCMC

Preconditioning MCMC

- MCMC algorithms are a workhorse of Bayesian computation
- Effective = adapted to the target
- Can we transform proposals or targets for better sampling?

Constructing a transport map from samples

- Seek the inverse transport, from target to reference; a candidate map S yields an approximation S⁻¹♯π_ref of the target
- Variational characterization of the map:

  min_{S ∈ S△} D_KL( S♯π_tar ‖ π_ref ) = min_{S ∈ S△} D_KL( π_tar ‖ S⁻¹♯π_ref )

[Figure: the map S(θ) pushes the target density π_tar(θ) to the reference π_ref = N(0, I); its inverse pulls the reference back to the approximation S⁻¹♯π_ref of the target.]

- Additional structure:
  - choose π_ref = η to be standard Gaussian
  - seek a monotone lower triangular map S ∈ S△
  - samples θ^(i) ∼ π_tar approximate the expectation
  - this yields a convex and separable optimization problem:

  argmin_{S ∈ S△} D_KL( π_tar ‖ S⁻¹♯η ) = argmax_{S ∈ S△} E_{π_tar}[ log η ∘ S + log det ∇S ]

- Sample-average approximation for each map component S^k, k = 1, ..., n (see the sketch below):

  max_{S^k}  (1/M) Σ_{i=1}^{M} [ −½ ( S^k(θ^(i)) )² + log ∂_k S^k(θ^(i)) ]

- Equivalent to maximum likelihood estimation for S
- Parameterize a finite space of monotone triangular maps S△^h and optimize over coefficients
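
A minimal one-dimensional sketch of this sample-based map estimation, assuming a linear monotone map S(θ) = a + exp(c)·θ and toy "target" samples drawn from a Gamma distribution (in the MCMC setting these would be posterior samples). The recipe is exactly the maximization above; with this linear parameterization it simply standardizes the samples, and richer parameterizations Gaussianize them more fully.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of map estimation from samples in one dimension:
# maximize (1/M) sum_i [ -0.5 * S(theta_i)^2 + log S'(theta_i) ]
# over a monotone map S(theta) = a + exp(c) * theta.
rng = np.random.default_rng(4)
theta = rng.gamma(shape=3.0, scale=1.0, size=5000)    # stand-in target samples

def neg_objective(params):
    a, c = params
    S = a + np.exp(c) * theta                  # map evaluated at the samples
    dS = np.exp(c)                             # derivative (constant here)
    return -np.mean(-0.5 * S**2 + np.log(dS))

res = minimize(neg_objective, x0=np.zeros(2))
a_opt, c_opt = res.x
r = a_opt + np.exp(c_opt) * theta              # mapped samples, approximately N(0, 1)
print("mapped mean/std:", r.mean(), r.std())   # close to 0 and 1
```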

Map-accelerated MCMC

- Ingredient #1: static map
  - Idea: perform MCMC in the reference space, on a "preconditioned" density
  - A simple proposal in reference space (e.g., a random walk) corresponds to a more complex, tailored proposal on the target
  - Metropolis–Hastings acceptance ratio for a reference-space proposal q_r (see the sketch below):

  α = [ π(S⁻¹(r′)) |∇S⁻¹|_{r′} q_r(r | r′) ] / [ π(S⁻¹(r)) |∇S⁻¹|_r q_r(r′ | r) ]

  - i.e., a simple proposal q_r on the pushforward of the target through the map is equivalent to a more complex proposal directly on the target distribution

[Figure: the map S(θ) pushes the target π(θ) to a roughly Gaussian density p̃(r); a random-walk proposal q_r(r′ | r) in reference space maps to a tailored proposal θ′ in target space.]
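
A minimal sketch of ingredient #1: random-walk Metropolis run on the preconditioned density π̃(r) = π(S⁻¹(r)) |∇S⁻¹(r)|, with the samples mapped back to the target space. The banana-shaped target and the fixed linear map here are illustrative stand-ins; in the actual scheme S comes from the sample-based optimization above and is generally nonlinear.

```python
import numpy as np

# Random-walk Metropolis in reference space on pi_tilde(r) = pi(S^{-1}(r)) |grad S^{-1}|.
rng = np.random.default_rng(5)

def log_pi(theta):                       # unnormalized banana-shaped target
    return -0.5 * theta[0]**2 - (theta[1] - theta[0]**2)**2

A = np.array([[1.0, 0.0], [0.0, 0.7]])   # toy linear map S(theta) = A @ theta
A_inv = np.linalg.inv(A)
log_det_Ainv = np.log(abs(np.linalg.det(A_inv)))

def log_pi_tilde(r):                     # pushforward of pi through S
    return log_pi(A_inv @ r) + log_det_Ainv

r = np.zeros(2)
samples = []
for _ in range(5000):
    r_prop = r + 0.5 * rng.standard_normal(2)            # symmetric RW proposal q_r
    if np.log(rng.random()) < log_pi_tilde(r_prop) - log_pi_tilde(r):
        r = r_prop                                        # accept
    samples.append(A_inv @ r)                             # map back to target space
print(np.mean(samples, axis=0))
```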

Map-accelerated MCMC

- Ingredient #2: adaptive map
  - Update the map with each MCMC iteration: more samples from π give a more accurate map S, which in turn gives better proposals
  - Adaptive MCMC [Haario 2001, Andrieu 2006], but with a nonlinear transformation to capture non-Gaussian structure

[Figure: as the map is updated from S_k(θ) to S_{k+1}(θ), the pushforward density p̃_k(r) of the target moves closer to the Gaussian reference.]

Map-accelerated MCMC

- Ingredient #3: global proposals
  - If the map becomes sufficiently accurate, we would like to avoid random-walk behavior

  Reference RW proposal:            q_r(r′ | r) = N(r, σ²I)
  Mapped RW proposal:               q_θ(θ′ | θ) = q_r( S(θ′) | S(θ) ) |∇S(θ′)|

  Reference independence proposal:  q_r(r′) = N(0, I)
  Mapped independence proposal:     q_θ(θ′) = q_r( S(θ′) ) |∇S(θ′)|

Map-accelerated MCMC

- Ingredient #3: global proposals (continued)
  - Solution: delayed rejection MCMC [Mira 2001]
  - First proposal = independent sample from η (global, more efficient); second proposal = random walk (local, more robust)
- The entire scheme is provably ergodic with respect to the exact posterior measure [Parno & M, SIAM JUQ 2018]
  - Requires enforcing some regularity conditions on the maps, to preserve the tail behavior of the transformed target

Example: biological oxygen demand model

- Likelihood model (see the sketch below):

  d = θ_1 (1 − exp(−θ_2 x)) + ε,   ε ∼ N(0, 2 × 10⁻⁴)

- 20 noisy observations at x = {5/5, 6/5, ..., 25/5}
- Degree-three polynomial map

[Figure: true posterior density over (θ_1, θ_2).]
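
A minimal sketch of this likelihood, with synthetic data generated at assumed "true" parameter values and at equally spaced x in [1, 5]; the actual data, observation grid, and priors used in the talk are not reproduced here.

```python
import numpy as np

# Synthetic data and log-likelihood for d = theta_1 (1 - exp(-theta_2 x)) + eps,
# eps ~ N(0, 2e-4).  Parameter values and the x grid are illustrative only.
rng = np.random.default_rng(6)
sigma2 = 2e-4
x_obs = np.linspace(1.0, 5.0, 20)                 # 20 equally spaced observation points
theta_true = np.array([1.0, 0.1])                 # assumed "true" parameters
d_obs = (theta_true[0] * (1 - np.exp(-theta_true[1] * x_obs))
         + rng.normal(0.0, np.sqrt(sigma2), size=x_obs.size))

def log_likelihood(theta):
    pred = theta[0] * (1 - np.exp(-theta[1] * x_obs))
    return -0.5 * np.sum((d_obs - pred) ** 2) / sigma2

print(log_likelihood(theta_true))
```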

Results: ESS per computational effort

[Figure: bar charts of effective sample size (ESS) for θ_1 in the BOD example. ESS per 1,000 model evaluations: TM+DR ≈ 161, TM+NUTS ≈ 57, TM+LA ≈ 179, versus NUTS ≈ 1.4 and DRAM ≈ 5.8. ESS per second: TM+DRG ≈ 1468, TM+DRL ≈ 487, TM+MIX ≈ 1495, versus NUTS ≈ 57 and DRAM ≈ 127.]

Transformed distribution

[Figure: the original posterior π_tar alongside the pushforward posterior S♯π_tar.]

Example: maple sap dynamics model

- Coupled PDE system for ice, water, and gas locations [Ceseri & Stockie 2013]
- Measure gas pressure in the vessel
- Infer 10 physical model parameters
- Very challenging posterior!

[Figure: 1-D model geometry through a fiber-vessel pair, showing the gas, ice, and water regions and the moving phase boundaries; image from Ceseri and Stockie, 2013.]

Maple posterior distribution

[Figure: pairwise posterior marginals over the 10 parameters θ_1, ..., θ_10, including (θ_1, θ_7) and (θ_3, θ_6).]

Results: ESS per computational effort

[Figure: bar charts for the maple-sap example. ESS per 10,000 model evaluations: TM+DRG ≈ 5.7, TM+DRL ≈ 10, TM+MIX ≈ 2.9, DRAM ≈ 0.6. ESS per 1,000 seconds: TM+DRG ≈ 18, TM+DRL ≈ 26, TM+MIX ≈ 7.1, DRAM ≈ 2.3.]

Comments on MCMC with transport maps

Useful characteristics of the algorithm:
- Map construction is easily parallelizable
- Requires no gradients from the posterior density

Generalizes many current MCMC techniques:
- Adaptive Metropolis: the map enables non-Gaussian proposals and a natural mixing between local and global moves
- Manifold MCMC [Girolami & Calderhead 2011]: the map also defines a Riemannian metric

Vignette #3: ensemble filtering and map estimation

[Figure: hidden Markov model graph with states Z_0, ..., Z_N and observations Y_0, ..., Y_N.]

- Consider the filtering of state-space models with:
  1. High-dimensional states
  2. Challenging nonlinear dynamics (e.g., chaotic systems)
  3. Intractable transition kernels: one can simulate from π_{Z_{k+1} | Z_k} but cannot evaluate its density
  4. Limited model evaluations, e.g., small ensemble sizes
  5. Sparse and local observations in space/time
- These constraints reflect typical challenges faced in numerical weather prediction and geophysical data assimilation

Ensemble Kalman filter

- State-of-the-art results (in terms of tracking) are typically obtained with the ensemble Kalman filter (EnKF)

[Figure: forecast step π_{Z_{k−1} | Y_{0:k−1}} → π_{Z_k | Y_{0:k−1}}, followed by the analysis step (Bayesian inference) → π_{Z_k | Y_{0:k}}.]

- Move samples via a linear transformation; no weights or resampling!
- Yet it is ultimately inconsistent: it does not converge to the true posterior

Can we generalize the EnKF while preserving scalability, via nonlinear transformations?

Inference as a transportation of measures

- Seek a map T that pushes forward the prior to the posterior:

  (x_1, ..., x_M) ∼ π_X   ⟹   (T(x_1), ..., T(x_M)) ∼ π_{X | Y=y}

- The map induces a coupling between the prior and posterior measures

[Figure: prior samples x_i pushed through the transport map to posterior samples T(x_i).]

How can a "good" coupling be constructed from very few prior samples?

A novel filtering algorithm with maps

[Figure: forecast samples x_i, augmented with simulated observations y_i ∼ π_{Y | X = x_i}, yield samples of the joint density π_{Y,X}; a map T(y, x) then produces samples of the conditional π_{X | Y = y*}.]

Transport map ensemble filter (see the sketch below):
1. Compute the forecast ensemble x_1, ..., x_M
2. Generate samples (y_i, x_i) from π_{Y,X}, with y_i ∼ π_{Y | X = x_i}
3. Build an estimator T̂ of T
4. Compute the analysis ensemble as x^a_i = T̂(y_i, x_i), for i = 1, ..., M
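
A minimal sketch of one analysis step in its simplest, linear-map form, which (as the next slide notes) reduces to an EnKF with "perturbed observations": regressing the state on the simulated observations from the joint ensemble gives x^a_i = x_i − K (y_i − y*), with K estimated from the samples. Nonlinear estimators T̂ follow the same pattern with richer map parameterizations; the toy dimensions and observation operator here are purely illustrative.

```python
import numpy as np

# Linear-map analysis step built from joint samples (y_i, x_i):
#   x_a_i = x_i - K (y_i - y*),  K = Cov(x, y) Cov(y, y)^{-1}.
rng = np.random.default_rng(7)

def analysis_step(x_forecast, observe, obs_cov, y_star):
    """x_forecast: (M, n) forecast ensemble; observe: deterministic obs operator."""
    M = x_forecast.shape[0]
    noise = rng.multivariate_normal(np.zeros(len(y_star)), obs_cov, size=M)
    y = np.array([observe(x) for x in x_forecast]) + noise   # y_i ~ pi(Y | X = x_i)
    x_mean, y_mean = x_forecast.mean(0), y.mean(0)
    C_xy = (x_forecast - x_mean).T @ (y - y_mean) / (M - 1)
    C_yy = np.atleast_2d(np.cov(y.T))
    K = C_xy @ np.linalg.inv(C_yy)
    return x_forecast - (y - y_star) @ K.T                    # x_a_i = T_hat(y_i, x_i)

# Toy usage: 2-D state, observe only the first component
x_f = rng.normal(size=(200, 2))
x_a = analysis_step(x_f, observe=lambda x: x[:1],
                    obs_cov=np.array([[0.1]]), y_star=np.array([1.0]))
print(x_a.mean(axis=0))
```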

Regularized map estimation

  Ŝ^k ∈ argmin_{S^k ∈ S^h△,k}  (1/M) Σ_{i=1}^{M} [ ½ S^k(x_i)² − log ∂_k S^k(x_i) ]

- In general, solve via convex optimization
- Connection to the EnKF: a linear parameterization of Ŝ^k yields a particular form of EnKF with "perturbed observations"
- The choice of approximation space allows control of the bias and variance of Ŝ
  - richer parameterizations yield less bias, but potentially higher variance
- Strategy in high dimensions: gradually introduce nonlinearities, and always impose sparsity
  - there is an explicit link between the sparsity of a nonlinear map S and conditional independence in non-Gaussian graphical models [Spantini, Bigoni, M 2018]

Lorenz 96 in chaotic regime (40-dimensional state)

- A hard test-case configuration [*Bengtsson et al. 2003], simulated in the sketch below:

  dZ_j/dt = (Z_{j+1} − Z_{j−2}) Z_{j−1} − Z_j + F,   j = 1, ..., 40
  Y_j = Z_j + E_j,   j = 1, 3, 5, ..., 39

- F = 8 (chaotic) and E_j ∼ N(0, 0.5) (small noise for a particle filter)
- Time between observations: Δ_obs = 0.4 (large)
- Results computed over 2000 assimilation cycles:

  RMSE    | 400 particles: *EnKF  EnKF  MapF | 200 particles: EnKF  MapF
  median  |                 0.88  0.77  0.61 |                0.79  0.66
  mean    |                 0.97  0.84  0.65 |                0.86  0.73
  mad     |                   -   0.14  0.11 |                0.16  0.13
  std     |                 0.35  0.30  0.21 |                0.31  0.31

- The nonlinear filter is ≈ 25% more accurate in RMSE than the EnKF
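
A minimal sketch of this test case (dynamics plus the observation model only, no filter), assuming the N(0, 0.5) observation noise refers to a variance of 0.5; the filter that would assimilate each batch of observations is left as a comment.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lorenz-96 dynamics with F = 8 and noisy observations of every other
# component every 0.4 time units.
rng = np.random.default_rng(8)
F, dim, dt_obs = 8.0, 40, 0.4

def lorenz96(t, z):
    # dZ_j/dt = (Z_{j+1} - Z_{j-2}) Z_{j-1} - Z_j + F, with periodic indices
    return (np.roll(z, -1) - np.roll(z, 2)) * np.roll(z, 1) - z + F

z = F + 0.01 * rng.standard_normal(dim)              # perturbed rest state
for k in range(10):                                  # ten assimilation windows
    sol = solve_ivp(lorenz96, (0.0, dt_obs), z, rtol=1e-8, atol=1e-8)
    z = sol.y[:, -1]
    y = z[0::2] + rng.normal(0.0, np.sqrt(0.5), size=dim // 2)  # obs of Z_1, Z_3, ...
    # ... a filter (EnKF or the transport-map filter) would assimilate y here
print(z[:5], y[:5])
```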

Lorenz 96: details on the filtering approximation

- Observations were assimilated one at a time
- Impose sparsity of the map with a 5-way interaction model
- Separable and nonlinear parameterization of each component (see the sketch below):

  Ŝ^k(x_{j_1}, ..., x_{j_p}, x_k) = ψ(x_{j_1}) + ... + ψ(x_{j_p}) + ψ(x_k),

  where ψ(x) = a_0 + a_1 x + Σ_{i>1} a_i exp( −(x − c_i)² / σ )

- Much more general parameterizations are of course possible
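
A minimal sketch of that separable parameterization, with each ψ term taken to be a linear function plus a few Gaussian radial-basis bumps; the centers, width, and coefficients are placeholders rather than fitted values, and monotonicity in x_k is not enforced in this illustration.

```python
import numpy as np

# psi(x) = a_0 + a_1 x + sum_{i>1} a_i exp(-(x - c_i)^2 / sigma), and a map
# component S^k built by summing such terms over neighbor variables plus x_k.
def psi(x, a, centers, sigma):
    """Linear term plus radial-basis-function bumps."""
    x = np.asarray(x, dtype=float)
    rbf = np.exp(-(x[..., None] - centers) ** 2 / sigma)
    return a[0] + a[1] * x + rbf @ a[2:]

def S_k(x_neighbors, x_k, params):
    """Separable component: a sum of psi terms over the neighbors and x_k."""
    total = psi(x_k, *params[-1])
    for x_j, p in zip(x_neighbors, params[:-1]):
        total = total + psi(x_j, *p)
    return total

centers, sigma = np.linspace(-2, 2, 4), 0.5                       # placeholder RBF grid
p = (np.array([0.0, 1.0, 0.1, -0.1, 0.1, -0.1]), centers, sigma)  # placeholder coefficients
print(S_k(x_neighbors=[0.3, -0.5], x_k=1.2, params=[p, p, p]))
```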

Lorenz 96: tracking performance of the filter

- Simple and localized nonlinearities can have a large impact!

[Figure: tracking performance of the nonlinear filter versus the EnKF.]

Conclusions

Bayesian inference through the construction of deterministic couplings:
- Variational Bayesian inference; demonstrated for filtering, smoothing, and sequential parameter inference
- Exploiting approximate maps: map-accelerated MCMC
- New nonlinear ensemble filtering schemes

Ongoing work:
- Error analysis of approximate filtering schemes
- Sparse recovery of transport maps from few samples
- Structure learning for continuous non-Gaussian Markov random fields
- Mapping sparse quadrature or QMC schemes
- Nonparametric transports and gradient flows (e.g., Stein variational methods)
- Low-rank transports, likelihood-informed subspaces (LIS), etc.

References

- A preprint on the ensemble filtering scheme is forthcoming.
- A. Spantini, D. Bigoni, Y. Marzouk. "Inference via low-dimensional couplings." JMLR, 2018; arXiv:1703.06131.
- M. Parno, Y. Marzouk. "Transport map accelerated Markov chain Monte Carlo." SIAM/ASA JUQ 6: 645–682, 2018.
- G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, R. Scheichl. "A Stein variational Newton method." NIPS 2018; arXiv:1806.03085.
- R. Morrison, R. Baptista, Y. Marzouk. "Beyond normality: learning sparse probabilistic graphical models in the non-Gaussian setting." NIPS 2017; arXiv:1711.00950.
- Y. Marzouk, T. Moselhy, M. Parno, A. Spantini. "An introduction to sampling via measure transport." Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, H. Owhadi, eds. Springer, 2016; arXiv:1602.05023. (A broad introduction to transport for sampling.)
- T. Moselhy, Y. Marzouk. "Bayesian inference with optimal maps." J. Comput. Phys., 231: 7815–7850, 2012.
- Python code at http://transportmaps.mit.edu; map-accelerated MCMC in http://muq.mit.edu.