
A tour of transport methods for Bayesian computation

Youssef Marzouk, joint work with Ricardo Baptista, Daniele Bigoni, Matthew Parno, & Alessio Spantini

Department of Aeronautics and Astronautics
Center for Computational Engineering
Statistics and Data Science Center
Massachusetts Institute of Technology

http://uqgroup.mit.edu

Support from DOE ASCR, NSF, DARPA

3 December 2018

Bayesian inference in large-scale models

Observations y, parameters x:

  π_pos(x) := π(x | y) ∝ π(y | x) π_pr(x)    (Bayes' rule)

- Goal: characterize the posterior distribution (with density π_pos)
- This is a challenging task since:
  - x ∈ R^n is typically high-dimensional (e.g., a discretized function)
  - π_pos is non-Gaussian
  - evaluations of the likelihood (and hence of π_pos) may be expensive
  - π_pos can be evaluated only up to a normalizing constant

Sequential Bayesian inference

- State estimation (e.g., filtering and smoothing) or joint state and parameter estimation, in a Bayesian setting
- Need recursive, online algorithms for characterizing the posterior distribution

Computational challenges

- Extract information from the posterior (means, covariances, event probabilities, predictions) by evaluating posterior expectations:

  E_{π_pos}[h(x)] = ∫ h(x) π_pos(x) dx

- Key strategies for making this computationally tractable:
  1. Approximations of the forward model (e.g., polynomial approximations, Gaussian process emulators, reduced-order models, multi-fidelity approaches)
  2. Efficient and structure-exploiting sampling schemes
- This talk: relate these strategies to notions of coupling and transport!

Deterministic couplings of probability measures

[Figure: the exact map T pushes the reference density η to the target π, and its inverse S = T⁻¹ pulls π back to η; an inexact map T̃ captures only part of the structure of the target. The accompanying thesis excerpt notes that the choice of transport cost shapes the optimal map: for a Gaussian target, a quadratic cost selects the eigenvalue square root of the covariance, while a limiting weighted cost selects the Cholesky factor and, in general, the Knothe–Rosenblatt (triangular) rearrangement.]

Core idea
- Choose a reference distribution η (e.g., standard Gaussian)
- Seek a transport map T : R^n → R^n such that T♯η = π
- Equivalently, find S = T⁻¹ such that S♯π = η
- Enables exact (independent, unweighted) sampling (see the sketch below)
- Satisfying these conditions only approximately may still be useful!
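
To make the exact-sampling bullet concrete, here is a minimal sketch in the simplest possible setting, a Gaussian target in two dimensions, where the transport map is just a shift plus a Cholesky factor. For the non-Gaussian targets in the rest of the talk, T would instead be a nonlinear map constructed as described below; the target mean and covariance here are illustrative values only.

```python
import numpy as np

# Minimal sketch of sampling by pushing reference draws through a map.
# Here the target is a toy 2-D Gaussian, so the exact map T is linear
# (a shift plus a Cholesky factor); for a general non-Gaussian target,
# T would be the (approximate) transport map discussed in later slides.
rng = np.random.default_rng(0)

mu = np.array([1.0, -2.0])
Sigma = np.array([[2.0, 0.6], [0.6, 1.0]])
L = np.linalg.cholesky(Sigma)

def T(x):
    """Push a standard-Gaussian sample x ~ eta to the target pi = N(mu, Sigma)."""
    return mu + L @ x

x_ref = rng.standard_normal((1000, 2))          # independent samples from eta
x_post = np.array([T(x) for x in x_ref])        # independent, unweighted samples from pi

print(x_post.mean(axis=0), np.cov(x_post.T))    # should be close to mu and Sigma
```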

Topics for this talk

Three vignettes on transport in Bayesian computation:
1. Variational Bayesian inference
2. Accelerating Markov chain Monte Carlo
3. Nonlinear ensemble filtering and map estimation

Choice of transport map

A useful building block is the Knothe–Rosenblatt rearrangement:

  T(x) = [ T^1(x_1),  T^2(x_1, x_2),  ...,  T^n(x_1, x_2, ..., x_n) ]ᵀ

- Exists and is unique (up to ordering) under mild conditions on η, π
- Jacobian determinant is easy to evaluate
- "Exposes" marginals, enables conditional sampling...
- Numerical approximations can employ a monotone parameterization guaranteeing ∂T^k/∂x_k > 0; for example (see the sketch below),

  T^k(x_1, ..., x_k) = a_k(x_1, ..., x_{k-1}) + ∫₀^{x_k} exp( b_k(x_1, ..., x_{k-1}, w) ) dw
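
A minimal sketch of that monotone parameterization for a single component T^k, assuming, purely for illustration, that a_k and b_k are affine functions of their arguments; the one-dimensional integral is evaluated by Gauss–Legendre quadrature. All coefficient names here are placeholders, not values from the talk.

```python
import numpy as np

# Minimal sketch of one component of a monotone triangular map,
#   T^k(x_1..x_k) = a_k(x_1..x_{k-1}) + int_0^{x_k} exp(b_k(x_1..x_{k-1}, w)) dw,
# with a_k and b_k taken to be simple affine functions for illustration.

def a_k(x_prev, coeffs_a):
    """Non-monotone part: affine in the preceding variables."""
    return coeffs_a[0] + np.dot(coeffs_a[1:], x_prev)

def b_k(x_prev, w, coeffs_b):
    """Exponent of the integrand; any form works since exp(.) > 0."""
    return coeffs_b[0] + np.dot(coeffs_b[1:-1], x_prev) + coeffs_b[-1] * w

def T_k(x, coeffs_a, coeffs_b, n_quad=50):
    """Evaluate T^k(x_1,...,x_k) using Gauss-Legendre quadrature on [0, x_k]."""
    x_prev, x_last = x[:-1], x[-1]
    nodes, weights = np.polynomial.legendre.leggauss(n_quad)
    w = 0.5 * x_last * (nodes + 1.0)               # map [-1, 1] -> [0, x_k]
    integrand = np.exp(b_k(x_prev, w, coeffs_b))
    integral = 0.5 * x_last * np.dot(weights, integrand)
    return a_k(x_prev, coeffs_a) + integral

# dT^k/dx_k = exp(b_k(x_prev, x_k)) > 0, so T^k is increasing in its last argument.
x = np.array([0.3, -1.2, 0.7])
print(T_k(x, coeffs_a=np.array([0.1, 0.2, -0.3]),
          coeffs_b=np.array([0.0, 0.1, 0.1, -0.2])))
```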

Variational inference

Variational characterization of the direct map T [Moselhy & M 2012]:

  min_{T ∈ T△} D_KL( T♯η ‖ π ) = min_{T ∈ T△} D_KL( η ‖ T⁻¹♯π )

- T△ is the set of monotone lower triangular maps; it contains the Knothe–Rosenblatt rearrangement
- The expectation is with respect to the reference measure η; compute it via, e.g., Monte Carlo or sparse quadrature
- Uses unnormalized evaluations of π and its gradients; no MCMC or importance sampling

Simple example

  min_T E_η[ −log π ∘ T − Σ_k log ∂T^k/∂x_k ]

- Parameterized map T ∈ T△^h ⊂ T△
- Optimize over the coefficients of the parameterization, using gradient-based optimization (see the sketch below)
- In this example, the posterior lies in the tail of the reference
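
A minimal one-dimensional sketch of this optimization, assuming a linear monotone map T(x) = a + exp(c)·x, a fixed Monte Carlo sample from the reference, and an unnormalized Gaussian stand-in for the target density. None of these choices come from the slides; they only illustrate the recipe "parameterize the map, then minimize the sample-average KL objective over its coefficients."

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of  min_T E_eta[ -log pi(T(x)) - sum_k log dT^k/dx_k ]
# in one dimension, with T(x) = a + exp(c) * x and a fixed sample from
# the reference eta = N(0, 1).  The target below is an illustrative
# unnormalized Gaussian, standing in for an unnormalized posterior.
rng = np.random.default_rng(1)
x_ref = rng.standard_normal(2000)               # samples from eta

def log_pi_bar(z):
    return -0.5 * (z - 3.0) ** 2 / 0.25         # unnormalized N(3, 0.5^2)

def objective(params):
    a, c = params
    Tx = a + np.exp(c) * x_ref                  # map applied to reference samples
    log_jac = c                                 # log dT/dx = c (constant here)
    return np.mean(-log_pi_bar(Tx) - log_jac)

res = minimize(objective, x0=np.zeros(2))       # BFGS with numerical gradients
a_opt, c_opt = res.x
print("shift ~", a_opt, "scale ~", np.exp(c_opt))   # expect roughly 3 and 0.5
```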

Useful features

- Move samples; don't just reweigh them
- Independent and cheap samples: x_i ∼ η  ⟹  T(x_i) ∼ T♯η ≈ π
- Clear convergence criterion, even with an unnormalized target density (see the sketch below):

  D_KL( T♯η ‖ π ) ≈ ½ Var_η[ log( η / T⁻¹♯π̄ ) ]

- Can either accept the bias or reduce it by:
  - increasing the complexity of the map T ∈ T△^h
  - sampling the pullback T⁻¹♯π using MCMC or importance sampling
- Many recent constructions also employ transport for variational inference (Stein variational gradient descent [Liu & Wang 2016], normalizing flows [Rezende & Mohamed 2015]) or for sampling (Gibbs flows [Heng et al. 2015], the particle flow filter [Reich 2011], implicit sampling [Chorin et al. 2009–2015])
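
A minimal sketch of that variance diagnostic: draw reference samples, evaluate the log of the reference density minus the log of the pulled-back unnormalized target, and take half the sample variance. The one-dimensional map and target below are illustrative placeholders; the diagnostic is unchanged when π̄ is unnormalized, since additive constants drop out of the variance.

```python
import numpy as np

# Minimal sketch of the diagnostic
#   D_KL(T#eta || pi)  ~  (1/2) Var_eta[ log eta(x) - log (T^{-1}#pi_bar)(x) ],
# where the pullback density is (T^{-1}#pi_bar)(x) = pi_bar(T(x)) |dT/dx|.
rng = np.random.default_rng(2)
x = rng.standard_normal(5000)                        # x ~ eta = N(0, 1)

def log_eta(x):
    return -0.5 * x**2 - 0.5 * np.log(2 * np.pi)

def log_pi_bar(z):
    return -0.5 * (z - 3.0) ** 2 / 0.25              # unnormalized N(3, 0.5^2)

def T(x):                                            # candidate (slightly wrong) map
    return 2.8 + 0.6 * x

log_pullback = log_pi_bar(T(x)) + np.log(0.6)        # log pi_bar(T(x)) + log |dT/dx|
kl_estimate = 0.5 * np.var(log_eta(x) - log_pullback)
print("approximate KL divergence:", kl_estimate)     # tends to 0 as T#eta -> pi
```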

Low-dimensional structure

- Key challenge: maps in high dimensions
- Major bottleneck: representation of the map, e.g., the cardinality of the map basis
- How can the construction/representation of high-dimensional transports be made tractable?
- Main idea: exploit the Markov structure of the target distribution
- This leads to various low-dimensional properties of transport maps [Spantini, Bigoni, & M JMLR 2018]:
  1. Decomposability
  2. Sparsity
  3. Low rank

Markov random fields

- Let Z_1, ..., Z_n be random variables with joint density π > 0
- Consider an undirected graph G = (V, E) with one vertex per variable, where

  (i, j) ∉ E  iff  Z_i ⊥⊥ Z_j | Z_{V∖{i,j}}

- G encodes conditional independence (an I-map for π)

[Figure: example undirected graph over the variables, with highlighted vertex sets A, B, and a separating set S.]

Decomposable transport maps

- Definition: a decomposable transport is a map T = T_1 ∘ ⋯ ∘ T_k that factorizes as the composition of finitely many maps of low effective dimension that are triangular (up to a permutation), e.g.,

  T_1(x) = [ A_1(x_1, x_2, x_3),  B_1(x_2, x_3),  C_1(x_3),  x_4,  x_5,  x_6 ]ᵀ
  T_2(x) = [ x_1,  A_2(x_2, x_3, x_4, x_5),  B_2(x_3, x_4, x_5),  C_2(x_4, x_5),  D_2(x_5),  x_6 ]ᵀ
  T_3(x) = [ x_1,  x_2,  x_3,  A_3(x_4),  B_3(x_4, x_5),  C_3(x_4, x_5, x_6) ]ᵀ

  with T = T_1 ∘ T_2 ∘ T_3

- Theorem [Spantini et al. 2018]: decomposable graphical models for π lead to decomposable direct maps T, provided that η(x) = ∏_i η(x_i)

Transport maps and graphical models

Key message:
- Enforce decomposable structure in the approximation space T△, i.e., when solving min_{T ∈ T△} D_KL( T♯η ‖ π )
- A general tool for modeling and computation with non-Gaussian Markov random fields
- In many situations, the elements of the composition T = T_1 ∘ T_2 ∘ ⋯ ∘ T_k can be constructed sequentially

Application to state-space models

- Nonlinear/non-Gaussian state-space model with static parameters Θ:
  - transition density π_{Z_k | Z_{k−1}, Θ}
  - observation density (likelihood) π_{Y_k | Z_k}

[Figure: hidden Markov model graph with states Z_0, Z_1, ..., Z_N and observations Y_0, Y_1, ..., Y_N.]

- Interested in recursively updating the full Bayesian solution,

  π_{Z_{0:k}, Θ | y_{0:k}}  →  π_{Z_{0:k+1}, Θ | y_{0:k+1}}

  (smoothing + sequential parameter inference)

Example: stochastic volatility model

- Stochastic volatility model: latent log-volatilities take the form of an AR(1) process for t = 1, ..., N:

  Z_{t+1} = μ + φ (Z_t − μ) + η_t,   η_t ∼ N(0, 1),   Z_1 ∼ N(0, 1/(1 − φ²))

- Observe the mean return for holding an asset at time t (simulated in the sketch below):

  Y_t = ε_t exp(0.5 Z_t),   ε_t ∼ N(0, 1),   t = 1, ..., N

- Markov structure of π, the posterior of (μ, φ, Z_{1:N}) given Y_{1:N}:

[Figure: chain graph over Z_1, ..., Z_N, with the static parameters μ and φ connected to every state.]
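
A minimal sketch that simulates this model forward, just to fix the notation; the parameter values are illustrative and are not the ones inferred later in the talk.

```python
import numpy as np

# Simulate the stochastic volatility model on the slide:
#   Z_{t+1} = mu + phi (Z_t - mu) + eta_t,   eta_t ~ N(0, 1),
#   Y_t     = eps_t * exp(0.5 * Z_t),        eps_t ~ N(0, 1),
# with Z_1 drawn from the AR(1) stationary distribution N(0, 1/(1 - phi^2)).
rng = np.random.default_rng(3)

def simulate_sv(mu, phi, N):
    Z = np.empty(N)
    Z[0] = rng.normal(0.0, np.sqrt(1.0 / (1.0 - phi**2)))
    for t in range(N - 1):
        Z[t + 1] = mu + phi * (Z[t] - mu) + rng.standard_normal()
    Y = rng.standard_normal(N) * np.exp(0.5 * Z)     # observed mean returns
    return Z, Y

Z, Y = simulate_sv(mu=-1.0, phi=0.95, N=1000)        # illustrative parameter values
print(Z[:5], Y[:5])
```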

Stochastic volatility model: building the decomposition recursively

- Build the decomposition T = T_0 ∘ T_1 ∘ ⋯ ∘ T_{N−2} recursively; at each stage, the slide shows the Markov structure of the pullback of π through the maps computed so far
- Start with the identity map and find a good first decomposition of the graph G
- Compute an (essentially) 4-D map T_0 and pull back π; the underlying approximation is of (μ, φ, Z_1) | Y_1
- Find a new decomposition of the pullback's graph, compute an (essentially) 4-D map T_1, and pull back π again; the underlying approximation is of (μ, φ, Z_{1:2}) | Y_{1:2}
- Continue the recursion until no edges are left: after T_0, ..., T_{k−1} the underlying approximation is of (μ, φ, Z_{1:k}) | Y_{1:k}, and the full composition approximates (μ, φ, Z_{1:N}) | Y_{1:N}
- Each map T_k is essentially 4-D, regardless of N

The decomposable map

  T_0(x) = [ P_0(x_θ),  A_0(x_θ, x_0, x_1),  B_0(x_θ, x_1),  x_2,  x_3,  x_4,  ...,  x_N ]ᵀ
  T_1(x) = [ P_1(x_θ),  x_0,  A_1(x_θ, x_1, x_2),  B_1(x_θ, x_2),  x_3,  x_4,  ...,  x_N ]ᵀ
  T_2(x) = [ P_2(x_θ),  x_0,  x_1,  A_2(x_θ, x_2, x_3),  B_2(x_θ, x_3),  x_4,  ...,  x_N ]ᵀ

  with T = T_0 ∘ T_1 ∘ T_2 ∘ ⋯

- (P_0 ∘ ⋯ ∘ P_k)♯ η_Θ = π_{Θ | Y_{0:k+1}}   (parameter inference)

Stochastic volatility example

- Infer the log-volatility of the pound/dollar exchange rate, starting on 1 October 1981
- [Figure: filtering (blue) versus smoothing (red) marginals.]

Smoothing marginals

- Just re-evaluate the 4-D maps backwards in time
- Comparison with a "reference" MCMC solution with 10^5 ESS (in red)

Static parameter φ

- Sequential parameter inference
- Comparison with a "reference" MCMC solution (batch algorithm)

Static parameter μ

- Slow accumulation of error over time (sequential algorithm)
- Acceptance rate of 75% for a Metropolis independence sampler with the transport proposal

Long-time smoothing (25 years)

[Figure: smoothed log-volatility over 25 years, with 9/11, the Lehman Brothers bankruptcy, and the Brexit referendum marked.]

- Python code available at http://transportmaps.mit.edu

Vignette #2: Transport + MCMC

- In general, the variational approach yields an approximation T ∈ T△^h of the exact transport map
- Can we still achieve exact posterior sampling?
- Are very cheap or crude approximations still useful?

Key idea: combine map construction with Markov chain Monte Carlo (MCMC)
- Posterior sampling + convex optimization
- The transport map "preconditions" MCMC sampling; posterior samples enable map construction
- Can be understood in the framework of adaptive MCMC

Preconditioning MCMC

- MCMC algorithms are a workhorse of Bayesian computation
- Effective = adapted to the target
- Can we transform proposals or targets for better sampling?

Constructing a transport map from samples

- Seek the inverse transport, from target to reference; a candidate map S yields an approximation S⁻¹♯π_ref of the target
- Variational characterization of the map:

  min_{S ∈ S△} D_KL( S♯π_tar ‖ π_ref ) = min_{S ∈ S△} D_KL( π_tar ‖ S⁻¹♯π_ref )

[Figure: the map S(θ) pushes the target density π_tar(θ) to the reference π_ref = N(0, I); its inverse pulls the reference back to the approximation S⁻¹♯π_ref of the target.]

- Additional structure:
  - choose π_ref = η to be standard Gaussian
  - seek a monotone lower triangular map S ∈ S△
  - samples θ^(i) ∼ π_tar approximate the expectation
  - this yields a convex and separable optimization problem:

  argmin_{S ∈ S△} D_KL( π_tar ‖ S⁻¹♯η ) = argmax_{S ∈ S△} E_{π_tar}[ log η ∘ S + log det ∇S ]

- Sample-average approximation for each map component S^k, k = 1, ..., n (see the sketch below):

  max_{S^k}  (1/M) Σ_{i=1}^{M} [ −½ ( S^k(θ^(i)) )² + log ∂_k S^k(θ^(i)) ]

- Equivalent to maximum likelihood estimation for S
- Parameterize a finite space of monotone triangular maps S△^h and optimize over coefficients
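
A minimal one-dimensional sketch of this sample-based map estimation, assuming a linear monotone map S(θ) = a + exp(c)·θ and toy "target" samples drawn from a Gamma distribution (in the MCMC setting these would be posterior samples). The recipe is exactly the maximization above; with this linear parameterization it simply standardizes the samples, and richer parameterizations Gaussianize them more fully.

```python
import numpy as np
from scipy.optimize import minimize

# Minimal sketch of map estimation from samples in one dimension:
# maximize (1/M) sum_i [ -0.5 * S(theta_i)^2 + log S'(theta_i) ]
# over a monotone map S(theta) = a + exp(c) * theta.
rng = np.random.default_rng(4)
theta = rng.gamma(shape=3.0, scale=1.0, size=5000)    # stand-in target samples

def neg_objective(params):
    a, c = params
    S = a + np.exp(c) * theta                  # map evaluated at the samples
    dS = np.exp(c)                             # derivative (constant here)
    return -np.mean(-0.5 * S**2 + np.log(dS))

res = minimize(neg_objective, x0=np.zeros(2))
a_opt, c_opt = res.x
r = a_opt + np.exp(c_opt) * theta              # mapped samples, approximately N(0, 1)
print("mapped mean/std:", r.mean(), r.std())   # close to 0 and 1
```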

Map-accelerated MCMC

- Ingredient #1: static map
  - Idea: perform MCMC in the reference space, on a "preconditioned" density
  - A simple proposal in reference space (e.g., a random walk) corresponds to a more complex, tailored proposal on the target
  - Metropolis–Hastings acceptance ratio for a reference-space proposal q_r (see the sketch below):

  α = [ π(S⁻¹(r′)) |∇S⁻¹|_{r′} q_r(r | r′) ] / [ π(S⁻¹(r)) |∇S⁻¹|_r q_r(r′ | r) ]

  - i.e., a simple proposal q_r on the pushforward of the target through the map is equivalent to a more complex proposal directly on the target distribution

[Figure: the map S(θ) pushes the target π(θ) to a roughly Gaussian density p̃(r); a random-walk proposal q_r(r′ | r) in reference space maps to a tailored proposal θ′ in target space.]
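
A minimal sketch of ingredient #1: random-walk Metropolis run on the preconditioned density π̃(r) = π(S⁻¹(r)) |∇S⁻¹(r)|, with the samples mapped back to the target space. The banana-shaped target and the fixed linear map here are illustrative stand-ins; in the actual scheme S comes from the sample-based optimization above and is generally nonlinear.

```python
import numpy as np

# Random-walk Metropolis in reference space on pi_tilde(r) = pi(S^{-1}(r)) |grad S^{-1}|.
rng = np.random.default_rng(5)

def log_pi(theta):                       # unnormalized banana-shaped target
    return -0.5 * theta[0]**2 - (theta[1] - theta[0]**2)**2

A = np.array([[1.0, 0.0], [0.0, 0.7]])   # toy linear map S(theta) = A @ theta
A_inv = np.linalg.inv(A)
log_det_Ainv = np.log(abs(np.linalg.det(A_inv)))

def log_pi_tilde(r):                     # pushforward of pi through S
    return log_pi(A_inv @ r) + log_det_Ainv

r = np.zeros(2)
samples = []
for _ in range(5000):
    r_prop = r + 0.5 * rng.standard_normal(2)            # symmetric RW proposal q_r
    if np.log(rng.random()) < log_pi_tilde(r_prop) - log_pi_tilde(r):
        r = r_prop                                        # accept
    samples.append(A_inv @ r)                             # map back to target space
print(np.mean(samples, axis=0))
```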

Map-accelerated MCMC

- Ingredient #2: adaptive map
  - Update the map with each MCMC iteration: more samples from π give a more accurate map S, which in turn gives better proposals
  - Adaptive MCMC [Haario 2001, Andrieu 2006], but with a nonlinear transformation to capture non-Gaussian structure

[Figure: as the map is updated from S_k(θ) to S_{k+1}(θ), the pushforward density p̃_k(r) of the target moves closer to the Gaussian reference.]

Map-accelerated MCMC

- Ingredient #3: global proposals
  - If the map becomes sufficiently accurate, we would like to avoid random-walk behavior

  Reference RW proposal:            q_r(r′ | r) = N(r, σ²I)
  Mapped RW proposal:               q_θ(θ′ | θ) = q_r( S(θ′) | S(θ) ) |∇S(θ′)|

  Reference independence proposal:  q_r(r′) = N(0, I)
  Mapped independence proposal:     q_θ(θ′) = q_r( S(θ′) ) |∇S(θ′)|

Map-accelerated MCMC

- Ingredient #3: global proposals (continued)
  - Solution: delayed rejection MCMC [Mira 2001]
  - First proposal = independent sample from η (global, more efficient); second proposal = random walk (local, more robust)
- The entire scheme is provably ergodic with respect to the exact posterior measure [Parno & M, SIAM JUQ 2018]
  - Requires enforcing some regularity conditions on the maps, to preserve the tail behavior of the transformed target

Example: biological oxygen demand model

- Likelihood model (see the sketch below):

  d = θ_1 (1 − exp(−θ_2 x)) + ε,   ε ∼ N(0, 2 × 10⁻⁴)

- 20 noisy observations at x = {5/5, 6/5, ..., 25/5}
- Degree-three polynomial map

[Figure: true posterior density over (θ_1, θ_2).]
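
A minimal sketch of this likelihood, with synthetic data generated at assumed "true" parameter values and at equally spaced x in [1, 5]; the actual data, observation grid, and priors used in the talk are not reproduced here.

```python
import numpy as np

# Synthetic data and log-likelihood for d = theta_1 (1 - exp(-theta_2 x)) + eps,
# eps ~ N(0, 2e-4).  Parameter values and the x grid are illustrative only.
rng = np.random.default_rng(6)
sigma2 = 2e-4
x_obs = np.linspace(1.0, 5.0, 20)                 # 20 equally spaced observation points
theta_true = np.array([1.0, 0.1])                 # assumed "true" parameters
d_obs = (theta_true[0] * (1 - np.exp(-theta_true[1] * x_obs))
         + rng.normal(0.0, np.sqrt(sigma2), size=x_obs.size))

def log_likelihood(theta):
    pred = theta[0] * (1 - np.exp(-theta[1] * x_obs))
    return -0.5 * np.sum((d_obs - pred) ** 2) / sigma2

print(log_likelihood(theta_true))
```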

Results: ESS per computational effort

[Figure: bar charts of effective sample size (ESS) for θ_1 in the BOD example. ESS per 1,000 model evaluations: TM+DR ≈ 161, TM+NUTS ≈ 57, TM+LA ≈ 179, versus NUTS ≈ 1.4 and DRAM ≈ 5.8. ESS per second: TM+DRG ≈ 1468, TM+DRL ≈ 487, TM+MIX ≈ 1495, versus NUTS ≈ 57 and DRAM ≈ 127.]

Transformed distribution

[Figure: the original posterior π_tar alongside the pushforward posterior S♯π_tar.]

Example: maple sap dynamics model

- Coupled PDE system for ice, water, and gas locations [Ceseri & Stockie 2013]
- Measure gas pressure in the vessel
- Infer 10 physical model parameters
- Very challenging posterior!

[Figure: 1-D model geometry through a fiber-vessel pair, showing the gas, ice, and water regions and the moving phase boundaries; image from Ceseri and Stockie, 2013.]

Maple posterior distribution

[Figure: pairwise posterior marginals over the 10 parameters θ_1, ..., θ_10, including (θ_1, θ_7) and (θ_3, θ_6).]

Results: ESS per computational effort

[Figure: bar charts for the maple-sap example. ESS per 10,000 model evaluations: TM+DRG ≈ 5.7, TM+DRL ≈ 10, TM+MIX ≈ 2.9, DRAM ≈ 0.6. ESS per 1,000 seconds: TM+DRG ≈ 18, TM+DRL ≈ 26, TM+MIX ≈ 7.1, DRAM ≈ 2.3.]

Comments on MCMC with transport maps

Useful characteristics of the algorithm:
- Map construction is easily parallelizable
- Requires no gradients from the posterior density

Generalizes many current MCMC techniques:
- Adaptive Metropolis: the map enables non-Gaussian proposals and a natural mixing between local and global moves
- Manifold MCMC [Girolami & Calderhead 2011]: the map also defines a Riemannian metric

Vignette #3: ensemble filtering and map estimation

[Figure: hidden Markov model graph with states Z_0, ..., Z_N and observations Y_0, ..., Y_N.]

- Consider the filtering of state-space models with:
  1. High-dimensional states
  2. Challenging nonlinear dynamics (e.g., chaotic systems)
  3. Intractable transition kernels: one can simulate from π_{Z_{k+1} | Z_k} but cannot evaluate its density
  4. Limited model evaluations, e.g., small ensemble sizes
  5. Sparse and local observations in space/time
- These constraints reflect typical challenges faced in numerical weather prediction and geophysical data assimilation

Ensemble Kalman filter

- State-of-the-art results (in terms of tracking) are typically obtained with the ensemble Kalman filter (EnKF)

[Figure: forecast step π_{Z_{k−1} | Y_{0:k−1}} → π_{Z_k | Y_{0:k−1}}, followed by the analysis step (Bayesian inference) → π_{Z_k | Y_{0:k}}.]

- Move samples via a linear transformation; no weights or resampling!
- Yet it is ultimately inconsistent: it does not converge to the true posterior

Can we generalize the EnKF while preserving scalability, via nonlinear transformations?

Inference as a transportation of measures

- Seek a map T that pushes forward the prior to the posterior:

  (x_1, ..., x_M) ∼ π_X   ⟹   (T(x_1), ..., T(x_M)) ∼ π_{X | Y=y}

- The map induces a coupling between the prior and posterior measures

[Figure: prior samples x_i pushed through the transport map to posterior samples T(x_i).]

How can a "good" coupling be constructed from very few prior samples?

A novel filtering algorithm with maps

[Figure: forecast samples x_i, augmented with simulated observations y_i ∼ π_{Y | X = x_i}, yield samples of the joint density π_{Y,X}; a map T(y, x) then produces samples of the conditional π_{X | Y = y*}.]

Transport map ensemble filter (see the sketch below):
1. Compute the forecast ensemble x_1, ..., x_M
2. Generate samples (y_i, x_i) from π_{Y,X}, with y_i ∼ π_{Y | X = x_i}
3. Build an estimator T̂ of T
4. Compute the analysis ensemble as x^a_i = T̂(y_i, x_i), for i = 1, ..., M
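
A minimal sketch of one analysis step in its simplest, linear-map form, which (as the next slide notes) reduces to an EnKF with "perturbed observations": regressing the state on the simulated observations from the joint ensemble gives x^a_i = x_i − K (y_i − y*), with K estimated from the samples. Nonlinear estimators T̂ follow the same pattern with richer map parameterizations; the toy dimensions and observation operator here are purely illustrative.

```python
import numpy as np

# Linear-map analysis step built from joint samples (y_i, x_i):
#   x_a_i = x_i - K (y_i - y*),  K = Cov(x, y) Cov(y, y)^{-1}.
rng = np.random.default_rng(7)

def analysis_step(x_forecast, observe, obs_cov, y_star):
    """x_forecast: (M, n) forecast ensemble; observe: deterministic obs operator."""
    M = x_forecast.shape[0]
    noise = rng.multivariate_normal(np.zeros(len(y_star)), obs_cov, size=M)
    y = np.array([observe(x) for x in x_forecast]) + noise   # y_i ~ pi(Y | X = x_i)
    x_mean, y_mean = x_forecast.mean(0), y.mean(0)
    C_xy = (x_forecast - x_mean).T @ (y - y_mean) / (M - 1)
    C_yy = np.atleast_2d(np.cov(y.T))
    K = C_xy @ np.linalg.inv(C_yy)
    return x_forecast - (y - y_star) @ K.T                    # x_a_i = T_hat(y_i, x_i)

# Toy usage: 2-D state, observe only the first component
x_f = rng.normal(size=(200, 2))
x_a = analysis_step(x_f, observe=lambda x: x[:1],
                    obs_cov=np.array([[0.1]]), y_star=np.array([1.0]))
print(x_a.mean(axis=0))
```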

Regularized map estimation

  Ŝ^k ∈ argmin_{S^k ∈ S^h△,k}  (1/M) Σ_{i=1}^{M} [ ½ S^k(x_i)² − log ∂_k S^k(x_i) ]

- In general, solve via convex optimization
- Connection to the EnKF: a linear parameterization of Ŝ^k yields a particular form of EnKF with "perturbed observations"
- The choice of approximation space allows control of the bias and variance of Ŝ
  - richer parameterizations yield less bias, but potentially higher variance
- Strategy in high dimensions: gradually introduce nonlinearities, and always impose sparsity
  - there is an explicit link between the sparsity of a nonlinear map S and conditional independence in non-Gaussian graphical models [Spantini, Bigoni, M 2018]

Lorenz 96 in chaotic regime (40-dimensional state)

- A hard test-case configuration [*Bengtsson et al. 2003], simulated in the sketch below:

  dZ_j/dt = (Z_{j+1} − Z_{j−2}) Z_{j−1} − Z_j + F,   j = 1, ..., 40
  Y_j = Z_j + E_j,   j = 1, 3, 5, ..., 39

- F = 8 (chaotic) and E_j ∼ N(0, 0.5) (small noise for a particle filter)
- Time between observations: Δ_obs = 0.4 (large)
- Results computed over 2000 assimilation cycles:

  RMSE    | 400 particles: *EnKF  EnKF  MapF | 200 particles: EnKF  MapF
  median  |                 0.88  0.77  0.61 |                0.79  0.66
  mean    |                 0.97  0.84  0.65 |                0.86  0.73
  mad     |                   -   0.14  0.11 |                0.16  0.13
  std     |                 0.35  0.30  0.21 |                0.31  0.31

- The nonlinear filter is ≈ 25% more accurate in RMSE than the EnKF
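
A minimal sketch of this test case (dynamics plus the observation model only, no filter), assuming the N(0, 0.5) observation noise refers to a variance of 0.5; the filter that would assimilate each batch of observations is left as a comment.

```python
import numpy as np
from scipy.integrate import solve_ivp

# Lorenz-96 dynamics with F = 8 and noisy observations of every other
# component every 0.4 time units.
rng = np.random.default_rng(8)
F, dim, dt_obs = 8.0, 40, 0.4

def lorenz96(t, z):
    # dZ_j/dt = (Z_{j+1} - Z_{j-2}) Z_{j-1} - Z_j + F, with periodic indices
    return (np.roll(z, -1) - np.roll(z, 2)) * np.roll(z, 1) - z + F

z = F + 0.01 * rng.standard_normal(dim)              # perturbed rest state
for k in range(10):                                  # ten assimilation windows
    sol = solve_ivp(lorenz96, (0.0, dt_obs), z, rtol=1e-8, atol=1e-8)
    z = sol.y[:, -1]
    y = z[0::2] + rng.normal(0.0, np.sqrt(0.5), size=dim // 2)  # obs of Z_1, Z_3, ...
    # ... a filter (EnKF or the transport-map filter) would assimilate y here
print(z[:5], y[:5])
```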

Lorenz 96: details on the filtering approximation

- Observations were assimilated one at a time
- Impose sparsity of the map with a 5-way interaction model
- Separable and nonlinear parameterization of each component (see the sketch below):

  Ŝ^k(x_{j_1}, ..., x_{j_p}, x_k) = ψ(x_{j_1}) + ... + ψ(x_{j_p}) + ψ(x_k),

  where ψ(x) = a_0 + a_1 x + Σ_{i>1} a_i exp( −(x − c_i)² / σ )

- Much more general parameterizations are of course possible
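
A minimal sketch of that separable parameterization, with each ψ term taken to be a linear function plus a few Gaussian radial-basis bumps; the centers, width, and coefficients are placeholders rather than fitted values, and monotonicity in x_k is not enforced in this illustration.

```python
import numpy as np

# psi(x) = a_0 + a_1 x + sum_{i>1} a_i exp(-(x - c_i)^2 / sigma), and a map
# component S^k built by summing such terms over neighbor variables plus x_k.
def psi(x, a, centers, sigma):
    """Linear term plus radial-basis-function bumps."""
    x = np.asarray(x, dtype=float)
    rbf = np.exp(-(x[..., None] - centers) ** 2 / sigma)
    return a[0] + a[1] * x + rbf @ a[2:]

def S_k(x_neighbors, x_k, params):
    """Separable component: a sum of psi terms over the neighbors and x_k."""
    total = psi(x_k, *params[-1])
    for x_j, p in zip(x_neighbors, params[:-1]):
        total = total + psi(x_j, *p)
    return total

centers, sigma = np.linspace(-2, 2, 4), 0.5                       # placeholder RBF grid
p = (np.array([0.0, 1.0, 0.1, -0.1, 0.1, -0.1]), centers, sigma)  # placeholder coefficients
print(S_k(x_neighbors=[0.3, -0.5], x_k=1.2, params=[p, p, p]))
```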

Lorenz 96: tracking performance of the filter

- Simple and localized nonlinearities can have a large impact!

[Figure: tracking performance of the nonlinear filter versus the EnKF.]

Conclusions

Bayesian inference through the construction of deterministic couplings:
- Variational Bayesian inference; demonstrated for filtering, smoothing, and sequential parameter inference
- Exploiting approximate maps: map-accelerated MCMC
- New nonlinear ensemble filtering schemes

Ongoing work:
- Error analysis of approximate filtering schemes
- Sparse recovery of transport maps from few samples
- Structure learning for continuous non-Gaussian Markov random fields
- Mapping sparse quadrature or QMC schemes
- Nonparametric transports and gradient flows (e.g., Stein variational methods)
- Low-rank transports, likelihood-informed subspaces (LIS), etc.

References

- A preprint on the ensemble filtering scheme is forthcoming.
- A. Spantini, D. Bigoni, Y. Marzouk. "Inference via low-dimensional couplings." JMLR, 2018; arXiv:1703.06131.
- M. Parno, Y. Marzouk. "Transport map accelerated Markov chain Monte Carlo." SIAM/ASA JUQ 6: 645–682, 2018.
- G. Detommaso, T. Cui, A. Spantini, Y. Marzouk, R. Scheichl. "A Stein variational Newton method." NIPS 2018; arXiv:1806.03085.
- R. Morrison, R. Baptista, Y. Marzouk. "Beyond normality: learning sparse probabilistic graphical models in the non-Gaussian setting." NIPS 2017; arXiv:1711.00950.
- Y. Marzouk, T. Moselhy, M. Parno, A. Spantini. "An introduction to sampling via measure transport." Handbook of Uncertainty Quantification, R. Ghanem, D. Higdon, H. Owhadi, eds. Springer, 2016; arXiv:1602.05023. (A broad introduction to transport for sampling.)
- T. Moselhy, Y. Marzouk. "Bayesian inference with optimal maps." J. Comput. Phys., 231: 7815–7850, 2012.
- Python code at http://transportmaps.mit.edu; map-accelerated MCMC in http://muq.mit.edu.