TRANSCRIPT
Meta Particle Flow for Sequential Bayesian Inference
Le Song
Associate Professor, CSE
Associate Director, Machine Learning Center
Georgia Institute of Technology
Joint work with Xinshi Chen and Hanjun Dai
Bayesian Inference
Infer the posterior distribution of unknown parameter 𝒙 given
• Prior distribution 𝜋(𝑥)
• Likelihood function 𝑝(𝑜|𝑥)
• Observations 𝑜1, 𝑜2, … , 𝑜𝑚
p(x | o_{1:m}) = (1/Z) π(x) ∏_{i=1}^{m} p(o_i | x)
Z = ∫ π(x) ∏_{i=1}^{m} p(o_i | x) dx
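As a quick illustration of the formula above, the unnormalized posterior and its normalizer can be evaluated on a grid in one dimension. This is my own toy sketch (the Gaussian prior/likelihood and all names are my choices, not from the slides):

```python
import numpy as np

# Toy 1-D model: prior pi(x) = N(0,1), likelihood p(o|x) = N(x,1).
grid = np.linspace(-6, 6, 4001)
dx = grid[1] - grid[0]
obs = [0.8, 1.2, 1.0]

log_unnorm = -0.5 * grid**2                     # log pi(x), up to a constant
for o in obs:
    log_unnorm += -0.5 * (o - grid)**2          # add each log p(o_i | x)

unnorm = np.exp(log_unnorm - log_unnorm.max())  # stabilized exponentiation
Z = unnorm.sum() * dx                           # normalizer of the stabilized density
post = unnorm / Z                               # normalized posterior on the grid

mean = (grid * post).sum() * dx
print(mean)  # conjugacy gives posterior mean sum(obs)/(m+1) = 3.0/4 = 0.75
```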
Challenging computational problem for high dimensional 𝑥
[Graphical model: latent parameter x generating observations o_1, o_2, …, o_m]
Gaussian Mixture Model
• prior (x_1, x_2) ∼ π(x) = 𝒩(0, I)
• observations o | x_1, x_2 ∼ p(o | x_1, x_2) = (1/2) 𝒩(x_1, 1) + (1/2) 𝒩(x_1 + x_2, 1)
• With (x_1, x_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2)
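The two-mode claim can be checked numerically: the mixture likelihood is invariant under swapping the two component means, i.e. under (x_1, x_2) ↦ (x_1 + x_2, −x_2), which maps (1, −2) to (−1, 2). A small sketch (model re-implemented by me, names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_unnorm_post(x1, x2, obs):
    lp = -0.5 * (x1**2 + x2**2)                  # log N(0, I) prior
    comp1 = np.exp(-0.5 * (obs - x1)**2)         # N(o; x1, 1), up to a constant
    comp2 = np.exp(-0.5 * (obs - (x1 + x2))**2)  # N(o; x1 + x2, 1)
    return lp + np.log(0.5 * comp1 + 0.5 * comp2).sum()

# Generate observations from the true parameters (1, -2).
z = rng.random(200) < 0.5
obs = np.where(z, rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200))

# Swapping the component means maps (1, -2) to (-1, 2), and the prior density is
# equal at these two points, so both modes carry exactly equal posterior density.
a = log_unnorm_post(1.0, -2.0, obs)
b = log_unnorm_post(-1.0, 2.0, obs)
print(a - b)  # 0.0 up to floating point
```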
Challenges in Bayesian Inference
Fitting even a single posterior p(x | o_{1:m}) is already hard.
[Results reported by Dai et al. (2016)]
[Figure: posterior approximations for the Gaussian mixture model]
(a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC
Fundamental Principle for Machine Learning
Lots of applications in machine learning
• Hidden Markov model
• Topic modeling
• Uncertainty quantification
• o_{t+1} = P o_{t−τ} exp(−y_{t−τ}/y_0) e_t + o_t exp(−δ ε_t)
• e_t ∼ Γ(σ_p^{−2}, σ_p^2), ε_t ∼ Γ(σ_d^{−2}, σ_d^2)
• parameters x = (P, y_0, σ_p^2, σ_d^2, τ, δ)
[Graphical models: topic model plate diagram (topic z, word o, parameters θ, α, plates M, N); hidden Markov model with true locations x_1, x_2, …, x_m and sensor measurements o_1, o_2, …, o_m]
Sequential Bayesian Inference
[Diagram: prior π(x) is updated by o_1, o_2, …, o_m through the posteriors p(x|o_1), p(x|o_{1:2}), …, p(x|o_{1:m})]
p(x | o_{1:m}) ∝ p(x | o_{1:m−1}) · p(o_m | x)
(updated posterior) ∝ (current posterior) × (likelihood)
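The recursive update can be sketched on a grid: only the current posterior is stored, and each new observation multiplies in a single likelihood term. A 1-D toy example (model and names are my choices):

```python
import numpy as np

grid = np.linspace(-6, 6, 4001)
dx = grid[1] - grid[0]
post = np.exp(-0.5 * grid**2)                   # start from the prior N(0,1)
post /= post.sum() * dx

def update(post, o):
    """One Bayes step: p(x|o_{1:m-1}) -> p(x|o_{1:m}), with p(o|x) = N(x,1)."""
    new = post * np.exp(-0.5 * (o - grid)**2)
    return new / (new.sum() * dx)

for o in [-0.4, 0.2, 1.4]:                      # observations arrive one at a time
    post = update(post, o)                       # past observations never stored

mean = (grid * post).sum() * dx
print(mean)  # conjugate answer: sum(obs)/(m+1) = 1.2/4 = 0.3
```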
Online Bayesian Inference
• Observations 𝑜1, 𝑜2, … , 𝑜𝑚 arrive sequentially
An ideal algorithm should:
• Efficiently update p(x | o_{1:m}) to p(x | o_{1:m+1}) when o_{m+1} is observed
• Without storing all historical observations o_1, o_2, …, o_m
Related Work
• MCMC
– requires a complete scan of the data
• Variational Inference (VI)
– requires re-optimization for every new observation
• Stochastic approximate inference
– prescribed algorithms that optimize only the final posterior p(x | o_{1:M})
– cannot exploit the structure of the sequential inference problem
Related Work
Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
– the state of the art for online Bayesian inference
– but suffers from the path degeneracy problem in high dimensions
– rejuvenation steps can help, but they violate the online constraints (Canini et al., 2009)
Can we learn to perform efficient and effective sequential Bayesian update?
Operator View
Kernel Bayes’ Rule (Fukumizu et al., 2012)
– the posterior is represented as an embedding μ_m = 𝔼_{p(x|o_{1:m})}[φ(x)]
– update operator: μ_{m+1} = 𝒦(μ_m, o_{m+1}) maps the current embedding to the updated embedding
– views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS)
– conceptually appealing, but limited in practice
Our Approach: Bayesian Inference as Particle Flow
𝒳_0 = {x_0^1, …, x_0^N}, 𝒳_0 ∼ π(x)
x_1^n = x_0^n + ∫_0^T f(𝒳_0, o_1, x(t)) dt
𝒳_1 = {x_1^1, …, x_1^N}, 𝒳_1 ∼ p(x|o_1)
Particle Flow
• Start with N particles 𝒳_0 = {x_0^1, …, x_0^N}, sampled i.i.d. from the prior π(x)
• Transport particles to the next posterior via the solution of an initial value problem (IVP):
dx/dt = f(𝒳_0, o_1, x(t)), ∀t ∈ [0, T], with x(0) = x_0^n
⟹ solution x_1^n = x(T)
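The IVP transport can be sketched with forward Euler. In MPF the velocity f is learned; here I substitute a hand-constructed closed-form velocity for a 1-D conjugate Gaussian case, purely to show the mechanics (the linear schedules and all names are mine):

```python
import numpy as np

# Transporting particles from the prior to the posterior with an ODE.
# prior N(0,1), likelihood o|x ~ N(x,1), o = 2  =>  posterior N(1, 1/2).
rng = np.random.default_rng(1)
N, T, steps = 5000, 1.0, 100
o = 2.0
mu1, s1 = o / 2.0, np.sqrt(0.5)          # posterior mean and std

def f(x, t):
    # Velocity realising the path x(t) = mu(t) + s(t) * x0 with
    # mu(t) = t*mu1 and s(t) = 1 + t*(s1 - 1)  (linear schedules, my choice).
    mu, s = t * mu1, 1 + t * (s1 - 1)
    return mu1 + (s1 - 1) * (x - mu) / s

x = rng.normal(0.0, 1.0, N)              # particles sampled from the prior
dt = T / steps
for k in range(steps):                   # forward Euler on dx/dt = f(x, t)
    x = x + dt * f(x, k * dt)

print(x.mean(), x.std())                 # should approach (1.0, 0.707...)
```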
Flow Property
• Continuity Equation expresses the law of local conservation of mass:
– Mass can neither be created nor destroyed
– nor can it ‘teleport’ from one place to another
∂q(x, t)/∂t = −∇_x · (q f)
• Theorem. If dx/dt = f, then the change in log-density follows the differential equation
d log q(x, t)/dt = −∇_x · f
• Notation
– dq/dt is the material derivative: the rate of change of q for a given particle as it moves along its trajectory x = x(t)
– ∂q/∂t is the partial derivative: the rate of change of q at a fixed point x
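The theorem can be sanity-checked numerically on the linear flow f(x) = −x, where q(x, t) is available in closed form (my own toy check, not from the slides):

```python
import numpy as np

# For f(x) = -x with q(x,0) = N(0,1): x(t) = x0*exp(-t) and q(x,t) = N(0, exp(-2t)),
# so along a trajectory d log q / dt should equal -div f = 1.
def log_q(x, t):
    var = np.exp(-2.0 * t)
    return -0.5 * x**2 / var - 0.5 * np.log(2 * np.pi * var)

x0, t, dt = 0.7, 0.3, 1e-5
x_t  = x0 * np.exp(-t)               # particle position at time t
x_t2 = x0 * np.exp(-(t + dt))        # and a moment later, along the trajectory
rate = (log_q(x_t2, t + dt) - log_q(x_t, t)) / dt
print(rate)                          # ~= 1.0, i.e. -div f, since div f = -1 here
```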
Particle Flow for Sequential Bayesian Inference
x_{m+1}^n = x_m^n + ∫_0^T f(𝒳_m, o_{m+1}, x(t)) dt
−log p_{m+1}^n = −log p_m^n + ∫_0^T ∇_x · f(𝒳_m, o_{m+1}, x(t)) dt
𝒳_0 = {x_0^1, …, x_0^N} → 𝒳_1 = {x_1^1, …, x_1^N} → 𝒳_2 = {x_2^1, …, x_2^N} → …
applying ∫_0^T f(𝒳_0, o_1, x(t)) dt, then ∫_0^T f(𝒳_1, o_2, x(t)) dt, then ∫_0^T f(𝒳_2, o_3, x(t)) dt, …
• Other ODE-based approaches (e.g., the Neural ODE of Chen et al., 2018) are not designed for the sequential setting.
Does a Shared Flow Velocity f Exist?
A simple Gaussian Example
• Prior π(x) = 𝒩(0, σ_0), likelihood p(o|x) = 𝒩(x, σ), observation o = 0
• ⟹ posterior p(x | o = 0) = 𝒩(0, σσ_0/(σ + σ_0))
• Does a shared f exist for priors with different σ_0? If so, what form does it take?
– E.g., an f of the form f(o, x(t)) cannot handle different σ_0.
x(T) = x(0) + ∫_0^T f(inputs) dt, with x(0) ∼ π(x) and x(T) ∼ p(x|o_1)
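The closed-form posterior above is easy to verify against a grid computation (sketch mine; σ_0 and σ denote variances, as on the slide):

```python
import numpy as np

# Check: prior N(0, s0), likelihood N(x, s), o = 0  =>  posterior N(0, s*s0/(s+s0)).
def posterior_var(s0, s, o=0.0, half=10.0, n=200001):
    x = np.linspace(-half, half, n)
    dx = x[1] - x[0]
    log_p = -0.5 * x**2 / s0 - 0.5 * (o - x)**2 / s   # log prior + log likelihood
    p = np.exp(log_p - log_p.max())
    p /= p.sum() * dx                                 # normalize on the grid
    m = (x * p).sum() * dx
    return ((x - m)**2 * p).sum() * dx                # posterior variance

for s0, s in [(1.0, 1.0), (4.0, 0.5), (0.3, 2.0)]:
    print(posterior_var(s0, s), s * s0 / (s + s0))    # numeric vs. closed form
```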
Does a shared flow velocity 𝑓 exist for different Bayesian inference tasks involving different priors and different observations?
Existence: Connection to Stochastic Flow
• Langevin dynamics is a stochastic process
dx(t) = ∇_x log[π(x) p(o|x)] dt + √2 dw(t),
where dw(t) is a standard Brownian motion.
• Property. If the potential function Ψ(x) ≔ −log[π(x) p(o|x)] is smooth and e^{−Ψ} ∈ L^1(ℝ^d), the Fokker–Planck equation has a unique stationary solution in the form of a Gibbs distribution:
q(x, ∞) = e^{−Ψ}/Z = π(x) p(o|x)/Z = p(x|o)
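A minimal sketch of these dynamics, using an Euler–Maruyama discretization on a 1-D conjugate Gaussian target (step size, model, and names are my choices):

```python
import numpy as np

# Langevin dynamics for: prior N(0,1), o|x ~ N(x,1), o = 2.
# The stationary density is the posterior p(x|o) = N(1, 1/2).
rng = np.random.default_rng(2)
o, dt, n_steps = 2.0, 1e-3, 5000
x = rng.normal(0.0, 1.0, 4000)                     # start from the prior

def grad_log_target(x):
    return -x - (x - o)                            # d/dx log[pi(x) p(o|x)]

for _ in range(n_steps):                           # Euler–Maruyama discretisation
    x = x + dt * grad_log_target(x) + np.sqrt(2 * dt) * rng.normal(size=x.size)

print(x.mean(), x.var())                           # ~ (1.0, 0.5)
```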
Existence: Connection to Stochastic Flow
• The probability density q(x, t) of x(t) follows a deterministic evolution according to the Fokker–Planck equation
∂q/∂t = −∇_x · (q ∇_x log[π(x) p(o|x)]) + Δ_x q(x, t)
  = −∇_x · (q (∇_x log[π(x) p(o|x)] − ∇_x log q(x, t)))
which is in the form of the Continuity Equation; the bracketed term plays the role of f.
• Theorem. When the deterministic transformation of the random variable x(t) follows
dx/dt = ∇_x log[π(x) p(o|x)] − ∇_x log q(x, t),
its probability density q(x, t) converges to the posterior p(x|o) as t → ∞.
Existence: Closed-Loop to Open-Loop Conversion
Closed loop to open loop
• The Fokker–Planck equation leads to a closed-loop flow, which depends not only on π(x) and p(o|x) but also on the flow state q(x, t).
• Is there an equivalent open-loop form, independent of q(x, t), that achieves the same flow?
Optimization problem
min_w d(q(x, ∞), p(x|o))  s.t.  dx/dt = ∇_x log[π(x) p(o|x)] − w
• Positive answer: there exists a fixed and deterministic flow velocity f of the form
dx/dt = ∇_x log[π(x) p(o|x)] − w*(π(x), p(o|x), x, t)
Parameterization
• π(x) ⟹ 𝒳
– use samples 𝒳 as surrogates, via a feature-space embedding
– ideally, μ_𝒳 is an injective mapping from the space of probability measures, so the embedding loses no information
• p(o|x) ⟹ (o, x(t)) for a fixed likelihood function
• With two neural networks φ and h, the overall parameterization:
μ_𝒳(p) ≔ ∫_𝒳 φ(x) p(x) dx ≈ (1/N) Σ_{n=1}^N φ(x^n), x^n ∼ π
f(𝒳, o, x(t), t) = h((1/N) Σ_{n=1}^N φ(x^n), o, x(t), t)
dx/dt = ∇_x log[π(x) p(o|x)] − w*(π(x), p(o|x), x, t)
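A shape-level sketch of this parameterization, with φ and h as tiny fixed random-weight MLPs in NumPy (the paper trains real networks; all dimensions and names here are mine). Note that the mean embedding makes f invariant to permutations of the particle set:

```python
import numpy as np

# f(X, o, x(t), t) = h( mean_n phi(x^n), o, x(t), t )
rng = np.random.default_rng(3)
d, d_emb, d_hid = 2, 16, 32                     # dimensions are my choice

W_phi = rng.normal(0, 0.5, (d, d_emb))          # phi(x) = tanh(x @ W_phi)
W1 = rng.normal(0, 0.5, (d_emb + d + d + 1, d_hid))
W2 = rng.normal(0, 0.5, (d_hid, d))             # h outputs a velocity in R^d

def f(particles, o, x_t, t):
    emb = np.tanh(particles @ W_phi).mean(axis=0)   # mean embedding of X
    inp = np.concatenate([emb, o, x_t, [t]])        # (embedding, o, x(t), t)
    return np.tanh(inp @ W1) @ W2                   # two-layer h

X = rng.normal(size=(100, d))                   # N = 100 particles
v = f(X, np.array([0.5, -0.5]), X[0], 0.1)
print(v.shape)                                  # (2,): a velocity in R^d

# Permutation invariance: shuffling the particle set leaves f unchanged.
v2 = f(X[::-1], np.array([0.5, -0.5]), X[0], 0.1)
print(np.allclose(v, v2))
```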
Multi-task Framework
• Training set 𝒟_train for meta learning
– contains multiple inference tasks, with diverse priors and observations
• Each task 𝒯 ∈ 𝒟_train consists of a prior, a likelihood, and a sequence of observations
Loss Function
• Minimize negative evidence lower bound (ELBO) for each task 𝒯
ℒ(𝒯) = Σ_{m=1}^M KL(q_m(x) ‖ p(x|o_{1:m})) = Σ_{m=1}^M Σ_{n=1}^N [log q_m^n − log p(x_m^n, o_{1:m})] + const.
Meta Learning
𝒯 ≔ (π(x), p(·|x), (o_1, o_2, …, o_M))   (prior, likelihood, M observations)
x_{m+1}^n = x_m^n + ∫_0^T f(𝒳_m, o_{m+1}, x(t), t) dt
−log q_{m+1}^n = −log q_m^n + ∫_0^T ∇_x · f(𝒳_m, o_{m+1}, x(t), t) dt
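One sequential step, with the log-density tracked alongside the particle so it can enter the loss above, can be sketched as follows. The learned f is again replaced by a closed-form 1-D velocity so the tracked log q can be checked against the exact posterior density (toy setup mine):

```python
import numpy as np

# Toy flow: prior N(0,1), observation o = 2 with N(x,1) likelihood,
# so the target posterior is N(1, 1/2).
mu1, s1 = 1.0, np.sqrt(0.5)
steps = 2000
dt = 1.0 / steps

def f(x, t):
    mu, s = t * mu1, 1 + t * (s1 - 1)
    return mu1 + (s1 - 1) * (x - mu) / s      # div_x f = (s1 - 1) / s

x = 0.7                                        # one particle, x ~ q_0 = N(0,1)
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # log q_0(x)
for k in range(steps):
    t = k * dt
    s = 1 + t * (s1 - 1)
    log_q -= dt * (s1 - 1) / s                 # d log q / dt = -div_x f
    x += dt * f(x, t)                          # transport the particle

# Compare against the exact posterior density N(mu1, s1^2) at the final x.
exact = -0.5 * (x - mu1)**2 / s1**2 - 0.5 * np.log(2 * np.pi * s1**2)
print(log_q, exact)
```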
Experiments: Benefit in High Dimensions
Multivariate Gaussian Model
• prior x ∼ 𝒩(μ_x, Σ_x)
• observations conditioned on x: o|x ∼ 𝒩(x, Σ_o)
Experiment Setting
• The training set contains only sequences of 10 observations, but a diverse set of prior distributions.
• The testing set contains 25 different sequences of 100 observations.
Result
• As the dimension of the model increases, our MPF operator shows a larger advantage.
Gaussian Mixture Model
• prior (x_1, x_2) ∼ 𝒩(0, 1)
• observations o | x_1, x_2 ∼ (1/2) 𝒩(x_1, 1) + (1/2) 𝒩(x_1 + x_2, 1)
• With (x_1, x_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2)
Experiments: Benefit for Multimodal Posterior
Our more challenging experimental setting:
• The learned MPF operator is tested on sequences that were not observed in the training set.
• It needs to fit all intermediate posteriors p(x|o_1), p(x|o_{1:2}), …, p(x|o_{1:m}).
[Figure: visualization of the evolution of the posterior density, from left to right]
Experiments: Hidden Markov Model – Linear Dynamical System
[Graphical model: hidden states x_1, x_2, …, x_m with observations o_1, o_2, …, o_m]
• x_m = A x_{m−1} + ε_m, ε_m ∼ 𝒩(0, Σ_1)
• o_m = B x_m + δ_m, δ_m ∼ 𝒩(0, Σ_2)
Transition sampling + MPF operator:
1. x_m^n = A x_{m−1}^n + ε_m
2. x_m^n = ℱ(𝒳_m, x_m^n, o_m), a Bayesian update from p(x_m|o_{1:m−1}) to p(x_m|o_{1:m})
Marginal posterior updates: p(x_1|o_1) → p(x_2|o_{1:2}) → … → p(x_m|o_{1:m})
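The two-step filter can be sketched for a 1-D linear-Gaussian model, where the exact Gaussian transport is available in closed form and can stand in for the learned operator ℱ; the particle marginals then match a Kalman filter (all model constants are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, S1, S2 = 0.9, 1.0, 0.1, 0.2             # scalar 1-D linear dynamical system
N, M = 20000, 10
x_true = 0.0
obs = []
for _ in range(M):                             # simulate a trajectory
    x_true = A * x_true + rng.normal(0, np.sqrt(S1))
    obs.append(B * x_true + rng.normal(0, np.sqrt(S2)))

particles = rng.normal(0, 1, N)               # particles from the prior N(0,1)
m, P = 0.0, 1.0                               # exact Kalman filter for reference
for o in obs:
    # 1. transition sampling
    particles = A * particles + rng.normal(0, np.sqrt(S1), N)
    m_pred, P_pred = A * m, A * A * P + S1
    # 2. "MPF" update: exact Gaussian transport to p(x_m | o_{1:m}),
    #    standing in for the learned operator F.
    K = P_pred * B / (B * B * P_pred + S2)
    m, P = m_pred + K * (o - B * m_pred), (1 - K * B) * P_pred
    particles = m + np.sqrt(P / P_pred) * (particles - m_pred)

print(particles.mean() - m, particles.var() - P)   # both near 0
```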
Experiments: Generalization across Multitasks
Bayesian Logistic Regression on the MNIST dataset, 8 vs 6
• About 1.6M training images and 1932 testing images
• Each data point o_m ≔ (φ_m, c_m)  (feature, label)
• Logistic regression: y = σ(θ^⊤ φ_m)
• Likelihood function: p(o_m|θ) = y^{c_m} (1 − y)^{1−c_m}
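The Bernoulli likelihood above, evaluated for a whole set of particles at once, looks like this (shapes and names mine):

```python
import numpy as np

def log_likelihood(theta, phi, c):
    """theta: (N, d) particles; phi: (d,) feature vector; c: 0/1 label."""
    y = 1.0 / (1.0 + np.exp(-theta @ phi))     # sigmoid, one y per particle
    return c * np.log(y) + (1 - c) * np.log1p(-y)

rng = np.random.default_rng(5)
theta = rng.normal(size=(8, 3))                # 8 particles in R^3
phi = np.array([0.5, -1.0, 2.0])
ll1 = log_likelihood(theta, phi, 1)            # log p(o | theta) for label 1
ll0 = log_likelihood(theta, phi, 0)            # and for label 0
print(np.allclose(np.exp(ll1) + np.exp(ll0), 1.0))   # y + (1 - y) = 1
```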
Multi-task Environment
• We reduce the dimension of the images to 50 by PCA
• We rotate the first two principal components by an angle ψ ∼ [−15°, 15°]
– The first two components account for the most variability in the data
• Different rotation angles ψ ⟹ different decision boundaries ⟹ different tasks
Multi-task Training
• The MPF operator is learned from multiple training tasks with different ψ
• The learned MPF is then used as an online-learning algorithm during testing
Experiments: Generalization across Multitasks
Testing as online learning:
(1) All algorithms start with a set of particles sampled from the prior;
(2) Each algorithm makes predictions on the encountered batch of 32 images;
(3) All algorithms then observe the true labels of that batch;
(4) Each algorithm updates its particles and then makes predictions on the next batch;
(predict → observe true labels → update particles → predict → observe true labels → update → …)
[Figure: test accuracies r_1 and r_2]
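The prequential loop can be sketched end to end. The particle update below is a plain reweight/resample stand-in, not the learned MPF operator, and the toy task, batch size, and dimensions are my choices:

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, batch, n_batches = 5, 2000, 32, 30
theta_true = 2.0 * rng.normal(size=d)          # ground-truth weights (toy task)
particles = rng.normal(size=(N, d))            # prior N(0, I)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

accs = []
for b in range(n_batches):
    Phi = rng.normal(size=(batch, d))                  # incoming feature batch
    c = (rng.random(batch) < sigmoid(Phi @ theta_true)).astype(int)
    # (1) predict on the incoming batch with the current particles
    y = sigmoid(particles @ Phi.T).mean(axis=0)        # posterior-mean probs
    accs.append(((y > 0.5).astype(int) == c).mean())
    # (2) observe labels, (3) update particles (reweight + resample stand-in)
    y_all = sigmoid(particles @ Phi.T)                 # (N, batch)
    log_w = (c * np.log(y_all + 1e-12)
             + (1 - c) * np.log(1 - y_all + 1e-12)).sum(axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    particles = particles[rng.choice(N, N, p=w)]

print(np.mean(accs[:5]), np.mean(accs[-5:]))   # early vs. late prequential accuracy
```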
Conclusion and Future Work
• An ODE-based Bayesian operator for sequential Bayesian inference
– Existence
– Parameterization
– Meta-learning framework
• Future work
– Architecture
– Stable flow
– Improved training
𝒳_0 = {x_0^1, …, x_0^N}, 𝒳_0 ∼ π(x)
x_1^n = x_0^n + ∫_0^T f(𝒳_0, o_1, x(t), t) dt
𝒳_1 = {x_1^1, …, x_1^N}, 𝒳_1 ∼ p(x|o_1)