TRANSCRIPT
Meta Particle Flow for Sequential Bayesian Inference
Le Song
Associate Professor, CSE
Associate Director, Machine Learning Center
Georgia Institute of Technology
Joint work with Xinshi Chen and Hanjun Dai
Bayesian Inference
Infer the posterior distribution of unknown parameter 𝒙 given
• Prior distribution 𝜋(𝑥)
• Likelihood function 𝑝(𝑜|𝑥)
• Observations 𝑜1, 𝑜2, … , 𝑜𝑚
p(x | o_{1:m}) = (1/Z) π(x) ∏_{i=1}^{m} p(o_i | x)
Z = ∫ π(x) ∏_{i=1}^{m} p(o_i | x) dx
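As a quick illustration of the formula above, the unnormalized posterior and its normalizer can be evaluated on a grid in one dimension. This is my own toy sketch (the Gaussian prior/likelihood and all names are my choices, not from the slides):

```python
import numpy as np

# Toy 1-D model: prior pi(x) = N(0,1), likelihood p(o|x) = N(x,1).
grid = np.linspace(-6, 6, 4001)
dx = grid[1] - grid[0]
obs = [0.8, 1.2, 1.0]

log_unnorm = -0.5 * grid**2                     # log pi(x), up to a constant
for o in obs:
    log_unnorm += -0.5 * (o - grid)**2          # add each log p(o_i | x)

unnorm = np.exp(log_unnorm - log_unnorm.max())  # stabilized exponentiation
Z = unnorm.sum() * dx                           # normalizer of the stabilized density
post = unnorm / Z                               # normalized posterior on the grid

mean = (grid * post).sum() * dx
print(mean)  # conjugacy gives posterior mean sum(obs)/(m+1) = 3.0/4 = 0.75
```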
Challenging computational problem for high dimensional 𝑥
[Graphical model: latent parameter x generating observations o_1, o_2, …, o_m]
Gaussian Mixture Model
• prior (x_1, x_2) ∼ π(x) = 𝒩(0, I)
• observations o | x_1, x_2 ∼ p(o | x_1, x_2) = (1/2) 𝒩(x_1, 1) + (1/2) 𝒩(x_1 + x_2, 1)
• With (x_1, x_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2)
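The two-mode claim can be checked numerically: the mixture likelihood is invariant under swapping the two component means, i.e. under (x_1, x_2) ↦ (x_1 + x_2, −x_2), which maps (1, −2) to (−1, 2). A small sketch (model re-implemented by me, names mine):

```python
import numpy as np

rng = np.random.default_rng(0)

def log_unnorm_post(x1, x2, obs):
    lp = -0.5 * (x1**2 + x2**2)                  # log N(0, I) prior
    comp1 = np.exp(-0.5 * (obs - x1)**2)         # N(o; x1, 1), up to a constant
    comp2 = np.exp(-0.5 * (obs - (x1 + x2))**2)  # N(o; x1 + x2, 1)
    return lp + np.log(0.5 * comp1 + 0.5 * comp2).sum()

# Generate observations from the true parameters (1, -2).
z = rng.random(200) < 0.5
obs = np.where(z, rng.normal(1.0, 1.0, 200), rng.normal(-1.0, 1.0, 200))

# Swapping the component means maps (1, -2) to (-1, 2), and the prior density is
# equal at these two points, so both modes carry exactly equal posterior density.
a = log_unnorm_post(1.0, -2.0, obs)
b = log_unnorm_post(-1.0, 2.0, obs)
print(a - b)  # 0.0 up to floating point
```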
Challenges in Bayesian Inference
Fitting even a single posterior p(x | o_{1:m}) is already hard.
[Results reported by Dai et al. (2016)]
[Figure: posterior approximations for the Gaussian mixture model]
(a) True posterior (b) Stochastic Variational Inference (c) Stochastic Gradient Langevin Dynamics (d) Gibbs Sampling (e) One-pass SMC
Fundamental Principle for Machine Learning
Lots of applications in machine learning
• Hidden Markov model
• Topic modeling
• Uncertainty quantification
• o_{t+1} = P o_{t−τ} exp(−y_{t−τ}/y_0) e_t + o_t exp(−δ ε_t)
• e_t ∼ Γ(σ_p^{−2}, σ_p^2), ε_t ∼ Γ(σ_d^{−2}, σ_d^2)
• parameters x = (P, y_0, σ_p^2, σ_d^2, τ, δ)
[Graphical models: topic model plate diagram (topic z, word o, parameters θ, α, plates M, N); hidden Markov model with true locations x_1, x_2, …, x_m and sensor measurements o_1, o_2, …, o_m]
Sequential Bayesian Inference
[Diagram: prior π(x) is updated by o_1, o_2, …, o_m through the posteriors p(x|o_1), p(x|o_{1:2}), …, p(x|o_{1:m})]
p(x | o_{1:m}) ∝ p(x | o_{1:m−1}) · p(o_m | x)
(updated posterior) ∝ (current posterior) × (likelihood)
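The recursive update can be sketched on a grid: only the current posterior is stored, and each new observation multiplies in a single likelihood term. A 1-D toy example (model and names are my choices):

```python
import numpy as np

grid = np.linspace(-6, 6, 4001)
dx = grid[1] - grid[0]
post = np.exp(-0.5 * grid**2)                   # start from the prior N(0,1)
post /= post.sum() * dx

def update(post, o):
    """One Bayes step: p(x|o_{1:m-1}) -> p(x|o_{1:m}), with p(o|x) = N(x,1)."""
    new = post * np.exp(-0.5 * (o - grid)**2)
    return new / (new.sum() * dx)

for o in [-0.4, 0.2, 1.4]:                      # observations arrive one at a time
    post = update(post, o)                       # past observations never stored

mean = (grid * post).sum() * dx
print(mean)  # conjugate answer: sum(obs)/(m+1) = 1.2/4 = 0.3
```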
Online Bayesian Inference
• Observations 𝑜1, 𝑜2, … , 𝑜𝑚 arrive sequentially
An ideal algorithm should:
• Efficiently update p(x | o_{1:m}) to p(x | o_{1:m+1}) when o_{m+1} is observed
• Without storing all historical observations o_1, o_2, …, o_m
Related Work
• MCMC
– requires a complete scan of the data
• Variational Inference (VI)
– requires re-optimization for every new observation
• Stochastic approximate inference
– prescribed algorithms that optimize only the final posterior p(x | o_{1:M})
– cannot exploit the structure of the sequential inference problem
Related Work
Sequential Monte Carlo (Doucet et al., 2001; Balakrishnan & Madigan, 2006)
– the state of the art for online Bayesian inference
– but suffers from the path degeneracy problem in high dimensions
– rejuvenation steps can help, but they violate the online constraints (Canini et al., 2009)
Can we learn to perform efficient and effective sequential Bayesian update?
Operator View
Kernel Bayes’ Rule (Fukumizu et al., 2012)
– the posterior is represented as an embedding μ_m = 𝔼_{p(x|o_{1:m})}[φ(x)]
– update operator: μ_{m+1} = 𝒦(μ_m, o_{m+1}) maps the current embedding to the updated embedding
– views the Bayes update as an operator in a reproducing kernel Hilbert space (RKHS)
– conceptually appealing, but limited in practice
Our Approach: Bayesian Inference as Particle Flow
𝒳_0 = {x_0^1, …, x_0^N}, 𝒳_0 ∼ π(x)
x_1^n = x_0^n + ∫_0^T f(𝒳_0, o_1, x(t)) dt
𝒳_1 = {x_1^1, …, x_1^N}, 𝒳_1 ∼ p(x|o_1)
Particle Flow
• Start with N particles 𝒳_0 = {x_0^1, …, x_0^N}, sampled i.i.d. from the prior π(x)
• Transport particles to the next posterior via the solution of an initial value problem (IVP):
dx/dt = f(𝒳_0, o_1, x(t)), ∀t ∈ [0, T], with x(0) = x_0^n
⟹ solution x_1^n = x(T)
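The IVP transport can be sketched with forward Euler. In MPF the velocity f is learned; here I substitute a hand-constructed closed-form velocity for a 1-D conjugate Gaussian case, purely to show the mechanics (the linear schedules and all names are mine):

```python
import numpy as np

# Transporting particles from the prior to the posterior with an ODE.
# prior N(0,1), likelihood o|x ~ N(x,1), o = 2  =>  posterior N(1, 1/2).
rng = np.random.default_rng(1)
N, T, steps = 5000, 1.0, 100
o = 2.0
mu1, s1 = o / 2.0, np.sqrt(0.5)          # posterior mean and std

def f(x, t):
    # Velocity realising the path x(t) = mu(t) + s(t) * x0 with
    # mu(t) = t*mu1 and s(t) = 1 + t*(s1 - 1)  (linear schedules, my choice).
    mu, s = t * mu1, 1 + t * (s1 - 1)
    return mu1 + (s1 - 1) * (x - mu) / s

x = rng.normal(0.0, 1.0, N)              # particles sampled from the prior
dt = T / steps
for k in range(steps):                   # forward Euler on dx/dt = f(x, t)
    x = x + dt * f(x, k * dt)

print(x.mean(), x.std())                 # should approach (1.0, 0.707...)
```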
Flow Property
• Continuity Equation expresses the law of local conservation of mass:
– Mass can neither be created nor destroyed
– nor can it ‘teleport’ from one place to another
∂q(x, t)/∂t = −∇_x · (q f)
• Theorem. If dx/dt = f, then the change in log-density follows the differential equation
d log q(x, t)/dt = −∇_x · f
• Notation
– dq/dt is the material derivative: the rate of change of q for a given particle as it moves along its trajectory x = x(t)
– ∂q/∂t is the partial derivative: the rate of change of q at a fixed point x
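The theorem can be sanity-checked numerically on the linear flow f(x) = −x, where q(x, t) is available in closed form (my own toy check, not from the slides):

```python
import numpy as np

# For f(x) = -x with q(x,0) = N(0,1): x(t) = x0*exp(-t) and q(x,t) = N(0, exp(-2t)),
# so along a trajectory d log q / dt should equal -div f = 1.
def log_q(x, t):
    var = np.exp(-2.0 * t)
    return -0.5 * x**2 / var - 0.5 * np.log(2 * np.pi * var)

x0, t, dt = 0.7, 0.3, 1e-5
x_t  = x0 * np.exp(-t)               # particle position at time t
x_t2 = x0 * np.exp(-(t + dt))        # and a moment later, along the trajectory
rate = (log_q(x_t2, t + dt) - log_q(x_t, t)) / dt
print(rate)                          # ~= 1.0, i.e. -div f, since div f = -1 here
```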
Particle Flow for Sequential Bayesian Inference
x_{m+1}^n = x_m^n + ∫_0^T f(𝒳_m, o_{m+1}, x(t)) dt
−log p_{m+1}^n = −log p_m^n + ∫_0^T ∇_x · f(𝒳_m, o_{m+1}, x(t)) dt
𝒳_0 = {x_0^1, …, x_0^N} → 𝒳_1 = {x_1^1, …, x_1^N} → 𝒳_2 = {x_2^1, …, x_2^N} → …
applying ∫_0^T f(𝒳_0, o_1, x(t)) dt, then ∫_0^T f(𝒳_1, o_2, x(t)) dt, then ∫_0^T f(𝒳_2, o_3, x(t)) dt, …
• Other ODE-based approaches (e.g., the Neural ODE of Chen et al., 2018) are not designed for the sequential setting.
Does a Shared Flow Velocity f Exist?
A simple Gaussian Example
• Prior π(x) = 𝒩(0, σ_0), likelihood p(o|x) = 𝒩(x, σ), observation o = 0
• ⟹ posterior p(x | o = 0) = 𝒩(0, σσ_0/(σ + σ_0))
• Does a shared f exist for priors with different σ_0? If so, what form does it take?
– E.g., an f of the form f(o, x(t)) cannot handle different σ_0.
x(T) = x(0) + ∫_0^T f(inputs) dt, with x(0) ∼ π(x) and x(T) ∼ p(x|o_1)
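The closed-form posterior above is easy to verify against a grid computation (sketch mine; σ_0 and σ denote variances, as on the slide):

```python
import numpy as np

# Check: prior N(0, s0), likelihood N(x, s), o = 0  =>  posterior N(0, s*s0/(s+s0)).
def posterior_var(s0, s, o=0.0, half=10.0, n=200001):
    x = np.linspace(-half, half, n)
    dx = x[1] - x[0]
    log_p = -0.5 * x**2 / s0 - 0.5 * (o - x)**2 / s   # log prior + log likelihood
    p = np.exp(log_p - log_p.max())
    p /= p.sum() * dx                                 # normalize on the grid
    m = (x * p).sum() * dx
    return ((x - m)**2 * p).sum() * dx                # posterior variance

for s0, s in [(1.0, 1.0), (4.0, 0.5), (0.3, 2.0)]:
    print(posterior_var(s0, s), s * s0 / (s + s0))    # numeric vs. closed form
```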
Does a shared flow velocity 𝑓 exist for different Bayesian inference tasks involving different priors and different observations?
Existence: Connection to Stochastic Flow
• Langevin dynamics is a stochastic process
dx(t) = ∇_x log[π(x) p(o|x)] dt + √2 dw(t),
where dw(t) is a standard Brownian motion.
• Property. If the potential function Ψ(x) ≔ −log[π(x) p(o|x)] is smooth and e^{−Ψ} ∈ L^1(ℝ^d), the Fokker–Planck equation has a unique stationary solution in the form of a Gibbs distribution:
q(x, ∞) = e^{−Ψ}/Z = π(x) p(o|x)/Z = p(x|o)
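A minimal sketch of these dynamics, using an Euler–Maruyama discretization on a 1-D conjugate Gaussian target (step size, model, and names are my choices):

```python
import numpy as np

# Langevin dynamics for: prior N(0,1), o|x ~ N(x,1), o = 2.
# The stationary density is the posterior p(x|o) = N(1, 1/2).
rng = np.random.default_rng(2)
o, dt, n_steps = 2.0, 1e-3, 5000
x = rng.normal(0.0, 1.0, 4000)                     # start from the prior

def grad_log_target(x):
    return -x - (x - o)                            # d/dx log[pi(x) p(o|x)]

for _ in range(n_steps):                           # Euler–Maruyama discretisation
    x = x + dt * grad_log_target(x) + np.sqrt(2 * dt) * rng.normal(size=x.size)

print(x.mean(), x.var())                           # ~ (1.0, 0.5)
```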
Existence: Connection to Stochastic Flow
• The probability density q(x, t) of x(t) follows a deterministic evolution according to the Fokker–Planck equation
∂q/∂t = −∇_x · (q ∇_x log[π(x) p(o|x)]) + Δ_x q(x, t)
  = −∇_x · (q (∇_x log[π(x) p(o|x)] − ∇_x log q(x, t)))
which is in the form of the Continuity Equation; the bracketed term plays the role of f.
• Theorem. When the deterministic transformation of the random variable x(t) follows
dx/dt = ∇_x log[π(x) p(o|x)] − ∇_x log q(x, t),
its probability density q(x, t) converges to the posterior p(x|o) as t → ∞.
Existence: Closed-Loop to Open-Loop Conversion
Closed loop to open loop
• The Fokker–Planck equation leads to a closed-loop flow, which depends not only on π(x) and p(o|x) but also on the flow state q(x, t).
• Is there an equivalent open-loop form, independent of q(x, t), that achieves the same flow?
Optimization problem
min_w d(q(x, ∞), p(x|o))  s.t.  dx/dt = ∇_x log[π(x) p(o|x)] − w
• Positive answer: there exists a fixed and deterministic flow velocity f of the form
dx/dt = ∇_x log[π(x) p(o|x)] − w*(π(x), p(o|x), x, t)
Parameterization
• π(x) ⟹ 𝒳
– use samples 𝒳 as surrogates, via a feature-space embedding
– ideally, μ_𝒳 is an injective mapping from the space of probability measures, so the embedding loses no information
• p(o|x) ⟹ (o, x(t)) for a fixed likelihood function
• With two neural networks φ and h, the overall parameterization:
μ_𝒳(p) ≔ ∫_𝒳 φ(x) p(x) dx ≈ (1/N) Σ_{n=1}^N φ(x^n), x^n ∼ π
f(𝒳, o, x(t), t) = h((1/N) Σ_{n=1}^N φ(x^n), o, x(t), t)
dx/dt = ∇_x log[π(x) p(o|x)] − w*(π(x), p(o|x), x, t)
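A shape-level sketch of this parameterization, with φ and h as tiny fixed random-weight MLPs in NumPy (the paper trains real networks; all dimensions and names here are mine). Note that the mean embedding makes f invariant to permutations of the particle set:

```python
import numpy as np

# f(X, o, x(t), t) = h( mean_n phi(x^n), o, x(t), t )
rng = np.random.default_rng(3)
d, d_emb, d_hid = 2, 16, 32                     # dimensions are my choice

W_phi = rng.normal(0, 0.5, (d, d_emb))          # phi(x) = tanh(x @ W_phi)
W1 = rng.normal(0, 0.5, (d_emb + d + d + 1, d_hid))
W2 = rng.normal(0, 0.5, (d_hid, d))             # h outputs a velocity in R^d

def f(particles, o, x_t, t):
    emb = np.tanh(particles @ W_phi).mean(axis=0)   # mean embedding of X
    inp = np.concatenate([emb, o, x_t, [t]])        # (embedding, o, x(t), t)
    return np.tanh(inp @ W1) @ W2                   # two-layer h

X = rng.normal(size=(100, d))                   # N = 100 particles
v = f(X, np.array([0.5, -0.5]), X[0], 0.1)
print(v.shape)                                  # (2,): a velocity in R^d

# Permutation invariance: shuffling the particle set leaves f unchanged.
v2 = f(X[::-1], np.array([0.5, -0.5]), X[0], 0.1)
print(np.allclose(v, v2))
```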
Multi-task Framework
• Training set 𝒟_train for meta learning
– contains multiple inference tasks, with diverse priors and observations
• Each task 𝒯 ∈ 𝒟_train consists of a prior, a likelihood, and a sequence of observations
Loss Function
• Minimize negative evidence lower bound (ELBO) for each task 𝒯
ℒ(𝒯) = Σ_{m=1}^M KL(q_m(x) ‖ p(x|o_{1:m})) = Σ_{m=1}^M Σ_{n=1}^N [log q_m^n − log p(x_m^n, o_{1:m})] + const.
Meta Learning
𝒯 ≔ (π(x), p(·|x), (o_1, o_2, …, o_M))   (prior, likelihood, M observations)
x_{m+1}^n = x_m^n + ∫_0^T f(𝒳_m, o_{m+1}, x(t), t) dt
−log q_{m+1}^n = −log q_m^n + ∫_0^T ∇_x · f(𝒳_m, o_{m+1}, x(t), t) dt
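One sequential step, with the log-density tracked alongside the particle so it can enter the loss above, can be sketched as follows. The learned f is again replaced by a closed-form 1-D velocity so the tracked log q can be checked against the exact posterior density (toy setup mine):

```python
import numpy as np

# Toy flow: prior N(0,1), observation o = 2 with N(x,1) likelihood,
# so the target posterior is N(1, 1/2).
mu1, s1 = 1.0, np.sqrt(0.5)
steps = 2000
dt = 1.0 / steps

def f(x, t):
    mu, s = t * mu1, 1 + t * (s1 - 1)
    return mu1 + (s1 - 1) * (x - mu) / s      # div_x f = (s1 - 1) / s

x = 0.7                                        # one particle, x ~ q_0 = N(0,1)
log_q = -0.5 * x**2 - 0.5 * np.log(2 * np.pi)  # log q_0(x)
for k in range(steps):
    t = k * dt
    s = 1 + t * (s1 - 1)
    log_q -= dt * (s1 - 1) / s                 # d log q / dt = -div_x f
    x += dt * f(x, t)                          # transport the particle

# Compare against the exact posterior density N(mu1, s1^2) at the final x.
exact = -0.5 * (x - mu1)**2 / s1**2 - 0.5 * np.log(2 * np.pi * s1**2)
print(log_q, exact)
```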
Experiments: Benefit in High Dimensions
Multivariate Gaussian Model
• prior x ∼ 𝒩(μ_x, Σ_x)
• observations conditioned on x: o|x ∼ 𝒩(x, Σ_o)
Experiment Setting
• The training set contains only sequences of 10 observations, but a diverse set of prior distributions.
• The testing set contains 25 different sequences of 100 observations.
Result
• As the dimension of the model increases, our MPF operator shows a larger advantage.
Gaussian Mixture Model
• prior (x_1, x_2) ∼ 𝒩(0, 1)
• observations o | x_1, x_2 ∼ (1/2) 𝒩(x_1, 1) + (1/2) 𝒩(x_1 + x_2, 1)
• With (x_1, x_2) = (1, −2), the resulting posterior has two modes: (1, −2) and (−1, 2)
Experiments: Benefit for Multimodal Posterior
Our more challenging experimental setting:
• The learned MPF operator is tested on sequences that were not observed in the training set.
• It needs to fit all intermediate posteriors p(x|o_1), p(x|o_{1:2}), …, p(x|o_{1:m}).
[Figure: visualization of the evolution of the posterior density, from left to right]
Experiments: Hidden Markov Model – Linear Dynamical System
[Graphical model: hidden states x_1, x_2, …, x_m with observations o_1, o_2, …, o_m]
• x_m = A x_{m−1} + ε_m, ε_m ∼ 𝒩(0, Σ_1)
• o_m = B x_m + δ_m, δ_m ∼ 𝒩(0, Σ_2)
Transition sampling + MPF operator:
1. x_m^n = A x_{m−1}^n + ε_m
2. x_m^n = ℱ(𝒳_m, x_m^n, o_m), a Bayesian update from p(x_m|o_{1:m−1}) to p(x_m|o_{1:m})
Marginal posterior updates: p(x_1|o_1) → p(x_2|o_{1:2}) → … → p(x_m|o_{1:m})
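The two-step filter can be sketched for a 1-D linear-Gaussian model, where the exact Gaussian transport is available in closed form and can stand in for the learned operator ℱ; the particle marginals then match a Kalman filter (all model constants are my choices):

```python
import numpy as np

rng = np.random.default_rng(4)
A, B, S1, S2 = 0.9, 1.0, 0.1, 0.2             # scalar 1-D linear dynamical system
N, M = 20000, 10
x_true = 0.0
obs = []
for _ in range(M):                             # simulate a trajectory
    x_true = A * x_true + rng.normal(0, np.sqrt(S1))
    obs.append(B * x_true + rng.normal(0, np.sqrt(S2)))

particles = rng.normal(0, 1, N)               # particles from the prior N(0,1)
m, P = 0.0, 1.0                               # exact Kalman filter for reference
for o in obs:
    # 1. transition sampling
    particles = A * particles + rng.normal(0, np.sqrt(S1), N)
    m_pred, P_pred = A * m, A * A * P + S1
    # 2. "MPF" update: exact Gaussian transport to p(x_m | o_{1:m}),
    #    standing in for the learned operator F.
    K = P_pred * B / (B * B * P_pred + S2)
    m, P = m_pred + K * (o - B * m_pred), (1 - K * B) * P_pred
    particles = m + np.sqrt(P / P_pred) * (particles - m_pred)

print(particles.mean() - m, particles.var() - P)   # both near 0
```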
Experiments: Generalization across Multitasks
Bayesian Logistic Regression on the MNIST dataset, 8 vs 6
• About 1.6M training images and 1932 testing images
• Each data point o_m ≔ (φ_m, c_m)  (feature, label)
• Logistic regression: y = σ(θ^⊤ φ_m)
• Likelihood function: p(o_m|θ) = y^{c_m} (1 − y)^{1−c_m}
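The Bernoulli likelihood above, evaluated for a whole set of particles at once, looks like this (shapes and names mine):

```python
import numpy as np

def log_likelihood(theta, phi, c):
    """theta: (N, d) particles; phi: (d,) feature vector; c: 0/1 label."""
    y = 1.0 / (1.0 + np.exp(-theta @ phi))     # sigmoid, one y per particle
    return c * np.log(y) + (1 - c) * np.log1p(-y)

rng = np.random.default_rng(5)
theta = rng.normal(size=(8, 3))                # 8 particles in R^3
phi = np.array([0.5, -1.0, 2.0])
ll1 = log_likelihood(theta, phi, 1)            # log p(o | theta) for label 1
ll0 = log_likelihood(theta, phi, 0)            # and for label 0
print(np.allclose(np.exp(ll1) + np.exp(ll0), 1.0))   # y + (1 - y) = 1
```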
Multi-task Environment
• We reduce the dimension of the images to 50 by PCA
• We rotate the first two principal components by an angle ψ ∼ [−15°, 15°]
– The first two components account for the most variability in the data
• Different rotation angles ψ ⟹ different decision boundaries ⟹ different tasks
Multi-task Training
• The MPF operator is learned from multiple training tasks with different ψ
• The learned MPF is then used as an online-learning algorithm during testing
Experiments: Generalization across Multitasks
Testing as online learning:
(1) All algorithms start with a set of particles sampled from the prior;
(2) Each algorithm makes predictions on the encountered batch of 32 images;
(3) All algorithms then observe the true labels of that batch;
(4) Each algorithm updates its particles and then makes predictions on the next batch;
(predict → observe true labels → update particles → predict → observe true labels → update → …)
[Figure: test accuracies r_1 and r_2]
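The prequential loop can be sketched end to end. The particle update below is a plain reweight/resample stand-in, not the learned MPF operator, and the toy task, batch size, and dimensions are my choices:

```python
import numpy as np

rng = np.random.default_rng(6)
d, N, batch, n_batches = 5, 2000, 32, 30
theta_true = 2.0 * rng.normal(size=d)          # ground-truth weights (toy task)
particles = rng.normal(size=(N, d))            # prior N(0, I)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

accs = []
for b in range(n_batches):
    Phi = rng.normal(size=(batch, d))                  # incoming feature batch
    c = (rng.random(batch) < sigmoid(Phi @ theta_true)).astype(int)
    # (1) predict on the incoming batch with the current particles
    y = sigmoid(particles @ Phi.T).mean(axis=0)        # posterior-mean probs
    accs.append(((y > 0.5).astype(int) == c).mean())
    # (2) observe labels, (3) update particles (reweight + resample stand-in)
    y_all = sigmoid(particles @ Phi.T)                 # (N, batch)
    log_w = (c * np.log(y_all + 1e-12)
             + (1 - c) * np.log(1 - y_all + 1e-12)).sum(axis=1)
    w = np.exp(log_w - log_w.max())
    w /= w.sum()
    particles = particles[rng.choice(N, N, p=w)]

print(np.mean(accs[:5]), np.mean(accs[-5:]))   # early vs. late prequential accuracy
```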
Conclusion and Future Work
• An ODE-based Bayesian operator for sequential Bayesian inference
– Existence
– Parameterization
– Meta-learning framework
• Future work
– Architecture
– Stable flow
– Improved training
𝒳_0 = {x_0^1, …, x_0^N}, 𝒳_0 ∼ π(x)
x_1^n = x_0^n + ∫_0^T f(𝒳_0, o_1, x(t), t) dt
𝒳_1 = {x_1^1, …, x_1^N}, 𝒳_1 ∼ p(x|o_1)