
Optimization perspective on approximate Bayesian inference

Juho Kim

December 6, 2016

Project goals

• Solve an approximate Bayesian inference problem from the perspective of optimization.

• Consider variational Bayesian inference based on various divergence measures.

• Analyze convergence of each optimization empirically and theoretically (if possible).

Inference problem

Given a dataset $y = \{y_1, \dots, y_n\}$, Bayes' rule gives the posterior:

$p(\theta \mid y) = \dfrac{p(y \mid \theta)\, p(\theta)}{p(y)}$

Computing this posterior distribution is known as the inference problem.

But the evidence

$p(y) = \int p(y, \theta)\, d\theta$

is an integral that can be very high-dimensional and difficult to compute.

Approximate Bayesian inference

There are two broad approaches to approximate inference: sampling-based methods (e.g., Markov chain Monte Carlo) and optimization-based methods (e.g., variational inference). They have complementary strengths and weaknesses.

Variational Bayesian inference

In variational Bayesian inference, we

• Find an approximate and tractable density that is maximally similar to the true posterior distribution.

• Formulate this density estimation problem as an optimization problem.

We can use the Kullback-Leibler (KL) divergence as the similarity measure:

$\mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta \mid y)\right) = \int q(\theta)\,\log\frac{q(\theta)}{p(\theta \mid y)}\,d\theta$

Then we minimize the KL divergence with respect to $q$.

But we still cannot compute this divergence directly, because it depends on the intractable posterior $p(\theta \mid y)$ (and hence on the evidence $p(y)$).

Variational lower-bound

We can solve an equivalent optimization problem: since $\log p(y)$ does not depend on $q$, minimizing the KL divergence is the same as maximizing what is left once that term is dropped.

Removing the intractable $\log p(y)$ term gives the variational lower bound, also called the evidence lower bound (ELBO).
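Written out as a sketch, the standard decomposition behind this step is:

$\mathrm{KL}\!\left(q(\theta)\,\|\,p(\theta \mid y)\right) = \log p(y) - \underbrace{\mathbb{E}_{q}\!\left[\log p(y, \theta) - \log q(\theta)\right]}_{\mathrm{ELBO}(q)}$

so maximizing $\mathrm{ELBO}(q)$ over $q$ is equivalent to minimizing the KL divergence, and $\mathrm{ELBO}(q) \le \log p(y)$ because the KL term is nonnegative.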

Stochastic variational inference

Suppose the joint distribution factorizes over the data points:

$p(\theta, D) = p_0(\theta) \prod_{n} p(y_n \mid \theta)$

We can then run stochastic (natural) gradient descent on this optimization problem, i.e., stochastic variational inference.
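A minimal sketch of this idea in Python, assuming a toy conjugate model ($y_n \sim \mathcal{N}(\theta, \sigma^2)$ with a Gaussian prior), a Gaussian $q$, reparameterized single-sample gradients, and plain SGD in place of natural gradients; none of these choices come from the talk.

# A minimal sketch of stochastic variational inference with reparameterized
# gradients. Everything below is an illustrative assumption, not the talk's
# setup: the model is y_n ~ N(theta, sigma^2) with prior theta ~ N(0, tau^2),
# q(theta) = N(m, exp(rho)^2), and plain SGD replaces natural gradients.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data set.
N, sigma, tau = 1000, 1.0, 5.0
y = rng.normal(2.0, sigma, size=N)

# Variational parameters of q(theta) = N(m, s^2) with s = exp(rho).
m, rho = 0.0, 0.0
lr, batch_size, n_steps = 1e-4, 50, 10000

for step in range(n_steps):
    batch = rng.choice(N, size=batch_size, replace=False)
    s = np.exp(rho)

    # Reparameterization trick: theta = m + s * eps with eps ~ N(0, 1).
    eps = rng.normal()
    theta = m + s * eps

    # Minibatch estimate of d/dtheta [log p0(theta) + (N/|B|) sum_n log p(y_n | theta)].
    dlogjoint = -theta / tau**2 + (N / batch_size) * np.sum(y[batch] - theta) / sigma**2

    # Single-sample gradients of the ELBO estimate w.r.t. m and rho.
    # The entropy of q is 0.5 * log(2*pi*e) + rho, so its rho-gradient is 1.
    grad_m = dlogjoint              # d theta / d m = 1
    grad_rho = dlogjoint * eps * s + 1.0

    # Stochastic gradient ascent on the ELBO (= descent on the negative ELBO).
    m += lr * grad_m
    rho += lr * grad_rho

print(f"q(theta) ~= N(mean={m:.3f}, std={np.exp(rho):.3f})")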

When KL divergence does not work well

• Variational inference does not work well for non-smooth potentials.

• The KL divergence tends to underestimate the support of the posterior because of its zero-forcing behavior.

→ The optimal variational distribution $q$ is forced to zero wherever $p(\theta \mid y) = 0$, since the divergence would otherwise be infinite when $p(\theta \mid y) = 0$ and $q(\theta) > 0$.

• In such cases, the result of variational inference can collapse to (nearly) a delta function.
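Written out, the zero-forcing argument from the bullets above is:

$\mathrm{KL}\!\left(q \,\|\, p(\cdot \mid y)\right) = \int q(\theta)\,\log\dfrac{q(\theta)}{p(\theta \mid y)}\,d\theta \;\to\; +\infty \quad \text{whenever } q(\theta) > 0 \text{ on a region where } p(\theta \mid y) = 0,$

so the optimal $q$ places zero mass wherever the posterior does. The reverse divergence $\mathrm{KL}(p \,\|\, q)$ instead blows up when $q(\theta) = 0$ but $p(\theta \mid y) > 0$, which is why it is mass-covering and tends to overestimate the support, the issue discussed on the next slides.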

When KL divergence does not work well

One possible solution to this issue:

→ Use a different optimization formulation based on another divergence measure: expectation propagation (EP).

EP minimizes $\mathrm{KL}(p \,\|\, q)$ instead of $\mathrm{KL}(q \,\|\, p)$.

EP also has issues

EP tends to overestimate the support of the original distribution.

→ Try other divergence measures, such as Rényi's alpha divergence, the f-divergence, operator-based divergences, etc.

Alternative 1 – alpha divergence

• The two forms of KL divergence are members of the alpha-divergence family:
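One standard (Amari-style) parameterization of the alpha-divergence family, chosen here because it matches the later slides' identification of $\alpha = 0$ with $\mathrm{KL}(q\,\|\,p)$ and $\alpha = 1$ with $\mathrm{KL}(p\,\|\,q)$, is:

$D_{\alpha}(p \,\|\, q) = \dfrac{1}{\alpha(1-\alpha)} \int \left( \alpha\, p(\theta) + (1-\alpha)\, q(\theta) - p(\theta)^{\alpha}\, q(\theta)^{1-\alpha} \right) d\theta$

with $\lim_{\alpha \to 0} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(q \,\|\, p)$ and $\lim_{\alpha \to 1} D_{\alpha}(p \,\|\, q) = \mathrm{KL}(p \,\|\, q)$.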

Inference based on alpha divergence

Following the idea of variational inference, we can solve an equivalent optimization problem.

For $\alpha \neq 1$, we can derive a lower bound analogous to the ELBO.

Inference based on alpha divergence

• Unfortunately, this lower bound is less tractable than the ELBO.

• Apply Monte Carlo methods to estimate the lower bound: draw $\theta_k \sim q(\theta)$, $k = 1, \dots, K$, and average over the samples (see the sketch below).

• Future work: Find a stable gradient-based optimization method.
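A minimal sketch of this Monte Carlo estimate, under an explicit assumption: the bound used here is the variational Rényi bound of Li and Turner (2016), $\mathcal{L}_{\alpha} = \frac{1}{1-\alpha} \log \mathbb{E}_{q}\!\left[ \left( p(\theta, y)/q(\theta) \right)^{1-\alpha} \right]$, which stands in for the slide's bound and recovers the ELBO as $\alpha \to 1$, unlike the slides' convention; the toy densities below are likewise illustrative.

# A sketch of the Monte Carlo estimate: draw theta_k ~ q and average. The bound
# used here is the variational Renyi bound of Li & Turner (2016),
#   L_alpha = 1/(1-alpha) * log E_q[(p(theta, y) / q(theta))^(1-alpha)],
# which is an assumed stand-in for the slide's bound. Toy densities are illustrative.
import numpy as np
from scipy.special import logsumexp

def mc_alpha_bound(log_joint, log_q, sample_q, alpha, K=100):
    """Estimate L_alpha from K samples theta_k ~ q (requires alpha != 1)."""
    thetas = sample_q(K)
    log_w = np.array([log_joint(t) - log_q(t) for t in thetas])  # log p(theta,y) - log q(theta)
    # 1/(1-alpha) * log( (1/K) * sum_k exp((1-alpha) * log_w_k) ), computed in log space.
    return (logsumexp((1.0 - alpha) * log_w) - np.log(K)) / (1.0 - alpha)

# Hypothetical usage: q = N(0, 1); the "joint" is N(theta; 1, 1), so log p(y) = 0.
rng = np.random.default_rng(0)
log_q = lambda t: -0.5 * (t**2 + np.log(2.0 * np.pi))
log_joint = lambda t: -0.5 * ((t - 1.0)**2 + np.log(2.0 * np.pi))
sample_q = lambda K: rng.normal(0.0, 1.0, size=K)
print(mc_alpha_bound(log_joint, log_q, sample_q, alpha=0.5))  # should be <= 0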

Simple experiment

1. Estimate a polynomial function.

2. Estimate a 2D Gaussian distribution.

Simple experiment

[Figures: fitted results for alpha = -1, -0.5, 0 (the same as KL-divergence minimization), 0.5, and 1 (the same as expectation propagation).]

Alternative 2 – chi-square divergence

Minimizing the chi-square divergence is equivalent to minimizing a quantity that is an upper bound on the (log) model evidence.

By maximizing the ELBO and minimizing the chi-square upper bound together, we sandwich the evidence and might estimate the distribution more accurately.
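One standard form of such a bound, written here as a sketch (this exact formulation is an assumption; it matches the chi-square upper bound, or CUBO, of Dieng et al., 2017):

$\chi^2\!\left(p(\theta \mid y)\,\|\,q(\theta)\right) = \mathbb{E}_{q}\!\left[\left(\frac{p(\theta \mid y)}{q(\theta)}\right)^{2}\right] - 1, \qquad \mathcal{U}(q) = \tfrac{1}{2}\,\log \mathbb{E}_{q}\!\left[\left(\frac{p(\theta, y)}{q(\theta)}\right)^{2}\right] \;\ge\; \log p(y),$

since $\mathbb{E}_q[(p(\theta, y)/q(\theta))^2] = p(y)^2\,\bigl(1 + \chi^2(p(\theta \mid y)\,\|\,q)\bigr)$; minimizing $\mathcal{U}(q)$ is therefore equivalent to minimizing the chi-square divergence, and together with the ELBO it brackets $\log p(y)$ from both sides.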

Alternative 3 – f-divergence

$D_f(p \,\|\, q) = \int q(\theta)\, f\!\left(\frac{p(\theta)}{q(\theta)}\right) d\theta$, where $f : \mathbb{R}_{+} \to \mathbb{R}$ is a convex, lower-semicontinuous function with $f(1) = 0$.
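As a quick worked check that this family contains the divergences above, some standard choices of $f$:

$f(t) = t \log t \;\Rightarrow\; D_f(p\,\|\,q) = \mathrm{KL}(p\,\|\,q), \qquad f(t) = -\log t \;\Rightarrow\; D_f(p\,\|\,q) = \mathrm{KL}(q\,\|\,p), \qquad f(t) = (t-1)^2 \;\Rightarrow\; D_f(p\,\|\,q) = \chi^2(p\,\|\,q).$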

Conclusion

• Considered optimization-based variational Bayesian inference methods built on statistical divergences other than the KL divergence.

• Observed the behavior of inference methods based on the alpha divergence and the chi-square divergence.

Future work

• Suggest more stable gradient-based optimization methods by reducing the variance of the gradient estimates.

• Consider more general forms of divergence.

• Analyze convergence of each optimization theoretically (if possible).

Questions?
