Variational Inference for Dirichlet Process Mixture
Daniel Klein and Soravit Beer Changpinyo
October 11, 2011
Applied Bayesian Nonparametrics
Special Topics in Machine Learning
Brown University CSCI 2950-P, Fall 2011
Motivation
• WANTED! A systematic approach to sample from likelihoods and posterior distributions of DP mixture models
• Markov chain Monte Carlo (MCMC)
• Problems with MCMC
  o Can be slow to converge
  o Convergence can be difficult to diagnose
• One alternative: variational methods
Variational Methods: Big Picture
• An adjustable lower bound on the log likelihood, indexed by "variational parameters"
• Optimization problem: get the tightest lower bound
Outline
• Brief review: Dirichlet process mixture models
• Variational inference in exponential families
• Variational inference for DP mixtures
• Gibbs sampling (MCMC)
• Experiments
DP Mixture Models
From E.B. Sudderth’s slides
DP Mixture Models
Stick lengths = weights assigned to mixture components
Atoms representing mixture components (cluster parameters)
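As a concrete illustration (a minimal sketch, not from the slides; the Gaussian base measure and the truncation level are illustrative assumptions), the stick-breaking view of a DP can be sampled like this:

```python
import random

def stick_breaking(alpha, base_sample, truncation=100):
    """Draw truncated stick-breaking weights and atoms for DP(alpha, G0).

    Each weight pi_t = V_t * prod_{s<t} (1 - V_s) with V_t ~ Beta(1, alpha);
    atoms are i.i.d. draws from the base measure G0 via `base_sample`.
    """
    weights, atoms = [], []
    remaining = 1.0                          # length of stick not yet broken off
    for _ in range(truncation):
        v = random.betavariate(1.0, alpha)   # V_t ~ Beta(1, alpha)
        weights.append(remaining * v)        # stick length for this component
        atoms.append(base_sample())          # cluster parameter drawn from G0
        remaining *= (1.0 - v)               # shrink the remaining stick
    return weights, atoms

# Example: standard normal base measure
weights, atoms = stick_breaking(alpha=2.0, base_sample=lambda: random.gauss(0, 1))
```

Larger α breaks the stick into many small pieces (many active clusters); small α concentrates mass on the first few atoms.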
DP Mixture Models: Notation
Latent variables, hyperparameters, observations
DP Mixture Models: Notation
Latent variables: W = {V, η*, Z}
Hyperparameters: θ = {α, λ}
Observations: X
Variational Inference
The exact posterior is usually intractable,
so we approximate it by finding a lower bound on P(X | θ)
Variational Inference
Jensen’s inequality
Variational distribution
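Written out, the bound these two ingredients produce: for any variational distribution q over the latent variables W, Jensen's inequality gives

```latex
\log p(x \mid \theta)
  = \log \int q(w)\,\frac{p(x, w \mid \theta)}{q(w)}\,dw
  \;\ge\; \mathbb{E}_q\!\left[\log p(x, w \mid \theta)\right]
        - \mathbb{E}_q\!\left[\log q(w)\right]
```

The gap in the inequality is exactly KL(q(w) ‖ p(w | x, θ)), so maximizing the bound over q is the same as minimizing the KL divergence to the true posterior.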
Variational Inference
Add a constraint to q by introducing ν, "the free variational parameters": q := q_ν
Variational Inference
Variational Inference
How to choose the variational distribution qν(w) such that
the optimization of the bound is computationally tractable?
Typically, we break some dependencies between latent variables
Mean field variational approximations
Assume “fully factorized” variational distributions
q_ν(w) = ∏_{m=1}^{M} q_{ν_m}(w_m)
Mean Field Variational Inference
Assume fully factorized variational distributions
Mean Field Variational Inference
in Exponential Families
Further assume that p(w_i | w_-i, x, θ) is a member of the exponential family
Further assume that each variational factor q_{ν_i}(w_i) is a member of the exponential family
Mean Field Variational Inference
in Exponential Families
Further assume that p(w_i | w_-i, x, θ) is a member of the exponential family
Further assume that each variational factor q_{ν_i}(w_i) is a member of the exponential family
Mean Field Variational Inference
in Exponential Families: Coordinate Ascent
Maximize this with respect to each ν_i, holding the others fixed
Leads to an EM-like algorithm: iteratively update each ν_i in turn
This algorithm will find a local maximum of the above expression
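As a toy instance of this EM-like coordinate ascent (an illustrative sketch, not from the slides): mean-field inference for a single Gaussian with unknown mean and precision under conjugate priors, where each sweep sets one factor to the expected parameters of its complete conditional while the other is held fixed.

```python
import statistics

def cavi_normal(data, mu0=0.0, lam0=1e-3, a0=1e-3, b0=1e-3, iters=50):
    """Mean-field coordinate ascent for a Gaussian with unknown mean and
    precision, under the factorization q(mu, tau) = q(mu) q(tau) with
    priors p(mu | tau) = N(mu0, (lam0*tau)^-1), p(tau) = Gamma(a0, b0)."""
    n = len(data)
    xbar = statistics.fmean(data)
    e_tau = a0 / b0                        # initial guess for E[tau]
    a_n = a0 + (n + 1) / 2.0               # shape update is fixed across sweeps
    for _ in range(iters):
        # Update q(mu) = N(mu_n, 1/lam_n) given the current E[tau]
        mu_n = (lam0 * mu0 + n * xbar) / (lam0 + n)
        lam_n = (lam0 + n) * e_tau
        # Update q(tau) = Gamma(a_n, b_n) given the current q(mu)
        ss = sum((x - mu_n) ** 2 for x in data) + n / lam_n
        b_n = b0 + 0.5 * (ss + lam0 * ((mu_n - mu0) ** 2 + 1.0 / lam_n))
        e_tau = a_n / b_n
    return mu_n, lam_n, a_n, b_n

data = [4.9, 5.1, 5.0, 4.8, 5.2, 5.05, 4.95, 5.1, 4.9, 5.0]
mu_n, lam_n, a_n, b_n = cavi_normal(data)
```

Each update only needs expectations under the other factor, which is exactly what the exponential-family assumption buys.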
Recap: Mean Field Variational Inference
in Exponential Families
p(w_i | w_-i, x, θ) is a member of the exponential family
Some calculus
Fully factorized variational distributions
A local maximum of the variational lower bound
Update Equation and Other Inference Methods
• Like Gibbs sampling: iteratively pick a component to update using the exclude-one conditional distribution
  o Gibbs walks on states that approach a sample from the true posterior
  o VDP walks on distributions that approach a locally best approximation to the true posterior
• Like EM: fit a lower bound to the true posterior
  o EM maximizes, VDP marginalizes
  o May find local maxima
Figure from Bishop (2006)
Aside: Derivation of Update Equation
• Nothing deep involved...
o Expansion of variational lower bound using chain rule for expectations
o Set derivative equal to zero and solve
o Take advantage of exponential form of exclude-one conditional distribution
o Everything cancels...except the update equation
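A sketch of what survives the cancellation, in the notation used above: if each exclude-one conditional has exponential-family form

```latex
p(w_i \mid w_{-i}, x, \theta)
  = h(w_i)\,\exp\!\big\{ g_i(w_{-i}, x, \theta)^{\top} w_i
      - a\big(g_i(w_{-i}, x, \theta)\big) \big\}
```

then setting the derivative of the bound with respect to ν_i to zero yields the coordinate update ν_i = E_q[g_i(W_-i, x, θ)]: each factor's natural parameter is set to the expected natural parameter of its exclude-one conditional.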
Aside: Which Kullback-Leibler Divergence?
KL(q||p) KL(p||q)
To minimize the reverse KL divergence (when q factorizes), just match the marginals.
Minimizing the reverse KL is the approach taken in expectation propagation.
Figures from Bishop (2006)
Aside: Which Kullback-Leibler Divergence?
• Minimizing KL divergence is "zero-forcing"
• Minimizing reverse KL divergence is "zero-avoiding"
KL(q||p) KL(p||q)
Figures from Bishop (2006)
Applying Mean-Field Variational Inference to DP Mixtures
• "Mean field variational inference in exponential families"
  o But we're in a mixture model, which can't be an exponential family!
• Enough that the exclude-one conditional distributions are in the exponential family. Examples:
  o Hidden Markov models
  o Mixture models
  o State space models
  o Hierarchical Bayesian models with (mixtures of) conjugate priors
Variational Lower Bound for DP Mixtures
• Plug in the DP mixture posterior distribution
  o Taking the log so expectations factor...
  o Shouldn't the emission term depend on η*?
• Last term has implications for choice of variational distribution
Picking the Variational Distribution
• Obviously, we want to break dependencies
• Must the factors be exponential families?
  o In some cases, the optimum must be!
• Proof using calculus of variations
  o Easier to compute integrals for lower bound
  o Guarantee of optimal parameters
• Mapping between canonical and moment parameters
• Beta, exponential family, and multinomial distributions, respectively
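In the paper's notation (with truncation level T), the fully factorized family built from those three pieces is

```latex
q_{\nu}(v, \eta^{*}, z)
  = \prod_{t=1}^{T-1} \mathrm{Beta}(v_t \mid \gamma_{t,1}, \gamma_{t,2})
    \;\prod_{t=1}^{T} q_{\tau_t}(\eta^{*}_t)
    \;\prod_{n=1}^{N} \mathrm{Mult}(z_n \mid \phi_n)
```

where each q_{τ_t} is the conjugate exponential-family distribution for the atoms, and the truncation fixes q(v_T = 1) = 1.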
Coordinate Ascent
• Analogy to EM: we might get stuck in local maxima
Coordinate Ascent: Derivation
• Relies on clever use of indicator functions and their properties
• All the terms in the truncation have closed-form expressions
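To make the closed-form updates concrete, here is a minimal sketch of truncated coordinate ascent for a DP mixture of unit-variance 1-D Gaussians. The modeling choices (known unit variance, Gaussian prior on means, truncation level T, a hand-rolled digamma) are simplifying assumptions, not the paper's full generality.

```python
import math

def digamma(x):
    """Psi function via recurrence plus asymptotic series (adequate for x > 0)."""
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1/12 - f * (1/120 - f / 252))

def cavi_dp_gmm(data, T=5, alpha=1.0, prior_var=100.0, iters=100):
    """Truncated mean-field CAVI for a DP mixture of unit-variance Gaussians:
    q(V_t) = Beta(g1, g2), q(mu_t) = N(m, v), q(z_n) = Mult(phi_n)."""
    n = len(data)
    lo, hi = min(data), max(data)
    m = [lo + t * (hi - lo) / (T - 1) for t in range(T)]   # spread initial means
    v = [prior_var] * T
    g1, g2 = [1.0] * (T - 1), [alpha] * (T - 1)
    phi = [[1.0 / T] * T for _ in range(n)]
    for _ in range(iters):
        # E[log pi_t] from the stick-breaking Beta factors (V_T := 1)
        elog_v = [digamma(g1[t]) - digamma(g1[t] + g2[t]) for t in range(T - 1)] + [0.0]
        elog_1mv = [digamma(g2[t]) - digamma(g1[t] + g2[t]) for t in range(T - 1)]
        elog_pi = [elog_v[t] + sum(elog_1mv[:t]) for t in range(T)]
        # Update q(z_n): responsibilities via a stable softmax
        for i, x in enumerate(data):
            s = [elog_pi[t] - 0.5 * ((x - m[t]) ** 2 + v[t]) for t in range(T)]
            mx = max(s)
            w = [math.exp(si - mx) for si in s]
            z = sum(w)
            phi[i] = [wi / z for wi in w]
        # Update q(V_t): Beta parameters from expected counts
        nk = [sum(phi[i][t] for i in range(n)) for t in range(T)]
        for t in range(T - 1):
            g1[t] = 1.0 + nk[t]
            g2[t] = alpha + sum(nk[t + 1:])
        # Update q(mu_t): conjugate Gaussian posterior given soft assignments
        for t in range(T):
            prec = 1.0 / prior_var + nk[t]
            m[t] = sum(phi[i][t] * data[i] for i in range(n)) / prec
            v[t] = 1.0 / prec
    return m, phi

data = [-0.2, 0.1, 0.0, 0.3, -0.1, 9.8, 10.1, 10.0, 9.9, 10.2]
m, phi = cavi_dp_gmm(data)
```

On this toy data the responsibilities concentrate on two components, one per cluster, with the rest of the truncated sticks starved of mass.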
Predictive Distribution
• Under the variational approximation, the distribution of atoms and the (truncated) distribution of stick lengths decouple
• Weighted sum of predictive distributions
• Suggestive of a MC approximation
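In symbols, the variational predictive distribution is the weighted sum

```latex
p(x_{N+1} \mid x, \alpha, \lambda)
  \approx \sum_{t=1}^{T} \mathbb{E}_q\!\left[\pi_t(V)\right]
          \,\mathbb{E}_q\!\left[p(x_{N+1} \mid \eta^{*}_t)\right]
```

where π_t(V) = V_t ∏_{s<t}(1 − V_s); the decoupling of stick lengths and atoms under q is what lets the expectation factor into this product.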
Extensions
• Prior as mixture of conjugate distributions
• Placing a prior on the scaling parameter α
  o Continue complete factorization...
  o Natural to place a Gamma prior on α
  o Update equation no more difficult than the others
  o No modification needed to predictive distribution!
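Sketching that α-update: with prior p(α) = Gamma(s_1, s_2) and the factorization extended by a factor for α, the optimal factor is again Gamma, q(α) = Gamma(w_1, w_2), with

```latex
w_1 = s_1 + T - 1, \qquad
w_2 = s_2 - \sum_{t=1}^{T-1} \mathbb{E}_q\!\left[\log(1 - V_t)\right]
```

and E_q[α] = w_1 / w_2 simply replaces α in the Beta updates for the stick-breaking proportions.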
Empirical Comparison: The Competition
• Collapsed Gibbs sampler (MacEachern 1994)
  o "CDP"
  o Predictive distribution as average of predictive distributions from MC samples
  o Best suited for conjugate priors
• Blocked Gibbs sampler (Ishwaran and James 2001)
  o "TDP"
  o Recall: posterior distribution gets truncated
  o Surface similarities to VDP in updates for Z, V, η*
  o Predictive distribution integrates out everything but Z
• Surprise: TDP beats CDP on autocorrelation of the size of the largest component
Empirical Comparison
Empirical Comparison
Empirical Comparison
Empirical Comparison: Summary
• Deterministic
• Fast
• Easy to assess convergence
• Sensitive to initialization (= local maximum)
• Approximate
Image Analysis
MNIST: Hand-written digits
Kurihara, Welling, and Vlassis 2006
MNIST: Hand-written digits
Kurihara, Welling, and Teh 2007
"Variational approximations are much more efficient computationally than Gibbs sampling, with almost no loss in accuracy"
Questions?
Acknowledgement
• http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20071022a.pdf
• http://www.cs.princeton.edu/courses/archive/fall07/cos597C/scribe/20071022b.pdf