Variational Inference and Learning - University at Buffalo
cedar.buffalo.edu/~srihari/cse676/19.4...
TRANSCRIPT
Topics in Approximate Inference
• Task of Inference
• Intractability in Inference
1. Inference as Optimization
2. Expectation Maximization
3. MAP Inference and Sparse Coding
4. Variational Inference and Learning
5. Learned Approximate Inference
Setup for Variational Inference
[Figure: graphical model with observed variable v connected to latent variable h, annotated with log p(v; θ) and DKL(q(h|v)||p(h|v; θ))]
• Observed variables v and latent variables h
• Known joint distribution p(v, h; θ)
• We would like to compute log p(v; θ) so as to choose θ to increase it
  – It is too costly to marginalize h using p(v; θ) = Σ_h p(v, h; θ)
• Instead compute a lower bound 𝓛(v, θ, q) on log p(v; θ):
  𝓛(v, θ, q) = log p(v; θ) − DKL(q(h|v) || p(h|v; θ))
• Approximate inference is approximate optimization to find q
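As a minimal numerical sketch (not part of the slides; the toy joint distribution and the choice of q below are hypothetical), the following checks that the bound 𝓛 = log p(v; θ) − DKL(q(h|v)||p(h|v)) coincides with the more familiar ELBO form E_q[log p(v, h)] + H(q):

```python
# Minimal numerical sketch (not from the slides): for a hypothetical two-variable
# discrete model, check that L(v, theta, q) = log p(v) - D_KL(q(h|v) || p(h|v))
# equals the usual ELBO form E_q[log p(v, h)] + H(q).
import numpy as np

p_joint = np.array([[0.3, 0.1],      # toy joint p(v, h); rows index v, columns index h
                    [0.2, 0.4]])
v = 0                                # the observed value of v

p_v = p_joint[v].sum()               # p(v) = sum_h p(v, h)  (the costly marginalization)
p_h_given_v = p_joint[v] / p_v       # true posterior p(h | v)

q = np.array([0.6, 0.4])             # an arbitrary approximate posterior q(h | v)

kl = np.sum(q * (np.log(q) - np.log(p_h_given_v)))
lower_bound = np.log(p_v) - kl                                  # log p(v) - KL(q || p)
elbo = np.sum(q * np.log(p_joint[v])) - np.sum(q * np.log(q))   # E_q[log p(v,h)] + H(q)

print(np.isclose(lower_bound, elbo))   # True: the two expressions agree
print(lower_bound <= np.log(p_v))      # True: L is a lower bound on log p(v)
```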
Summary of Approximate Inference
• ELBO 𝓛(v, θ, q) is a lower bound on log p(v; θ)
  – Inference viewed as maximizing 𝓛 wrt q
  – Learning viewed as maximizing 𝓛 wrt θ
1. EM allows large learning steps with a fixed q
2. Learning based on MAP inference
  – Learn using a point estimate of p(h|v) rather than inferring the entire distribution
3. Next: a more general approach to variational learning
Common Variational Approaches
Variational: Discrete vs Continuous
• Specifying how to factorize q means:
  – Discrete case:
    • Optimize a finite number of variables describing the q distribution
  – Continuous case:
    • Use calculus of variations over a space of functions
    • Determine which function should be used to represent q
• Although calculus of variations is not used in the discrete case, the approach is still called variational
• Calculus of variations is powerful
  – Only need to specify the factorization, not the design of q
Maximizing 𝓛 means fitting q to p
• Maximizing 𝓛 wrt q is equivalent to minimizing DKL(q(h|v)||p(h|v))
• In this sense, we are fitting q to p
• However, we are doing so with the opposite direction of the KL divergence than we are used to for fitting an approximation
• When we use maximum likelihood learning to fit a model to data, we minimize DKL(pdata||pmodel)
Optimization matches low probabilities
• Maximum likelihood encourages the model to have high probability wherever the data has high probability
• Optimization-based inference encourages q to have low probability wherever the true posterior has low probability
  – Illustrated in the next figure
Asymmetry of KL divergence
• We wish to approximate p with q
  – We have a choice of using DKL(p||q) or DKL(q||p)
  – Bimodal Gaussian mixture for p and a single Gaussian for q
• (a) Effect of minimizing DKL(p||q)
  – Select a q that has high probability where p has high probability
  – q chooses to blur the two modes together
• (b) Effect of minimizing DKL(q||p)
  – Select a q that has low probability where p has low probability
  – Avoids putting probability mass in the low-probability region between the modes
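A rough numerical sketch of this asymmetry (not from the slides; the mixture parameters, grid, and initial guesses below are made up) fits a single Gaussian q to a bimodal p by minimizing each KL direction on a discretized grid:

```python
# Rough numerical sketch (not from the slides): fit a single Gaussian q to a
# bimodal p by minimizing each direction of KL divergence on a grid.
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

x = np.linspace(-10.0, 10.0, 2001)
dx = x[1] - x[0]
p = 0.5 * norm.pdf(x, -3.0, 1.0) + 0.5 * norm.pdf(x, 3.0, 1.0)   # bimodal target p

def kl(a, b):
    """Discretized KL(a || b) between two densities sampled on the grid."""
    return np.sum(a * (np.log(a + 1e-300) - np.log(b + 1e-300))) * dx

def gauss(params):
    mu, log_sigma = params
    return norm.pdf(x, mu, np.exp(log_sigma))

# (a) minimize KL(p || q): q spreads out to cover both modes (moment matching).
fit_pq = minimize(lambda t: kl(p, gauss(t)), x0=np.array([0.5, 0.0]))
# (b) minimize KL(q || p): q locks onto one mode, avoiding the low-probability gap.
fit_qp = minimize(lambda t: kl(gauss(t), p), x0=np.array([2.0, 0.0]))

print("argmin KL(p||q): mu ~", fit_pq.x[0], "sigma ~", np.exp(fit_pq.x[1]))  # ~0, ~3.2
print("argmin KL(q||p): mu ~", fit_qp.x[0], "sigma ~", np.exp(fit_qp.x[1]))  # ~3, ~1
```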
Choice of KL Direction
• The two directions of KL have different properties
• The choice depends on the application
  – For inference optimization use DKL(q(h|v)||p(h|v)) for computational reasons
    • Computing DKL(q(h|v)||p(h|v)) involves evaluating expectations wrt q, so by designing q to be simple, we can simplify the required expectations
  – The opposite direction of KL requires computing expectations wrt the true posterior
    • Because the form of the true posterior is determined by the choice of model, we cannot design a reduced-cost approach to computing DKL(p(h|v)||q(h|v)) exactly
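One way to see the computational point (a standard expansion, not shown on the slide):

\[
D_{\mathrm{KL}}\big(q(h\mid v)\,\|\,p(h\mid v)\big)
= \mathbb{E}_{h\sim q(h\mid v)}\big[\log q(h\mid v) - \log p(h, v)\big] + \log p(v;\theta)
\]

Since log p(v; θ) does not depend on q, minimizing over q only requires expectations under q of log q and log p(h, v), both of which are cheap to evaluate when q is designed to be simple.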
Topics in Variational Inference
1. Discrete Latent Variables
2. Calculus of Variations
3. Continuous Latent Variables
4. Interactions between Learning and Inference
• We discuss a subset of the above topics here
Calculus of Variations
• A function of a function f is known as a functional J[f]
• We can take functional derivatives wrt individual values of the function f(x) at any specific value of x
• Denoted by δ/δf(x) J[f]
Functional Derivative Identity
• For differentiable functions f(x) and differentiable functions g(y, x) with continuous derivatives:

  δ/δf(x) ∫ g(f(x), x) dx = ∂/∂y g(f(x), x)

• Intuition: f(x) is an infinite-dimensional vector indexed by x
  – The identity is the same as for a vector θ ∈ ℝⁿ indexed by positive integers:

  ∂/∂θi Σj g(θj, j) = ∂/∂θi g(θi, i)

• More general is the Euler-Lagrange equation
  – It allows g to depend on derivatives of f as well as the value of f, but this is not needed for deep learning
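As a quick sanity check (not on the slides), applying the identity to two simple integrands gives:

\[
J[f]=\int f(x)^2\,dx \;\Rightarrow\; \frac{\delta}{\delta f(x)}J = 2f(x),
\qquad
J[f]=\int f(x)\log f(x)\,dx \;\Rightarrow\; \frac{\delta}{\delta f(x)}J = 1+\log f(x).
\]

The second case is exactly the term that appears when differentiating the entropy part of the Lagrangian in the maximum entropy example below.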
Optimization wrt a vector
• Procedure to optimize a function wrt a vector:
  – Take the gradient of the function wrt the vector
  – Solve for the point where every element of the gradient is equal to zero
• Procedure to optimize a functional:
  – Solve for the function where the functional derivative at every point is equal to zero
Example of Optimizing a Functional
• Find the maximum entropy distribution over x ∈ ℝ
• Entropy of p(x) is defined as H[p] = −Eₓ log p(x)
• For continuous values the expectation is an integral:

  H[p] = −∫ p(x) log p(x) dx
Lagrangian constraints for maxent
1. To ensure a distribution, p(x) must integrate to 1
2. Since entropy increases without bound as variance grows, we seek the distribution with variance σ²
3. Since the problem is underdetermined
  – as the distribution can be arbitrarily shifted without changing its entropy, we fix the mean to be μ
• The Lagrangian functional is

  L[p] = λ1(∫ p(x) dx − 1) + λ2(E[x] − μ) + λ3(E[(x − μ)²] − σ²) + H[p]
       = ∫ (λ1 p(x) + λ2 p(x) x + λ3 p(x)(x − μ)² − p(x) log p(x)) dx − λ1 − μλ2 − σ²λ3
Functional derivative of the Lagrangian
• To minimize the Lagrangian wrt p, we set the functional derivative equal to 0 at every point:

  ∀x,  δ/δp(x) L = λ1 + λ2 x + λ3 (x − μ)² − 1 − log p(x) = 0

  – Each term follows from applying the identity δ/δf(x) ∫ g(f(x), x) dx = ∂/∂y g(f(x), x), with y = p(x), to the integrand of
    L[p] = ∫ (λ1 p(x) + λ2 p(x) x + λ3 p(x)(x − μ)² − p(x) log p(x)) dx − λ1 − μλ2 − σ²λ3,
    i.e. to g1(y, x) = y, g2(y, x) = yx, g3(y, x) = y(x − μ)², and g4(y, x) = y log y, using d/dy (y log y) = 1 + log y
• This condition tells us the functional form of p(x)
  – By algebraic rearrangement:

    p(x) = exp(λ1 + λ2 x + λ3 (x − μ)² − 1)

  – We never assumed directly that p(x) has this form
Choosing λ values
• We must choose the λ values so that all constraints are satisfied
• We may set λ1 = 1 − log(σ√(2π)), λ2 = 0, λ3 = −1/(2σ²) to obtain p(x) = N(x; μ, σ²)
• Because the normal distribution has the maximum entropy, using it when we do not know the true distribution imposes the least possible structure
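Spelling out the substitution (the intermediate steps are not on the slide), these λ values recover the Gaussian density:

\[
p(x)=\exp\!\big(\lambda_1+\lambda_2 x+\lambda_3(x-\mu)^2-1\big)
=\exp\!\Big(1-\log\!\big(\sigma\sqrt{2\pi}\big)-\tfrac{(x-\mu)^2}{2\sigma^2}-1\Big)
=\frac{1}{\sigma\sqrt{2\pi}}\exp\!\Big(-\tfrac{(x-\mu)^2}{2\sigma^2}\Big)
=\mathcal{N}(x;\mu,\sigma^2).
\]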
Continuous Latent Variables
• When our graphical model contains continuous latent variables, we can perform variational inference and learning by maximizing 𝓛
• We must now use calculus of variations when maximizing 𝓛 with respect to q(h|v)
• It is not necessary for practitioners to solve calculus of variations problems themselves
• Instead there is a general equation for mean-field fixed-point updates
Optimal q(hi|v) for Mean Field
• The optimal factor is given by the mean-field fixed-point equation (up to normalization):

  q(hi|v) = exp( E_{h−i ~ q(h−i|v)} log p(v, h) )

  – Valid so long as p does not assign 0 probability to any joint configuration of the variables
  – Carrying out the expectation yields the correct functional form of q(hi|v)
• Designed to be iteratively applied for each value of i until convergence
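An illustrative sketch of these iterative updates (not from the slides; the 2D Gaussian posterior below is a made-up textbook-style example): when the true posterior over h = (h1, h2) is a Gaussian with mean mu and precision matrix Lam, applying the fixed-point equation to a factorized q(h1)q(h2) yields Gaussian factors with precision Lam[i, i] whose means satisfy coupled updates:

```python
# Illustrative sketch (not from the slides): mean-field fixed-point updates for a
# made-up case where the true posterior p(h|v) over h = (h1, h2) is a 2D Gaussian
# with mean mu and precision matrix Lam. Each factor q(h_i) is Gaussian with
# precision Lam[i, i]; its mean is updated from the other factor's current mean.
import numpy as np

mu = np.array([1.0, -1.0])          # mean of the true posterior p(h | v)
Lam = np.array([[2.0, 0.8],
                [0.8, 1.5]])        # precision matrix of the true posterior

m = np.zeros(2)                     # current means of the factors q(h1), q(h2)
for _ in range(50):                 # iterate each update in turn until convergence
    m[0] = mu[0] - (Lam[0, 1] / Lam[0, 0]) * (m[1] - mu[1])
    m[1] = mu[1] - (Lam[1, 0] / Lam[1, 1]) * (m[0] - mu[0])

print(m)                            # -> [1., -1.]: the factor means match mu, while
                                    #    the factor variances 1/Lam[i, i] underestimate
                                    #    the true marginal variances of p(h | v)
```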
Ex: Latent h ∈ ℝ² and one visible variable v