
  • Machine Learning for Engineers: Chapter 10. Variational Inference and Variational Expectation Maximization

    Osvaldo Simeone

    King’s College London

    February 7, 2021

    Osvaldo Simeone ML4Engineers 1 / 149

  • This Chapter

    In the previous chapters, we have often noted the important role played by soft prediction, also known as Bayesian inference:

    I In Chapter 6, we have seen that, once a generative model p(x, t|θ) is trained, optimal prediction requires the computation of the optimal soft predictor, i.e., of the posterior p(t|x, θ) of the target t given the input x;

    I In Chapter 7, we have discussed that, in order to train a model p(x|θ) = E_{z∼p(z|θ)}[p(x|z, θ)] with latent variables z for unsupervised learning, the EM algorithm requires the computation of the posterior p(z|x, θ) at each iteration;

    I and, similarly, in order to train a mixture model p(t|x, θ) = E_{z∼p(z|θ)}[p(t|x, z, θ)] for supervised learning, the EM algorithm requires the computation of the posterior p(z|x, t, θ) at each iteration.

    Osvaldo Simeone ML4Engineers 2 / 149

  • This Chapter

    What can be done when the calculation of the posterior is too complex?

    In this chapter, we first review cases in which the posterior can be computed in closed form, and then cover approximate Bayesian inference methods based on variational inference (VI).

    We will also study the application of variational inference to the training of latent-variable models, introducing the variational EM (VEM) algorithm, as well as variational autoencoders as a special case of VEM.

    We will finally discuss generalized VI and VEM algorithms.

    Osvaldo Simeone ML4Engineers 3 / 149

  • Overview

    Exact Bayesian inference

    Variational inference

    Mean-field variational inference

    Black-box variational inference

    Reparametrization-based variational inference

    Other variational inference methods

    Non-parametric variational inference

    Amortized variational inference

    Variational EM

    Generalized Bayesian inference, generalized VI, and generalized VEM

    Osvaldo Simeone ML4Engineers 4 / 149

  • Exact Bayesian Inference

    Osvaldo Simeone ML4Engineers 5 / 149

  • Bayesian Inference

    Consider a joint distribution p(x, z) over an observable variable x and a latent variable z. Bayesian inference amounts to the computation of the posterior distribution

    p(z|x) = p(x, z) / p(x).

    This requires marginalizing over the latent variables in order to obtain the marginal distribution p(x) = E_{z∼p(z)}[p(x|z)]. This computation is only feasible

    I if the latent vector is low-dimensional, so that the expectation can be evaluated numerically;

    I or if the joint distribution has a special structure that enables an explicit analytical evaluation of the expectation.

    Osvaldo Simeone ML4Engineers 6 / 149

  • Example

    Consider the joint distribution p(x, z) = p(z)p(x|z) defined as

    z ∼ p(z) = N(z|0, I),
    x|z = z ∼ p(x|z) = N(x|µ(z), I),

    where the mean vector µ(z) is a function of the latent variable z.

    Computing the posterior p(z|x) can be done in closed form if µ(z) = Az for some matrix A. We have encountered this example in Chapter 3, where we have seen that the posterior is also Gaussian. We refer to this example as the Gaussian-Gaussian model.

    If µ(z) is a non-linear function of z, e.g., defined by a neural network, the problem of computing p(z|x) for any given value of x is of prohibitive complexity, unless the dimension of z is small so that the integration in the expectation p(x) = E_{z∼p(z)}[p(x|z)] can be well approximated using numerical integration or Monte Carlo techniques.

    Osvaldo Simeone ML4Engineers 7 / 149
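Since the latent vector in this example is low-dimensional, the marginal p(x) = E_{z∼p(z)}[p(x|z)] can indeed be approximated by plain Monte Carlo integration. The short Python sketch below illustrates this for a two-dimensional latent vector and a hypothetical non-linear mean function µ(z) = tanh(Az); the matrix A, the observation x, and the sample size are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonlinear Gaussian model: z ~ N(0, I_2), x|z ~ N(mu(z), I_2)
# with mu(z) = tanh(A z). A and x are illustrative choices.
A = np.array([[1.0, -0.5], [0.3, 0.8]])
x = np.array([0.2, -0.4])

def likelihood(z):
    """N(x | mu(z), I) density in two dimensions."""
    mu = np.tanh(A @ z)
    return np.exp(-0.5 * np.sum((x - mu) ** 2)) / (2.0 * np.pi)

# Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)], feasible because z is 2D.
S = 100_000
zs = rng.standard_normal((S, 2))
p_x = np.mean([likelihood(z) for z in zs])
print("Monte Carlo estimate of p(x):", p_x)
```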

  • Conjugate Exponential Family

    The Gaussian-Gaussian model is a special case of a general class of structured joint distributions that admit an efficient computation of the posterior: the conjugate exponential family.

    For joint distributions in the conjugate exponential family, the posterior p(z|x) is in the same class of distributions as the prior p(z). As an example, for the Gaussian-Gaussian model, the posterior is Gaussian like the prior.

    Osvaldo Simeone ML4Engineers 8 / 149

  • Conjugate Exponential Family

    Consider a likelihood p(x|z) that, for every fixed value of z, takes the form of the exponential-family distribution

    p(x|z) = exp( η_l(z)^T s_l(x) + M_l(x) − A_l(z) ),

    with D × 1 sufficient statistics vector s_l(x) and log-base measure function M_l(x). The subscript l indicates that these quantities are related to the likelihood.

    The dependence of rv x on rv z is through the D × 1 natural parameter vector η_l(z), which is a general function of z.

    The log-partition function A_l(η_l(z)) is denoted as A_l(z) for simplicity of notation.

    Osvaldo Simeone ML4Engineers 9 / 149

  • Conjugate Exponential Family

    Let us now assume that the prior distribution p(z) also belongs to a, generally different, class of distributions in the exponential family. Accordingly, we write

    p(z) = exp( η_p^T s_p(z) + M_p(z) − A_p(η_p) ),

    where the subscript p identifies natural parameters, sufficient statistics, log-base measure function, and log-partition function for the prior.

    We assume that the prior p(z) is a standard distribution for which the log-partition function A_p(η_p) is available in closed form, and hence evaluating p(z) does not require any numerical integration.

    As we discussed in Chapter 9, this class of distributions includes many standard choices for both discrete and continuous distributions.

    Osvaldo Simeone ML4Engineers 10 / 149

  • Conjugate Exponential Family

    Now, if we compute the posterior, we obtain

    p(z|x) ∝ p(z) × p(x|z)   (prior × likelihood)
          ∝ exp( s_l(x)^T η_l(z) + η_p^T s_p(z) + M_p(z) − A_l(z) ),

    where we have highlighted only the terms dependent on z.

    For a fixed x, this distribution is generally not easy to normalize, since the sufficient statistics (η_l(z), s_p(z)) and log-base measure M_p(z) − A_l(z) do not correspond to one of the standard distributions in the exponential family.

    Osvaldo Simeone ML4Engineers 11 / 149

  • Conjugate Exponential Family

    This is, however, not the case if we choose the sufficient statistics of the prior as the (D + 1) × 1 vector

    s_p(z) = [ η_l(z)^T, −A_l(z) ]^T.

    This choice of prior is said to be conjugate with respect to the likelihood p(x|z) defined above.

    Osvaldo Simeone ML4Engineers 12 / 149

  • Conjugate Exponential Family

    Plugging this choice into the expression for the posterior, we have

    p(z|x) ∝ exp( s_l(x)^T η_l(z) + η_p^T s_p(z) + M_p(z) − A_l(z) )
          ∝ exp( η_post(x)^T s_p(z) + M_p(z) ),

    where

    η_post(x) = η_p + [ s_l(x)^T, 1 ]^T.

    The posterior is hence in the same class of distributions as the prior, which is characterized by the pair (s_p(z), M_p(z)) of sufficient statistics and log-base measure function.

    Osvaldo Simeone ML4Engineers 13 / 149

  • Conjugate Exponential Family

    Computing the posterior only requires obtaining the natural parameters η_post(x) by adding to the natural parameters η_p of the prior a (D + 1) × 1 vector that consists of the sufficient statistics of the likelihood, s_l(x), and of 1 as the last component. It is emphasized that, since the prior is easy to normalize, so is the posterior.

    For every likelihood in the exponential family, one can always find a conjugate prior.

    The Gaussian-Gaussian model is one example. We will now see another.

    Osvaldo Simeone ML4Engineers 14 / 149

  • Beta-Bernoulli Model

    The beta-Bernoulli model consists of a Bernoulli likelihood with a beta prior, which produces a beta posterior.

    The likelihood p(x|z) = Bern(x|z) is Bernoulli, with the latent variable z identifying the mean parameter, i.e., the probability of the event x = 1 conditioned on z = z: we have p(x = 1|z) = z and p(x = 0|z) = 1 − z. Using the notation introduced above, we have

    p(x|z) = Bern(x|z) = exp( log( z/(1 − z) ) · x − ( −log(1 − z) ) ),

    where η_l(z) = log(z/(1 − z)), s_l(x) = x, and A_l(z) = −log(1 − z).

    Osvaldo Simeone ML4Engineers 15 / 149

  • Beta-Bernoulli Model

    Therefore, the conjugate prior has sufficient statistics

    s_p(z) = [ log( z/(1 − z) ), log(1 − z) ]^T,

    and, writing η_p = [η_p,1, η_p,2]^T for the vector of natural parameters of the prior, the conjugate prior is given by

    p(z) ∝ exp( η_p,1 log( z/(1 − z) ) + η_p,2 log(1 − z) )
         = z^{η_p,1} (1 − z)^{−η_p,1 + η_p,2}
         = z^{a−1} (1 − z)^{b−1}
         =: Beta(z|a, b),

    which corresponds to a beta distribution with parameters a := η_p,1 + 1 and b := −η_p,1 + η_p,2 + 1.

    Osvaldo Simeone ML4Engineers 16 / 149

  • Beta-Bernoulli Model

    The parameters (a, b) define the shape of the Beta prior, as seen in the figure. Note that the support of the Beta distribution is the interval [0, 1].

    [Figure: Beta densities Beta(z|a, b) on the interval [0, 1] for different choices of the parameters (a, b).]

    Osvaldo Simeone ML4Engineers 17 / 149

  • Beta-Bernoulli Model

    The average and the mode, i.e., the value with the maximum density, of the beta distribution are given as

    E_{z∼Beta(a,b)}[z] = a / (a + b),
    mode_{z∼Beta(a,b)}[z] = (a − 1) / (a + b − 2)   when a, b > 1.

    Therefore, increasing a skews the distribution towards larger values of z, while increasing b skews the distribution towards smaller values of z.

    Furthermore, the variance of the prior, Var[z] = ab/((a + b)^2 (a + b + 1)), decreases with a + b.

    Note that the beta prior is a distribution over the space of probabilities, i.e., over the interval [0, 1].

    Osvaldo Simeone ML4Engineers 18 / 149

  • Beta-Bernoulli Model

    Due to conjugacy, the posterior is also a beta distribution, with natural parameters

    η_post(x) = η_p + [ s_l(x), 1 ]^T = η_p + [ x, 1 ]^T.

    Using the relationship between the natural parameters and the parameters (a, b) of a beta distribution, we can hence write the posterior as p(z|x) = Beta(z|a_post(x), b_post(x)), with

    a_post(x) = η_p,1 + x + 1 = a + x,
    b_post(x) = −η_p,1 − x + η_p,2 + 1 + 1 = b + (1 − x).

    Osvaldo Simeone ML4Engineers 19 / 149

  • Beta-Bernoulli Model

    In summary, we have

    p(z|x) = Beta(z|a + 1(x = 1), b + 1(x = 0)),

    which implies that we simply add 1 to a if x = 1 and add 1 to b if x = 0.

    This skews the distribution towards larger values of z if x = 1, and, vice versa, towards smaller values of z if x = 0.

    Osvaldo Simeone ML4Engineers 20 / 149

  • Beta-Bernoulli Model

    Suppose that we make two conditionally independent observations x1 and x2 from this model, so that the joint distribution is

    p(x1, x2, z) = Beta(z|a, b) · Bern(x1|z) · Bern(x2|z).

    We can write the posterior as

    p(z|x1, x2) ∝ ( Beta(z|a, b) · Bern(x1|z) ) · Bern(x2|z)
               ∝ Beta(z|a + 1(x1 = 1), b + 1(x1 = 0)) · Bern(x2|z),

    where the first factor equals p(z|x1).

    This implies that we can treat p(z|x1) as the prior as we make another independent observation x2 ∼ p(x2|z). It follows that we have the posterior

    p(z|x1, x2) = Beta( z | a + Σ_{j=1}^{2} 1(xj = 1), b + Σ_{j=1}^{2} 1(xj = 0) ).

    Osvaldo Simeone ML4Engineers 21 / 149

  • Beta-Bernoulli Model

    By repeating this approach recursively L times, it follows that the posterior p(z|x1, ..., xL) for the joint distribution

    p(x1, ..., xL, z) = Beta(z|a, b) ∏_{j=1}^{L} Bern(xj|z)

    of L conditionally independent observations x1, ..., xL and latent variable z is given by

    p(z|x1, ..., xL) = Beta( z | a + Σ_{j=1}^{L} 1(xj = 1), b + Σ_{j=1}^{L} 1(xj = 0) ) = Beta( z | a + L[1], b + L[0] ),

    where L[1] := Σ_{j=1}^{L} 1(xj = 1) and L[0] := Σ_{j=1}^{L} 1(xj = 0).

    We can then interpret a and b as “pseudocounts”, i.e., as prior observations of 1’s and 0’s made before we measure x1, ..., xL.

    Osvaldo Simeone ML4Engineers 22 / 149
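The following Python sketch illustrates the pseudocount update numerically: it computes the conjugate posterior Beta(a + L[1], b + L[0]) and checks it against a brute-force normalization of prior × likelihood on a grid. The prior parameters and the observation sequence are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Conjugate beta-Bernoulli update: after L observations, the posterior is
# Beta(a + L[1], b + L[0]). Prior parameters and data below are illustrative.
a, b = 2.0, 3.0
x = np.array([1, 0, 1, 1, 0, 1, 1])

a_post = a + np.sum(x == 1)          # a + L[1]
b_post = b + np.sum(x == 0)          # b + L[0]

# Sanity check: normalize prior x likelihood on a grid and compare with Beta(a_post, b_post).
z = np.linspace(1e-3, 1 - 1e-3, 2000)
unnorm = stats.beta(a, b).pdf(z) * z ** np.sum(x == 1) * (1 - z) ** np.sum(x == 0)
numeric = unnorm / (unnorm.sum() * (z[1] - z[0]))            # simple Riemann normalization
print(np.max(np.abs(numeric - stats.beta(a_post, b_post).pdf(z))))   # close to 0
```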

  • Example

    On an online shopping platform, there are two sellers offering a product at the same price:

    I The first has 30 positive reviews and 0 negative reviews, while the second has 90 positive reviews and 10 negative reviews. Which one to choose?

    Osvaldo Simeone ML4Engineers 23 / 149

  • Example

    Let us introduce a latent variable z ∼ Beta(z|a, b) that describes the probability of a vendor receiving a positive review.

    Then, we model the reviews x1, ..., xL as independent binary observations, as seen above, with xi = 1 representing a positive review.

    The posterior p(z|x) is hence given as p(z|x) = Beta(z|a + L[1], b + L[0]), where L[1] counts the number of positive reviews and L[0] counts the number of negative reviews.

    We have L[1] = 30 with L = 30 for the first seller, and L[1] = 90 with L = 100 for the second.

    Osvaldo Simeone ML4Engineers 24 / 149

  • Example

    When the prior is sufficiently strong, i.e., with a + b large enough and hence the variance small enough, the posterior for the second vendor has a larger mode. This should suggest caution in choosing the first vendor based on the available evidence: unless the prior is weak, and hence the posterior depends mostly on the data likelihood, the second vendor appears to be preferable.

    [Figure: posterior densities Beta(z|a + L[1], b + L[0]) for the two sellers, plotted on the interval z ∈ [0.5, 1].]
    Osvaldo Simeone ML4Engineers 25 / 149
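A minimal Python sketch of this comparison is given below; the specific prior strengths a = b are illustrative assumptions. Consistently with the discussion above, a weak prior gives the first seller the larger posterior mode, while stronger priors favor the second seller.

```python
def beta_mode(a, b):
    """Mode of Beta(a, b), valid for a, b > 1."""
    return (a - 1.0) / (a + b - 2.0)

# Posterior modes for the two sellers under priors of increasing strength
# (the specific values a = b are illustrative).
for a in (2.0, 10.0, 50.0):
    b = a
    mode1 = beta_mode(a + 30, b + 0)     # seller 1: 30 positive, 0 negative reviews
    mode2 = beta_mode(a + 90, b + 10)    # seller 2: 90 positive, 10 negative reviews
    print(f"a = b = {a:>4}: mode seller 1 = {mode1:.3f}, mode seller 2 = {mode2:.3f}")
```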

  • Conjugate Exponential Family

    As mentioned, for all distributions in the exponential family, there exists a conjugate prior on the model parameters. The table below provides some examples. The Dirichlet-categorical example is detailed in the Appendix. We used the notation Θ′ = Θ0 + LΘ for the last row.

    likelihood p(x|z) = ∏_{j=1}^{L} p(xj|z)  | conjugate prior p(z) | posterior p(z|x)
    Bern(xj|z), with z = E[xj]               | Beta(a, b)           | Beta(a + L[1], b + L[0])
    Cat(xj|z), with zk = E[x_{k,j}^{OH}]     | Dirichlet(α)         | Dirichlet( α + Σ_{j=1}^{L} x_j^{OH} )
    N(xj|z, Θ^{−1}), with fixed Θ            | N(µ0, Θ0^{−1})       | N( (Θ′)^{−1}( Θ0 µ0 + Θ Σ_{j=1}^{L} xj ), (Θ′)^{−1} )

    Osvaldo Simeone ML4Engineers 26 / 149

  • Conjugate Exponential Family

    To conclude this section, it should be mentioned that there are also conjugate models with a likelihood not in the exponential family.

    An example is the uniform likelihood distribution p(x|z) = U(x|[0, z]) with Pareto prior p(z). The posterior p(z|x) is also a Pareto distribution. Recall that the uniform distribution is not in the exponential family.

    Osvaldo Simeone ML4Engineers 27 / 149

  • Variational Inference

    Osvaldo Simeone ML4Engineers 28 / 149

  • Bayesian Inference as Optimization

    As we have seen in Chapter 7, the posterior distribution can be obtained as the solution of the minimization of the free energy, i.e.,

    min_{q(z|x)} fe(x, q(z|x)),

    where the optimization is over all possible conditional distributions q(z|x) and the free energy is defined as

    fe(x, q(z|x)) = E_{z∼q(z|x)}[ −log p(x, z) ] − E_{z∼q(z|x)}[ −log q(z|x) ],

    where the first term is the supervised log-loss ℓ_s(x, q(z|x)) and the second term is the entropy H(q(z|x)) of the variational posterior.

    Osvaldo Simeone ML4Engineers 29 / 149

  • Variational Inference

    Since the conditional distribution q(z|x) is subject to optimization, we will refer to it as the variational posterior in this chapter.

    The problem of minimizing the free energy is clearly just as complex as that of computing the posterior p(z|x), since the posterior is the only solution to this problem.

    Variational inference (VI) obtains an approximation of the posterior p(z|x) by

    I restricting the space Q of variational distributions q(z|x) over which the optimization of the free energy is carried out;

    I and adopting a specific, typically local, optimization algorithm to tackle the resulting problem

    min_{q(z|x)∈Q} fe(x, q(z|x)).

    Osvaldo Simeone ML4Engineers 30 / 149

  • Variational Inference

    The space of feasible distributions Q can be restricted by imposing:
    I a factorization of the variational posterior;
    I a parametrization of the variational posterior;
    I or both constraints.

    Standard algorithms used for the minimization of the free energy include
    I SGD;
    I coordinate descent;
    I and combinations of both methods.

    Osvaldo Simeone ML4Engineers 31 / 149

  • Mean-Field Variational Inference

    Osvaldo Simeone ML4Engineers 32 / 149

  • Mean-Field VI

    We first discuss VI based on the factorization of the variational posterior.

    Denote as z = [z1, ..., zM]^T the M × 1 latent vector. The simplest factorization of the variational posterior is known as mean field, and it assumes that the variational posterior can be written as the product

    q(z|x) = ∏_{m=1}^{M} q_m(zm|x),

    where factor q_m(zm|x) is the variational posterior distribution for latent rv zm.

    Mean-field VI makes the strong assumption that the latent variables are independent given x = x.

    This assumption can drastically reduce the complexity of Bayesian inference, at the cost of causing an irreducible bias when the true posterior distribution p(z|x) does not factorize as assumed.

    Osvaldo Simeone ML4Engineers 33 / 149

  • Mean-Field VI

    To see why mean-field VI can reduce complexity, consider the case in which the latent rvs zm ∈ {0, 1} are binary. The unconstrained variational posterior would have 2^M − 1 parameters to optimize, i.e., all the probabilities q(z|x) for any fixed value x. The subtraction of 1 accounts for the constraint that the probabilities must sum to 1.

    For this case, mean-field VI restricts the optimization to the subset Q of distributions of the form

    q(z|x) = ∏_{m=1}^{M} Bern(zm|µm),

    which depends on the M parameters µ = [µ1, ..., µM]^T ∈ [0, 1]^M, yielding an exponential reduction in complexity.

    Note that the parameters µ generally depend on x, although this is not explicitly indicated by the notation for simplicity.

    Osvaldo Simeone ML4Engineers 34 / 149

  • Mean-Field VI

    In mean-field VI, the optimization of the free energy is typically done by means of coordinate descent.

    At each iteration i = 1, 2, ..., we have the current iterate

    q^{(i−1)}(z|x) = ∏_{m=1}^{M} q_m^{(i−1)}(zm|x),

    where q_m^{(i−1)}(zm|x) represents the current iterate for the mth factor.

    Then, we pick one of the factors, say m ∈ {1, ..., M}. A typical approach is to select all factors successively, one by one, across M iterations.

    With this choice, one set of M iterations, including updates to all M factors, is typically referred to as a training epoch.

    Osvaldo Simeone ML4Engineers 35 / 149

  • Mean-Field VI

    At each iteration, we solve the problem of minimizing the free energy over q_m(zm|x) when all the other factors are fixed, i.e.,

    min_{q_m(zm|x)} fe( x, q(z|x) = q_m(zm|x) · q_{−m}^{(i−1)}(z_{−m}|x) ),

    where we have denoted q_{−m}^{(i−1)}(z_{−m}|x) := ∏_{m′≠m} q_{m′}^{(i−1)}(z_{m′}|x).

    Denoting the optimal solution as q_m^*(zm|x), we set q_m^{(i)}(zm|x) = q_m^*(zm|x) and q_{m′}^{(i)}(z_{m′}|x) = q_{m′}^{(i−1)}(z_{m′}|x) for all m′ ≠ m, and we move on to the next iteration.

    Osvaldo Simeone ML4Engineers 36 / 149

  • Mean-Field VI

    To fully specify mean-field VI, we finally need to describe how to obtain q_m^*(zm|x). As we show in the Appendix, this can be computed as

    q_m^*(zm|x) ∝ exp( E_{z_{−m}}[ log p(x, zm, z_{−m}) ] ),

    where z_{−m} ∼ q_{−m}^{(i−1)}(z_{−m}|x) and we have made explicit the dependence of the supervised log-loss −log p(x, z) on both zm and z_{−m}, with the latter including all latent variable values except for zm.

    So, computing q_m^*(zm|x) requires averaging the supervised log-loss over all other variables z_{−m} and then normalizing.

    The latter step is easy if rv zm is low-dimensional or if it takes a discrete and small number of values.

    In contrast, the average over z_{−m} is generally problematic, unless the log-loss −log p(x, zm, z_{−m}) can be written as a sum of terms, each dependent on a small subset of variables.

    Osvaldo Simeone ML4Engineers 37 / 149

  • Example

    In the Ising model, the latent variables {zm}_{m=1}^{M} are bipolar, i.e., zm ∈ {−1, +1}, for m = 1, ..., M; and there are M observations {xm}_{m=1}^{M}, also bipolar, i.e., xm ∈ {−1, +1}. Each observation xm is associated with the corresponding latent variable zm.

    The correlation among the latent variables is described by a graph that has one node for each rv zm and an edge between any two “correlated” rvs.

    This is an example of a probabilistic graphical model formalism known as Markov networks.

    Osvaldo Simeone ML4Engineers 38 / 149

  • Example

    As an example, consider a setting in which the observed variables xm ∈ {−1, +1} correspond to the M pixel values of a black-and-white noisy image, with −1 corresponding to a white pixel and +1 to a black pixel. The latent variable zm represents the noiseless value of the mth pixel.

    Adjacent pixels are expected to be correlated. We can model this by adding an edge between a pixel and the four immediately adjacent pixels in the four directions up, down, left, and right.

    Osvaldo Simeone ML4Engineers 39 / 149

  • Example

    Define as E ⊆ {1, ..., M} × {1, ..., M} the set of edges (m, m′) in the graph between latent variables zm and zm′. In the Ising model, the joint distribution factorizes as

    p(x, z) ∝ exp( η1 Σ_{(m,m′)∈E} zm zm′ + η2 Σ_{m=1}^{M} zm xm )
            = ∏_{(m,m′)∈E} exp(η1 zm zm′) · ∏_{m=1}^{M} exp(η2 zm xm).

    From this definition, a large (natural) parameter η1 > 0 yields a large probability when zm and zm′ with (m, m′) ∈ E are equal; and, similarly, a large η2 > 0 favors configurations in which zm = xm, that is, with low “observation noise”.

    Parameters η1 and η2 are assumed to be fixed and are not subject to training.

    It can be seen that the true posterior p(z|x) cannot be written in the product form assumed by mean-field VI. Therefore, mean-field VI can only obtain an approximation of the true posterior.

    Osvaldo Simeone ML4Engineers 40 / 149

  • Example

    At each iteration, mean-field VI obtains

    q_m^*(zm|x) ∝ exp( η1 zm Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm zm ),

    where we define µ_{m′}^{(i−1)} := E_{z_{m′}∼q_{m′}^{(i−1)}(z_{m′}|x)}[z_{m′}].

    Note that we have

    µ_{m′}^{(i−1)} = q_{m′}^{(i−1)}(z_{m′} = 1|x) − q_{m′}^{(i−1)}(z_{m′} = −1|x) = 2 q_{m′}^{(i−1)}(z_{m′} = 1|x) − 1.

    Imposing normalization, so that the condition q_m^*(zm = 1|x) + q_m^*(zm = −1|x) = 1 holds, we obtain the normalization constant as

    exp( η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm ) + exp( −η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} − η2 xm ),

    which yields

    q_m^*(zm = 1|x) = σ( 2( η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm ) ).

    Osvaldo Simeone ML4Engineers 41 / 149
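The update just derived can be implemented in a few lines. The Python sketch below runs mean-field VI on a small image under the Ising model; the image size, number of epochs, and value of η2 are illustrative assumptions (η1 = 0.15 matches the numerical example that follows).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mean_field_ising(x, eta1, eta2, num_epochs=20):
    """Mean-field VI for the Ising denoising model (illustrative sketch).

    x: 2D array of +/-1 observed pixels; returns mu[m] = E_q[z_m] for each pixel.
    """
    H, W = x.shape
    mu = np.zeros((H, W))                # initial means mu_m = 0, i.e., q_m(z_m = 1|x) = 0.5
    for _ in range(num_epochs):
        for r in range(H):               # one epoch = one coordinate update per latent variable
            for c in range(W):
                nbr = 0.0                # sum of current means over the up/down/left/right neighbours
                if r > 0:     nbr += mu[r - 1, c]
                if r < H - 1: nbr += mu[r + 1, c]
                if c > 0:     nbr += mu[r, c - 1]
                if c < W - 1: nbr += mu[r, c + 1]
                q1 = sigmoid(2.0 * (eta1 * nbr + eta2 * x[r, c]))   # q_m(z_m = 1|x)
                mu[r, c] = 2.0 * q1 - 1.0
    return mu

# usage on a random noisy 4x4 image (hypothetical parameters)
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(4, 4))
mu = mean_field_ising(x, eta1=0.15, eta2=1.0)
```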

  • Example

    For a numerical example, consider a 4 × 4 binary image z observed as a noisy matrix x, where the joint distribution of x and z is given by the Ising model.

    Note that, according to this model, the observed matrix x is such that each pixel of the original matrix z is flipped independently with probability σ(−2η2), i.e., we have p(x|z) = ∏_m p(xm|zm) with p(xm ≠ zm|zm) = σ(−2η2).

    In this small example, it is easy to generate an image x distributed according to the model, as well as to compute the exact posterior p(z|x) by enumeration of all possible images z.

    The KL divergence KL(p(z|x)||q(z|x)) between the true posterior and the mean-field VI approximation obtained at the end of each epoch – with one epoch making one iteration across all variables – is shown in the next figure for η1 = 0.15 and various values of η2.

    Osvaldo Simeone ML4Engineers 42 / 149

  • Example

    As η2 increases, the posterior distribution tends to become deterministic, since x is an increasingly accurate measurement of z. As a result, the final mean-field approximation is more faithful to the real posterior, since a product distribution can capture a deterministic pmf. For smaller values of η2, however, the bias due to the mean-field assumption yields a large floor on the achievable KL divergence.

    [Figure: KL(p(z|x)||q(z|x)) versus training epoch, on a logarithmic scale, for η1 = 0.15 and various values of η2.]

    Osvaldo Simeone ML4Engineers 43 / 149

  • VI Based on the Factorization of the Variational Posterior

    Mean-field VI is a message passing algorithm in the sense that, at each iteration, the scheduled node zm passes its updated variational posterior q_m^*(zm|x) to the other nodes for further processing.

    Mean-field VI can be generalized by considering factorizations over the latent variables in which the scopes of the factors, i.e., the sets of latent rvs they depend on, possibly overlap.

    This yields more complex message passing algorithms, and is a vast area of research.

    Osvaldo Simeone ML4Engineers 44 / 149

  • Parametric Variational Inference

    Osvaldo Simeone ML4Engineers 45 / 149

  • Parametric Variational Inference

    In parametric VI, the optimization domain Q consists of variational posteriors q(z|x, ϕ) that are parametrized by a vector ϕ. Vector ϕ is referred to as the variational parameter vector.

    With parametric VI, we are therefore interested in minimizing the free energy

    min_ϕ fe(x, q(z|x, ϕ))

    with respect to the variational parameters ϕ.

    Note that the optimization is implicitly constrained within a given feasibility domain. For instance, if ϕ is a vector of probabilities for binary variables, as in the example above, all entries should be in the interval [0, 1].

    It is also important to emphasize that the optimization is done here separately for each value x. Therefore, the parameter ϕ that solves the VI problem above is a function of x, although this is not made explicit by the notation for simplicity.

    We will discuss later how the optimization can be “amortized” across all values x.

    Osvaldo Simeone ML4Engineers 46 / 149

  • Parametric Variational Inference

    A standard solution for parametric VI is to use GD. To implement GD, we need to compute the gradient of the free energy, ∇_ϕ fe(x, q(z|x, ϕ)).

    The complexity of this computation depends on the specific choice of the joint distribution p(x, z) and of the variational posterior family q(z|x, ϕ).

    We now first review a general-purpose method known as black-box VI, which applies broadly to most choices of p(x, z) and of the variational posterior family q(z|x, ϕ); we will then study a potentially more efficient solution that assumes that the variational posterior q(z|x, ϕ) has a specific “reparametrizable” form.

    Osvaldo Simeone ML4Engineers 47 / 149

  • Black-Box Variational Inference

    Osvaldo Simeone ML4Engineers 48 / 149

  • Stochastic Optimization Over Averaging Distribution

    To start, let us consider a related problem, which we will connect back to VI later on.

    The problem of interest is the stochastic optimization of an average with respect to the averaging distribution. In mathematical terms, the problem is defined as the minimization

    min_ϕ E_{z∼q(z|ϕ)}[g(z)]

    for some scalar function g(z) of a vector z.

    The minimization is hence over the parameter vector ϕ of the distribution over which we compute the average.

    Note that, as we will detail later, the problem of minimizing the free energy over the variational parameters has a similar form.

    We will be specifically interested in GD solutions, which require computing the gradient

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Osvaldo Simeone ML4Engineers 49 / 149

  • Example

    As a simple example, consider the problem above with g(z) = (1/2)(z1^2 + 5 z2^2) and q(z|ϕ) = N(z|ϕ, I2), where ϕ ∈ R^2 and z = [z1, z2]^T.

    In this case, we can directly write the problem as

    min_ϕ { E_{z∼q(z|ϕ)}[g(z)] = (1/2) E_{z∼q(z|ϕ)}[z1^2 + 5 z2^2] = (1/2)(ϕ1^2 + 5 ϕ2^2 + 6) }.

    The gradient can also be directly computed as

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] = [ϕ1, 5 ϕ2]^T,

    and the optimal solution is given as ϕ1 = ϕ2 = 0.

    Osvaldo Simeone ML4Engineers 50 / 149

  • Stochastic Optimization Over Averaging Distribution

    In general, it is not possible to compute the objective E_{z∼q(z|ϕ)}[g(z)], or its gradient ∇_ϕ E_{z∼q(z|ϕ)}[g(z)], in closed form. We assume that we can, however, evaluate g(z) for any fixed given z at a reasonable computational cost.

    As an example, consider the case in which g(z) is the output of a neural network with input given by z: computing g(z) is feasible, but averaging over all possible values of the input as in E_{z∼q(z|ϕ)}[g(z)] is generally not possible.

    How can we obtain an estimate of ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] by evaluating g(z) on just a few values of z?

    Osvaldo Simeone ML4Engineers 51 / 149

  • REINFORCE Gradient

    To address this question, we first note that we cannot just draw one, or more, samples z ∼ q(z|ϕ) and directly approximate the expectation E_{z∼q(z|ϕ)}[g(z)] with an empirical average.

    In fact, the parameters ϕ over which we wish to differentiate determine the sampling distribution q(z|ϕ). Therefore, the gradient needs to capture by how much the distribution q(z|ϕ) – and, through it, the generated samples – varies with ϕ.

    A solution is given by the REINFORCE gradient, which is an important element in reinforcement learning algorithms.

    The connection to reinforcement learning is as follows: in reinforcement learning, an agent takes an action z ∼ q(z|ϕ) by following policy q(z|ϕ). The goal is to optimize the policy parameters ϕ such that the average loss E_{z∼q(z|ϕ)}[g(z)] is minimized.

    Osvaldo Simeone ML4Engineers 52 / 149

  • REINFORCE Gradient

    The derivation of the REINFORCE gradient starts with the following equality:

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] = E_{z∼q(z|ϕ)}[ g(z) · ∇_ϕ log q(z|ϕ) ],

    where g(z) is the loss and ∇_ϕ log q(z|ϕ) is the score vector for q(z|ϕ).

    The proof of this equality follows directly from the trivial identity

    ∇_ϕ log q(z|ϕ) = ∇_ϕ q(z|ϕ) / q(z|ϕ)

    and is detailed in the Appendix.

    In words, this formula says that the gradient is the weighted average, over q(z|ϕ), of the score vector ∇_ϕ log q(z|ϕ), with weights given by the loss function g(z).

    Osvaldo Simeone ML4Engineers 53 / 149

  • REINFORCE Gradient

    To interpret this formula, recall that the score vector ∇_ϕ log q(z|ϕ) points in the direction that maximally increases the probability of the value z in the space of parameters ϕ (see Chapter 6).

    With this interpretation in mind, the REINFORCE gradient, when applied to update ϕ, “pushes” the resulting updated distribution q(z|ϕ) towards values z that have a large value g(z). The key advantage of this formula is that it is expressed as an expectation over q(z|ϕ), and hence it can be estimated by drawing samples from q(z|ϕ).

    Osvaldo Simeone ML4Engineers 54 / 149

  • REINFORCE Gradient

    Accordingly, the REINFORCE gradient algorithm is defined by the following procedure:

    I draw S i.i.d. samples zs ∼ q(z|ϕ) for s = 1, ..., S;
    I estimate the gradient as ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] ≃ ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] with

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (1/S) Σ_{s=1}^{S} (g(zs) − cs) · ∇_ϕ log q(zs|ϕ),

    where cs is an arbitrary constant, not dependent on zs, known as the baseline.

    Osvaldo Simeone ML4Engineers 55 / 149

  • REINFORCE Gradient

    Importantly, the REINFORCE gradient only requires that the function g(z) and the score vector ∇_ϕ log q(z|ϕ) be computable for a few, here S, values of z.

    In particular, it does not require the derivatives of the function g(z).

    The REINFORCE estimator can be easily seen to be unbiased for any constant c by using the fact that the score vector has zero mean, i.e., E_{z∼q(z|ϕ)}[∇_ϕ log q(z|ϕ)] = 0 (see Chapter 9). In particular, we have

    E_{z∼q(z|ϕ)}[ (g(z) − c) · ∇_ϕ log q(z|ϕ) ] = ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    The baseline cs, which may depend on the samples zs′ with s′ ≠ s, is useful to reduce the variance of the REINFORCE estimator.

    In practice, as we will see below, the baseline cs is typically selected as an average of the values g(z) observed at prior iterations.

    It is also common to set S = 1.

    Osvaldo Simeone ML4Engineers 56 / 149

  • Example

    Continuing the example above with g(z) = (1/2)(z1^2 + 5 z2^2) and q(z|ϕ) = N(z|ϕ, I2), we can obtain an estimate of the gradient, with S = 1, as:

    I draw one sample z ∼ q(z|ϕ);
    I estimate the gradient as

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (g(z) − c) · ∇_ϕ log q(z|ϕ) = (g(z) − c) · (z − ϕ).

    Note that we have used the general formula for the score vector of an exponential-family distribution (see Chapter 9).

    So, the gradient estimate points in the direction of the randomly selected z (relative to ϕ) to an extent that is proportional to the value g(z) − c.

    Osvaldo Simeone ML4Engineers 57 / 149

  • REINFORCE Gradient

    Let us summarize what we have learned so far by returning to the original problem

    min_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Setting S = 1 for simplicity, a GD method that leverages the REINFORCE gradient algorithm would work as follows:

    initialize ϕ^{(1)} (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample z^{(i)} ∼ q(z|ϕ^{(i)});
    I given a learning rate γ^{(i)} > 0, obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} (c^{(i)} − g(z^{(i)})) · ∇_ϕ log q(z^{(i)}|ϕ^{(i)});

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Note that this is a form of SGD in which the randomness is not over the selection of the data but over the sampling of the latent rvs.

    Osvaldo Simeone ML4Engineers 58 / 149
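A minimal Python sketch of this SGD procedure for the running quadratic example is given below. The learning rate, number of iterations, and the exponential-moving-average baseline (used here in place of the plain average of past losses discussed below) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    return 0.5 * (z[0] ** 2 + 5.0 * z[1] ** 2)   # loss from the running example

# SGD on min_phi E_{z ~ N(phi, I)}[g(z)] using the REINFORCE gradient with S = 1.
phi = np.array([0.5, -1.0])
gamma, baseline = 0.02, g(np.array([0.5, -1.0]))
for i in range(5000):
    z = phi + rng.standard_normal(2)             # z ~ N(phi, I)
    score = z - phi                              # score vector of N(z|phi, I)
    phi = phi + gamma * (baseline - g(z)) * score
    baseline = 0.9 * baseline + 0.1 * g(z)       # running estimate of the average loss
print("final phi:", phi)                         # drifts towards the minimizer [0, 0]
```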

  • REINFORCE Gradient

    The REINFORCE gradient algorithm can be thought of as a perturbation-based optimization scheme:

    I Given the current parameter vector ϕ^{(i)}, one or more samples z^{(i)} ∼ q(z|ϕ^{(i)}) are generated to “explore” values of the latent variables z to which the model q(z|ϕ^{(i)}) assigns a sufficiently large probability;

    I samples that yield a small value of the objective g(z^{(i)}) − c^{(i)} are “reinforced”, in the sense that the updated probability distribution q(z|ϕ^{(i+1)}) assigns a large probability to such values of z.

    Being a perturbation-based scheme, it generally suffers from the curse of dimensionality, since exploration of the space of the variables z generally requires a larger number of samples as the dimension of z increases.

    This translates into a large variance for the REINFORCE gradient as the dimension of the vector z increases. To see this, note that, by the chain rule, the score vector can be written as the sum of terms ∇_ϕ log q(z^{(i)}|ϕ^{(i)}) = Σ_{m=1}^{M} ∇_ϕ log q(z_m^{(i)}|ϕ^{(i)}) for an M-dimensional vector z^{(i)} = [z_1^{(i)}, ..., z_M^{(i)}]^T.

    Osvaldo Simeone ML4Engineers 59 / 149

  • REINFORCE Gradient

    How to select the baseline c^{(i)}?

    From the update formula

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} (c^{(i)} − g(z^{(i)})) · ∇_ϕ log q(z^{(i)}|ϕ^{(i)}),

    we see that the parameter ϕ is updated so as to increase or decrease the probability of the sampled value z^{(i)} depending on the sign of the difference (c^{(i)} − g(z^{(i)})).

    Intuitively, we would like this difference to be positive – so that we increase the probability of z^{(i)} – when g(z^{(i)}) is smaller than the loss g(z^{(i−1)}) obtained at the previous iterate, or than some average (1/L) Σ_{j=1}^{L} g(z^{(i−j)}) over L prior iterates.

    This suggests setting c^{(i)} = (1/L) Σ_{j=1}^{L} g(z^{(i−j)}).

    In the Appendix, we derive an optimized choice for c^{(i)}, which leads to a different formula.

    Osvaldo Simeone ML4Engineers 60 / 149

  • Example

    Continuing the example above, the figure shows the evolution without a baseline (c^{(i)} = 0) and with a baseline with L = 1, over 1000 iterations, with γ^{(i)} = 0.2, S = 1, and σ = 0.1, for one realization of the updates. The initialization is ϕ^{(0)} = [0.5, −1]^T. Note the reduced “variance” of the updates with the baseline.

    Osvaldo Simeone ML4Engineers 61 / 149

  • Black-Box VI

    How does the REINFORCE gradient for stochastic optimization over an averaging distribution relate to VI?

    Fix a value x. The key observation is that the objective of black-box VI, that is, the free energy, can be written as

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) ] = E_{z∼q(z|x,ϕ)}[ g(z|ϕ) ],

    where we defined g(z|ϕ) := log( q(z|x, ϕ) / p(x, z) ).

    Therefore, the problem of minimizing the free energy is exactly in the form of an optimization over an averaging distribution, with the caveat that the function being averaged also depends on the parameter ϕ. Note again that this optimization is done separately for each value of x.

    As shown in the Appendix, this new aspect does not change the REINFORCE gradient formula, which reads

    ∇_ϕ fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) · ∇_ϕ log q(z|x, ϕ) ].

    Osvaldo Simeone ML4Engineers 62 / 149

  • Black-Box VI

    We can therefore use the REINFORCE gradient to estimate the gradient ∇_ϕ fe(x, q(z|x, ϕ)).

    This leads to the following procedure for parametric VI, which applies to any choice of joint distribution p(x, z) and variational distribution class q(z|x, ϕ), as long as the score vector ∇_ϕ log q(z|x, ϕ) is computable.

    Recall again that x is fixed here.

    Osvaldo Simeone ML4Engineers 63 / 149

  • Black-Box VI

    initialize ϕ^{(1)} (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample z^{(i)} ∼ q(z|x, ϕ^{(i)});
    I obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} ( c^{(i)} − log( q(z^{(i)}|x, ϕ^{(i)}) / p(x, z^{(i)}) ) ) · ∇_ϕ log q(z^{(i)}|x, ϕ^{(i)});

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^{(i)} = (1/L) Σ_{j=1}^{L} log( q(z^{(i−j)}|x, ϕ^{(i−j)}) / p(x, z^{(i−j)}) ).

    Osvaldo Simeone ML4Engineers 64 / 149

  • Example

    Consider the following joint distribution p(x, z) = p(z)p(x|z):

    z ∼ p(z) = Beta(z|a, b),
    x|z = z ∼ p(x|z) = Exp(x|z) = (1/z) exp(−x/z) 1(x ≥ 0).

    Note that this model is not conjugate, although both the prior and the likelihood are distributions in the exponential family, and that the true posterior is given as

    p(z|x) = Beta(z|a, b) × Exp(x|z) / ∫ Beta(z′|a, b) × Exp(x|z′) dz′.

    Recall also that a beta distribution Beta(z|a, b) is supported on the interval [0, 1], and that a ≥ 1 determines how much the distribution is concentrated towards 1, while b ≥ 1 plays the same role for the other end of the support.

    Osvaldo Simeone ML4Engineers 65 / 149

  • Example

    Let us apply black-box VI by assuming a beta posterior Beta(z|ϕ1, ϕ2) for some fixed value x. We can obtain the natural parameters and mean parameters from the relevant tables concerning exponential-family distributions.

    Furthermore, by the general formula for the score vector for the exponential family, we have

    ∇_ϕ log Beta(z|ϕ1, ϕ2) = s(z) − µ = [ log z, log(1 − z) ]^T − [ ψ(ϕ1) − ψ(ϕ1 + ϕ2), ψ(ϕ2) − ψ(ϕ1 + ϕ2) ]^T,

    with digamma function ψ (psi in MATLAB).

    Osvaldo Simeone ML4Engineers 66 / 149
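Putting the last two slides together, the following Python sketch runs black-box VI for the beta-exponential model with a beta variational posterior. The initialization of (ϕ1, ϕ2) is an assumption; the step size γ = 0.03, S = 30, and zero baseline mirror the settings quoted for the a = b = 1 case on the next slide.

```python
import numpy as np
from scipy import stats
from scipy.special import digamma

rng = np.random.default_rng(0)

# Beta-exponential model of the example: z ~ Beta(a, b), x|z ~ Exp with mean z.
a, b, x = 1.0, 1.0, 1.0

def log_joint(z):
    return stats.beta(a, b).logpdf(z) - np.log(z) - x / z

phi = np.array([2.0, 2.0])          # initial variational parameters (assumed)
gamma, S = 0.03, 30                 # settings quoted for the a = b = 1 curve, with c = 0
for i in range(2000):
    z = rng.beta(phi[0], phi[1], size=S)
    weights = stats.beta(phi[0], phi[1]).logpdf(z) - log_joint(z)   # log(q/p)
    mu = np.array([digamma(phi[0]), digamma(phi[1])]) - digamma(phi.sum())
    scores = np.vstack([np.log(z), np.log(1.0 - z)]) - mu[:, None]  # score vectors
    grad = (scores * weights).mean(axis=1)                          # REINFORCE estimate
    phi = np.maximum(phi - gamma * grad, 1e-2)                      # descent step, keep phi > 0
print("fitted Beta(phi1, phi2):", phi)
```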

  • Example

    The quality of the approximation obtained by black-box VI depends on how well a beta distribution can describe the true posterior distribution (in the figure, we set x = 1, 2000 iterations, S = 30, c^{(i)} = 0, and γ^{(i)} = 0.03 for a = b = 1, γ^{(i)} = 0.1 for a = b = 10).

    Osvaldo Simeone ML4Engineers 67 / 149

  • Reparametrization-Based Variational Inference

    Osvaldo Simeone ML4Engineers 68 / 149

  • Reparametrization Trick

    To introduce the idea of reparametrization, let us start again by considering the minimization over the averaging distribution

    min_ϕ E_{z∼q(z|ϕ)}[g(z)]

    for some function g(z).

    As we have seen, we are interested in computing the gradient

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Osvaldo Simeone ML4Engineers 69 / 149

  • Reparametrization Trick

    The REINFORCE gradient only requires that the score vector ∇_ϕ log q(z|ϕ) be computable.

    In contrast, the reparametrization trick makes the following assumption on q(z|ϕ):

    I the rv z ∼ q(z|ϕ) can be generated by
    F generating an auxiliary variable e ∼ q(e) with a distribution q(e) that does not depend on ϕ;
    F and setting z = r(e|ϕ) for some function r(e|ϕ) that is differentiable in ϕ for every fixed value of e.

    Osvaldo Simeone ML4Engineers 70 / 149

  • Reparametrization Trick

    A typical example, but not the only one, is the Gaussian distribution q(z|ϕ) = N(z|ν, Diag(σ^2)), for which the vector of parameters is the 2M × 1 vector ϕ = (ν, σ) containing the M × 1 mean vector ν and the vector of standard deviations σ. The notation σ^2 represents the M × 1 vector of variances.

    In fact, we can generate a sample z ∼ N(z|ν, Diag(σ^2)) as follows:

    I generate an auxiliary variable

    e ∼ q(e) = N(e|0, I)

    I and set

    z = r(e|ϕ) = ν + σ ⊙ e = ν + Diag(σ)e,

    where ⊙ denotes the element-wise product.

    Note that the function r(e|ϕ) is linear, and hence differentiable, in ϕ = (ν, σ).

    Osvaldo Simeone ML4Engineers 71 / 149

  • Reparametrization Trick

    We note that any scalar variable z ∼ q(z|ϕ) can be generated as follows:

    I generate an auxiliary variable e ∼ U(e|[0, 1]);
    I and set z = F^{−1}(e|ϕ), where F(z|ϕ) = Pr[z ≤ z] is the cumulative distribution function of the rv z ∼ q(z|ϕ).

    The function F^{−1}(e|ϕ) may, however, not be differentiable, e.g., for discrete variables.

    Furthermore, it may not be the most convenient representation, e.g., for the Gaussian example above.

    Osvaldo Simeone ML4Engineers 72 / 149

  • Reparametrization Trick

    Assuming that the variational posterior satisfies the reparametrization property, we can now write the optimization problem of interest as

    min_ϕ E_{e∼q(e)}[ g(r(e|ϕ)) ].

    The key point is that now the averaging distribution does not depend on the parameter vector ϕ.

    Note that, by the chain rule of differentiation, we have the gradient

    ∇_ϕ g(r(e|ϕ)) = ∇_ϕ r(e|ϕ) · ∇_z g(z)|_{z=r(e|ϕ)},

    where ∇_ϕ r(e|ϕ) is the Q × M Jacobian of the function r(e|ϕ) for fixed e, with Q being the dimension of ϕ.

    Osvaldo Simeone ML4Engineers 73 / 149

  • Reparametrization Trick

    Therefore, we can obtain an unbiased estimate of the gradient directly by following this procedure, known as the reparametrization trick:

    I draw S i.i.d. samples es ∼ q(e) for s = 1, ..., S;
    I estimate the gradient as ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] ≃ ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] with

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (1/S) Σ_{s=1}^{S} ∇_ϕ r(es|ϕ) · ∇_z g(zs)|_{zs = r(es|ϕ)}.

    Unlike the REINFORCE gradient, the reparametrization-based gradient depends on the derivative ∇_z g(z) of the function to be optimized, and it generally has a lower variance than the REINFORCE gradient.

    Osvaldo Simeone ML4Engineers 74 / 149

  • Example

    Let us apply the reparametrization trick to the Gaussian distribution q(z|ϕ) = N(z|ν, Diag(σ^2)).

    We parametrize σ as σ = exp(ϱ), applied as an element-wise function, and optimize over the vector ϱ in order to avoid gradient steps yielding negative values for the standard deviations.

    The parameter vector is hence given as ϕ = (ν, ϱ).

    Since we have r(e|ϕ) = ν + σ ⊙ e = ν + exp(ϱ) ⊙ e, the relevant Jacobians can be computed as

    ∇_ν r(e|ϕ) = I,
    ∇_ϱ r(e|ϕ) = Diag(exp(ϱ) ⊙ e).

    Osvaldo Simeone ML4Engineers 75 / 149

  • Example

    The resulting reparametrization-based estimate ∇̂_ϕ E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = [ ∇̂_ν E_{z∼N(z|ν,Diag(σ^2))}[g(z)]^T, ∇̂_ϱ E_{z∼N(z|ν,Diag(σ^2))}[g(z)]^T ]^T, stated for S = 1 for simplicity of notation, is obtained as:

    I draw a sample e ∼ N(e|0, I);
    I compute z = ν + σ ⊙ e, and set

    ∇̂_ν E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = ∇_z g(z),
    ∇̂_ϱ E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = Diag(exp(ϱ) ⊙ e) ∇_z g(z) = exp(ϱ) ⊙ e ⊙ ∇_z g(z).

    Osvaldo Simeone ML4Engineers 76 / 149

  • Example

    Continuing the example above with g(z) = z1^2 + 5 z2^2 and a Gaussian variational posterior, the figure shows the evolution of the loss g(ν), evaluated at the mean of the distribution q(z|ϕ), for both the REINFORCE gradient (with baseline, σ = 1 and γ^{(i)} = 0.3) and the reparametrization trick (with γ^{(i)} = 0.05), over 100 iterations and for 10 realizations.

    The reparametrization gradient has a much reduced variance, as is clear from the faster, and less noisy, convergence in the figure.

    [Figure: loss g(ν) versus iteration for the REINFORCE gradient (top) and for the reparametrization trick (bottom), over 10 realizations.]

    Osvaldo Simeone ML4Engineers 77 / 149
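A minimal Python sketch of reparametrization-based SGD for this example is given below; the learning rate, iteration count, and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(z):
    return z[0] ** 2 + 5.0 * z[1] ** 2           # loss from the comparison example

def grad_g(z):
    return np.array([2.0 * z[0], 10.0 * z[1]])   # its gradient, needed by the reparametrization trick

# SGD on min_{nu, rho} E_{z ~ N(nu, Diag(exp(2 rho)))}[g(z)] with z = nu + exp(rho) * e,
# e ~ N(0, I), and S = 1.
nu, rho = np.array([0.5, -1.0]), np.zeros(2)
gamma = 0.05
for i in range(100):
    e = rng.standard_normal(2)
    z = nu + np.exp(rho) * e
    nu = nu - gamma * grad_g(z)                        # Jacobian w.r.t. nu is the identity
    rho = rho - gamma * np.exp(rho) * e * grad_g(z)    # Jacobian w.r.t. rho is Diag(exp(rho) * e)
print("mean after training:", nu)                      # drifts towards [0, 0]
```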

  • Reparametrization-Based VI

    How can the reparametrization-based gradient for stochastic optimization over an averaging distribution be applied to VI?

    As we have observed, the free energy can be written as

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) ] = E_{z∼q(z|x,ϕ)}[ g(z|ϕ) ],

    and hence the function being averaged also depends on ϕ. Note again that the value of x is fixed here.

    We need to address this dependence in order to apply the reparametrization trick.

    Osvaldo Simeone ML4Engineers 78 / 149

  • Reparametrization-Based VI

    A common approach is to express the free energy as (see the Appendix of Chapter 7)

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ −log p(x|z) ] + KL( q(z|x, ϕ) || p(z) ),

    where the first term is the average log-loss of the soft predictor p(x|z) and the second term is the KL divergence between the variational posterior and the prior.

    We can then directly use the reparametrization trick for the first term, since the function being averaged, −log p(x|z), does not depend on ϕ.

    For the second term, if we choose the variational posterior q(z|x, ϕ) and the prior p(z) to be from the same class in the exponential family, the KL divergence can be computed in closed form (see Chapter 9), and so can its gradient.

    As an example, if we choose a Gaussian prior and a Gaussian variational posterior, we can estimate the gradient of the first term using the reparametrization trick, while the gradient of the second term can be computed in closed form.

    Osvaldo Simeone ML4Engineers 79 / 149

  • Reparametrization-Based VI

    To elaborate on the gradient of the second term, let us select the prior to be from a class of distributions in the exponential family with a fixed mean parameter µ_p, i.e.,

    p(z) = ExpFam(z|µ_p),

    and the variational posterior to be in the same class, with

    q(z|x, ϕ) = ExpFam(z|µ(ϕ)),

    where the mean parameter vector µ(ϕ) is a function of the variational parameter ϕ.

    We assume a minimal parametrization, and the corresponding natural parameters are defined as η_p and η(ϕ).

    Osvaldo Simeone ML4Engineers 80 / 149

  • Reparametrization-Based VI

    Using the analytical form of the KL divergence, the gradient of the KL divergence KL(q(z|µ)||p(z|µ_p)) with respect to µ can be computed as (see Appendix)

    ∇_µ KL( q(z|µ) || p(z|µ_p) ) = η − η_p,

    and hence we have

    ∇_ϕ KL( q(z|µ(ϕ)) || p(z|µ_p) ) = ∇_ϕ µ(ϕ) · ( η(ϕ) − η_p ).

    Note that, if the variational parameters ϕ are the natural parameters of the distribution q(z|x, ϕ), then, as discussed in the previous chapter, natural gradient descent would require the application of the gradient with respect to µ in the first equation. This can simplify the implementation.

    Osvaldo Simeone ML4Engineers 81 / 149

  • Example

    For the standard case of Gaussian distributions, recalling that for N(ν, Σ) the mean parameters are given as µ = [ν, Σ + νν^T], we have the gradients

    ∇_ν KL( N(ν, Σ) || N(ν_p, Σ_p) ) = Σ_p^{−1}(ν − ν_p),
    ∇_Σ KL( N(ν, Σ) || N(ν_p, Σ_p) ) = (1/2)( Σ_p^{−1} − Σ^{−1} ),

    and hence, as a special case, we have

    ∇_ν KL( N(ν, Diag(σ^2)) || N(0, I) ) = ν,
    ∇_σ KL( N(ν, Diag(σ^2)) || N(0, I) ) = σ − σ^{−1},

    where the inverse in σ^{−1} is applied element-wise.

    Based on the derivations above, with q(z|x, ϕ) = N(z|ν, Diag(σ^2)), ϕ = (ν, ϱ), and σ = exp(ϱ), as well as p(z) = N(z|0, I), we have the following reparametrization-based VI algorithm.

    Osvaldo Simeone ML4Engineers 82 / 149

  • Reparametrization-Based VI

    initialize ϕ^{(1)} = (ν^{(1)}, ϱ^{(1)}) (e.g., randomly)
    for i = 1, 2, ...

    I draw one sample e^{(i)} ∼ N(0, I);
    I compute z^{(i)} = ν^{(i)} + exp(ϱ^{(i)}) ⊙ e^{(i)}, and estimate the gradients as

    ∇̂_ν fe(x, q(z|x, ϕ^{(i)})) = ∇_z(−log p(x|z^{(i)})) + ν^{(i)},
    ∇̂_ϱ fe(x, q(z|x, ϕ^{(i)})) = exp(ϱ^{(i)}) ⊙ e^{(i)} ⊙ ∇_z(−log p(x|z^{(i)})) + exp(ϱ^{(i)}) ⊙ ( σ^{(i)} − (σ^{(i)})^{−1} );

    I obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} − γ^{(i)} ∇̂_ϕ fe(x, q(z|x, ϕ^{(i)})),

    where ∇̂_ϕ fe(x, q(z|x, ϕ^{(i)})) = [ ∇̂_ν fe(x, q(z|x, ϕ^{(i)}))^T, ∇̂_ϱ fe(x, q(z|x, ϕ^{(i)}))^T ]^T;

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Osvaldo Simeone ML4Engineers 83 / 149
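The following Python sketch applies this algorithm to the Gaussian example considered next, for which ∇_z(−log p(x|z)) = β(z − x). The learning rate, number of iterations, and initialization are illustrative assumptions that loosely mirror the example's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Model of the next example: z ~ N(0, 1), x|z ~ N(z, 1/beta),
# so that grad_z(-log p(x|z)) = beta * (z - x).
beta, x = 0.5, 1.0

nu, rho = -2.0, np.log(0.1)      # initialization (nu, rho); sigma starts at 0.1 (assumed)
gamma = 0.01
for i in range(2000):
    e = rng.standard_normal()
    sigma = np.exp(rho)
    z = nu + sigma * e
    grad_logloss = beta * (z - x)                                     # grad of -log p(x|z) at z
    grad_nu = grad_logloss + nu                                       # reparametrization + KL terms
    grad_rho = sigma * e * grad_logloss + sigma * (sigma - 1.0 / sigma)
    nu = nu - gamma * grad_nu
    rho = rho - gamma * grad_rho
print("VI posterior:   mean", nu, "std", np.exp(rho))
print("true posterior: mean", beta * x / (1 + beta), "std", (1 / (1 + beta)) ** 0.5)
```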

  • Example

    Let us consider an example for which the posterior distribution can be computed exactly, in order to validate reparametrization-based VI.

    Specifically, consider the joint distribution given by z ∼ N(0, 1) and x|z = z ∼ N(z, β^{−1}) for some precision β.

    From Chapter 3, we know that the posterior distribution is given as

    p(z|x) = N( z | βx/(1 + β), 1/(1 + β) ).

    Let us now assume a variational posterior q(z|ϕ) = N(z|ν, σ^2). We plot the KL divergence

    KL( q(z|ϕ^{(i)}) || p(z|x) ) = (1/2)[ (1 + β)(σ^{(i)})^2 − log( (1 + β)(σ^{(i)})^2 ) + (1 + β)( βx/(1 + β) − ν^{(i)} )^2 − 1 ].

    Osvaldo Simeone ML4Engineers 84 / 149

  • Example

    The dashed arrow points in the direction of increasing values of the iteration index i, showing an increasingly accurate estimate of the correct posterior (β = 0.5, x = 1, ϕ^{(0)} = [−2, 0.1]^T, γ^{(i)} = 0.01).

    [Figure: variational posterior iterates q(z|ϕ^{(i)}) approaching the true posterior (top), and KL(q(z|ϕ^{(i)})||p(z|x)) versus iteration (bottom).]

    Osvaldo Simeone ML4Engineers 85 / 149

  • Other Parametric Approximation Methods

    Osvaldo Simeone ML4Engineers 86 / 149

  • Factorized and Parametric Variational Posteriors

    We have seen above that VI can be applied on the space of factorized distributions, in which case optimization is typically done via coordinate descent; or by assuming parametric variational posteriors, in which case optimization is typically done by SGD.

    The two approaches can be combined: the free energy can be optimized over the space Q of variational posteriors that factorize as

    q(z|x, ϕ) = ∏_{m=1}^{M} q_m(zm|x, ϕm),

    where each factor q_m(zm|x, ϕm) is a parametric function.

    More general factorizations over the latent variables, in which the scopes of the factors possibly overlap, are also naturally possible.

    Minimization of the free energy can be carried out by coordinate-wise SGD, in which the optimization over each parameter vector ϕm is done iteratively via SGD.

    The derivation follows from a direct combination of the mean-field VI and parametric VI methods seen above.

    Osvaldo Simeone ML4Engineers 87 / 149

  • Laplace Approximation

    Variational inference is not the only way to obtain a parametric approximation of a posterior distribution.

    A simpler approach is known as Laplace approximation.

    The idea is to fit a Gaussian distribution around the maximum a posteriori (MAP) solution z_MAP = arg max_z p(x, z). Note that this corresponds to the point at which the posterior is maximized.

    Osvaldo Simeone ML4Engineers 88 / 149

  • Laplace Approximation

    The Laplace approximation chooses a Gaussian distribution q(z|x) = N(z|z_MAP, Θ^{−1}), where the precision matrix Θ is chosen to match the Hessian (with respect to z) of the log-loss −log p(z|x) at z = z_MAP.

    The Hessian of the negative log-posterior −log p(z|x) with respect to z equals the Hessian of the negative log-joint distribution −log p(x, z). This is because the multiplicative normalization term needed to obtain the posterior from the joint does not depend on the value of z. This property is important since one does not have access to p(z|x), which we are trying to approximate, but only to −log p(x, z).

    To elaborate, note that the Hessian of −log q(z|x) = −log N(z|z_MAP, Θ^{−1}) with respect to z is equal to Θ. Accordingly, the Laplace approximation computes the Hessian

    ∇^2_z(−log p(z|x)) = ∇^2_z(−log p(x, z))

    with respect to z, and then sets Θ = ∇^2_z(−log p(x, z))|_{z=z_MAP}.

    The Laplace approximation can be quite accurate for posterior distributions that have a single mode.

    Osvaldo Simeone ML4Engineers 89 / 149
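As an illustration, the Python sketch below computes the Laplace approximation for the beta-exponential model used earlier, obtaining z_MAP numerically and approximating the Hessian by a finite difference; the values of (a, b) and x are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Laplace approximation for the beta-exponential model of the earlier example
# (a = b = 10 and x = 1 are illustrative choices).
a, b, x = 10.0, 10.0, 1.0

def neg_log_joint(z):
    # -log p(x, z) = -log Beta(z|a, b) - log Exp(x|z), up to constants
    return -stats.beta(a, b).logpdf(z) + np.log(z) + x / z

# 1) MAP solution: maximize p(x, z) over z in (0, 1).
z_map = minimize_scalar(neg_log_joint, bounds=(1e-4, 1 - 1e-4), method="bounded").x

# 2) Precision = second derivative of -log p(x, z) at z_MAP (numerical here).
h = 1e-4
theta = (neg_log_joint(z_map + h) - 2 * neg_log_joint(z_map) + neg_log_joint(z_map - h)) / h**2

print(f"Laplace approximation: N(z | {z_map:.3f}, {1/theta:.4f})")
```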

  • Example

    The figure below shows the Laplace approximation for the beta-exponential example introduced above. The approximation is more accurate the closer the posterior distribution is to a Gaussian distribution.

    [Figure: true posterior and Laplace approximation for the beta-exponential example, for two choices of the prior parameters (a, b).]

    Osvaldo Simeone ML4Engineers 90 / 149

  • Non-Parametric VI

    Osvaldo Simeone ML4Engineers 91 / 149

  • Non-Parametric VI

    In parametric VI, the variational posterior q(z|x, ϕ), for any fixed x, is parametrized by a vector ϕ of parameters. An example is given by the mean and variance vectors of Gaussian variational posteriors.

    When the true posterior p(z|x) cannot be well approximated by any of the distributions in the set of variational posteriors q(z|x, ϕ), the solution of the parametric VI problem is impaired by an irreducible bias.

    For instance, consider the case in which the true posterior is multi-modal, while the variational distribution is chosen as Gaussian, and is hence unimodal. The variational posterior cannot capture more than one mode of the posterior.

    As we discuss next, an alternative approach is to parametrize the variational distribution with a flexible number of “particles” in the space of the latent variables through a kernel density estimator (KDE).

    Osvaldo Simeone ML4Engineers 92 / 149

  • Non-Parametric VI

    In Chapter 7, we have seen that a distribution q(z) can be approximated with a number S of “particles” {zs}_{s=1}^{S} using the kernel density estimator (KDE)

    q(z) ≃ (1/S) Σ_{s=1}^{S} κ_h(z − zs),

    where κ_h(z) is a (non-negative) kernel function with “bandwidth” h, such as κ_h(z) = N(z|0, h).

    This suggests parametrizing the variational posterior with a set ϕ = {zs}_{s=1}^{S} of particles, as

    q(z|x, ϕ) = (1/S) Σ_{s=1}^{S} κ_h(z − zs).

    The particles depend implicitly on the value x, as for parametric VI. Unlike parametric VI, the parameters ϕ = {zs}_{s=1}^{S} do not impose a specific structure on the variational posterior, apart from the smoothness properties implied by the choice of a particular kernel.

    Osvaldo Simeone ML4Engineers 93 / 149
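A minimal Python sketch of this particle-based parametrization is given below; the particle locations, bandwidth, and evaluation grid are illustrative assumptions.

```python
import numpy as np

def kde_posterior(z_grid, particles, h=0.1):
    """Particle-based variational posterior q(z|x, phi) as a Gaussian KDE (sketch)."""
    # One Gaussian kernel N(z|z_s, h) per particle, averaged over the S particles.
    diffs = z_grid[:, None] - particles[None, :]
    kernels = np.exp(-0.5 * diffs**2 / h) / np.sqrt(2.0 * np.pi * h)
    return kernels.mean(axis=1)

# usage with hypothetical particles
particles = np.array([-1.2, -0.9, 0.8, 1.1, 1.3])
z_grid = np.linspace(-3, 3, 7)
print(kde_posterior(z_grid, particles))
```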

  • Non-Parametric VI

    The outlined “non-parametric parametrization” of the variational posterior can flexibly represent any posterior distribution, as long as the number of particles is large enough and the posterior distribution is sufficiently smooth.

    As with all non-parametric techniques, this approach suffers from the curse of dimensionality: as the dimensionality of the latent space increases, the number of required particles generally increases exponentially.

    That said, for sufficiently small dimensions of the latent vector z, the flexibility of the non-parametric model can yield significantly more accurate approximations of the posterior that are not limited by the bias of parametric models.

    Osvaldo Simeone ML4Engineers 94 / 149

  • Stein Variational Gradient Descent (SVGD)

    Non-parametric VI is hence formulated as the problem of minimizing the free energy over the particles ϕ = {zs}_{s=1}^{S}, i.e.,

    min_ϕ fe(x, q(z|x, ϕ)).

    Stein Variational Gradient Descent (SVGD) is an iterative method that updates a set of particles in the direction that locally minimizes the free energy.

    Given a set of particles ϕ^{(i−1)} = {z_s^{(i−1)}}_{s=1}^{S} at iteration i − 1, SVGD applies a smooth transformation r(·) to all particles as

    z_s^{(i)} ← z_s^{(i−1)} + γ^{(i)} r(z_s^{(i−1)}), for s = 1, ..., S,

    with learning rate γ^{(i)}.

    The transformation r(·) is optimized to maximize the rate of local descent of the free energy within a specific functional class known as a reproducing kernel Hilbert space (RKHS).

    Osvaldo Simeone ML4Engineers 95 / 149

  • Stein Variational Gradient Descent (SVGD)

    An RKHS is determined by a positive semi-definite kernel function κ(z, z′). In particular, an RKHS contains functions that can be expressed as linear combinations of functions κ(z, ·) for a number of values z.

    As discussed in Chapter 4 (see Appendix), a positive semi-definite kernel function κ(z, z′) can be thought of as defining the degree of “similarity” between two vectors z and z′.

    A typical example of a positive semi-definite kernel function is the “Gaussian” kernel κ(z, z′) = exp(−||z − z′||^2). Note that the kernel need not be normalized.

    Note also that the positive semi-definite kernel function κ(·, ·) is conceptually different from the non-negative kernel function κ_h(·) used for the KDE.

    By restricting the optimization to a specific subset of an RKHS and approximating the resulting expectation, it can be proven that the optimal update operates as summarized in the following algorithm.

    Osvaldo Simeone ML4Engineers 96 / 149

  • Stein Variational Gradient Descent (SVGD)

    initialize particles ϕ^(1) = {z_s^(1)}_{s=1}^S (e.g., randomly)
    for i = 1, 2, ...
    I for all particles s = 1, ..., S, obtain the next iterate as

      z_s^(i+1) ← z_s^(i) + γ^(i) r_s^(i)

      with

      r_s^(i) = ∑_{s′=1}^S { κ(z_s^(i), z_{s′}^(i)) ∇_{z_{s′}} log p(x, z_{s′}^(i)) + ∇_{z_{s′}} κ(z_s^(i), z_{s′}^(i)) },

      where the first term weights the negative supervised loss gradient by the kernel, and the second term acts as a repulsive "force";

    I if stopping criterion satisfied, then return ϕ^(i+1) = {z_s^(i+1)}_{s=1}^S; otherwise, continue.
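
    The following NumPy sketch implements one SVGD iteration for a scalar latent variable with the Gaussian kernel κ(z, z′) = exp(−(z − z′)²); the target grad_log_joint (the score of a two-peak unnormalized joint), the number of particles, and the learning rate are illustrative assumptions.

    import numpy as np

    def svgd_update(particles, grad_log_joint, lr=5e-2):
        # One SVGD iteration for scalar particles with the Gaussian kernel
        # kappa(z, z') = exp(-(z - z')**2). grad_log_joint(z) must return
        # d/dz log p(x, z), i.e., the negative supervised-loss gradient.
        z = particles[:, None]                        # (S, 1), indexed by s
        zp = particles[None, :]                       # (1, S), indexed by s'
        kappa = np.exp(-(z - zp) ** 2)                # kappa(z_s, z_s')
        grad_kappa = 2.0 * (z - zp) * kappa           # d kappa / d z_s' (repulsive force)
        score = grad_log_joint(particles)[None, :]    # d log p(x, z_s') / d z_s'
        r = (kappa * score + grad_kappa).sum(axis=1)  # driving term + repulsion over s'
        return particles + lr * r

    # illustrative two-peak target: log p(x, z) = log(N(z|1.5, 1) + N(z|-1.5, 1)) + const.
    def grad_log_joint(z):
        p1 = np.exp(-0.5 * (z - 1.5) ** 2)
        p2 = np.exp(-0.5 * (z + 1.5) ** 2)
        return (-(z - 1.5) * p1 - (z + 1.5) * p2) / (p1 + p2)

    particles = np.random.randn(20)
    for _ in range(500):
        particles = svgd_update(particles, grad_log_joint)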

    Osvaldo Simeone ML4Engineers 97 / 149

  • Stein Variational Gradient Descent (SVGD)

    As a result, each particle is updated in a direction that depends on all other particles:

    I The first term moves each particle z_s in the direction that minimizes the supervised loss, weighting the contribution of each other particle z_{s′} by the similarity κ(z_s, z_{s′});

    I and the second term ensures that particles are repulsed if they are "too similar".

    To understand the operation of the repulsive force due to the second term, consider the use of the kernel κ(z_s, z_{s′}) = exp(−||z_s − z_{s′}||²), for which we have

    ∇_{z_{s′}} κ(z_s, z_{s′}) = 2(z_s − z_{s′}) κ(z_s, z_{s′}).

    This contribution to the update hence drives particle z_s away from particle z_{s′}, i.e., in the direction (z_s − z_{s′}), to a degree that is proportional to the similarity κ(z_s, z_{s′}).

    Osvaldo Simeone ML4Engineers 98 / 149

  • Example

    In the figure, the blue line is the true posterior (obtained by normalizing p(x, z)), and the red line represents the KDE, with h = 0.1, obtained from the S = 20 particles, shown as a function of the iteration number i.

    The particles first move towards the first peak of the posterior, driven by the first term in the update, and are then repulsed by the second term so as to cover the support of the distribution.

    [Figure: six panels, at successive iterations i, showing the true posterior (blue) and the particle-based KDE (red) over the range z ∈ [−4, 4].]

    Osvaldo Simeone ML4Engineers 99 / 149

  • Amortized Variational Inference

    Osvaldo Simeone ML4Engineers 100 / 149

  • Amortized VI

    Up to now, we have studied the problem of computing an approximation of the posterior p(z|x) for a given value x. This yields computationally intensive solutions, whereby the optimization of the variational posterior q(z|x, ϕ) needs to be repeated for all values of x of interest.

    Amortized VI aims at "amortizing" the complexity of VI by sharing the parameters ϕ of the variational posterior q(z|x, ϕ) for all values x. This is done by minimizing the average free energy

    min_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ))],

    where the average is taken over the marginal distribution p(x) of the input.

    The model q(z |x , ϕ) can be any parametrized conditional probability.

    Osvaldo Simeone ML4Engineers 101 / 149

  • Amortized VI

    As an example, a typical choice for the amortized variational posterior is the Gaussian distribution defined as

    q(z|x, ϕ) = N(z | ν(x|ϕ), Diag(s(x|ϕ)²)),

    with mean vector ν(x|ϕ) = [ν_1(x|ϕ), ..., ν_M(x|ϕ)]^T and diagonal covariance matrix Diag(s(x|ϕ)²) with diagonal entries s_1(x|ϕ)², ..., s_M(x|ϕ)², where the 2M × 1 vector (ν(x|ϕ), log(s(x|ϕ))) is the output of a neural network with weights ϕ and input x.

    Note that this parametrization assumes standard deviations s_m(x|ϕ) = exp(log(s_m(x|ϕ))), which ensures a positive value for all possible (finite) outputs log(s(x|ϕ)) of the neural network.

    A key advantage of this parametrization is that, assuming a Gaussian prior p(z) = N(z|0, I), it enables the use of the reparametrization trick, as detailed next.
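
    A minimal PyTorch sketch of such a network is given below; the hidden-layer size and the names AmortizedGaussianPosterior and sample are illustrative choices rather than a prescribed architecture.

    import torch
    import torch.nn as nn

    class AmortizedGaussianPosterior(nn.Module):
        # Neural network producing the 2M outputs (nu(x|phi), log s(x|phi))
        # of the Gaussian amortized variational posterior q(z|x, phi).
        def __init__(self, x_dim, z_dim, hidden=32):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.mean_head = nn.Linear(hidden, z_dim)       # nu(x|phi)
            self.log_std_head = nn.Linear(hidden, z_dim)    # log s(x|phi)

        def forward(self, x):
            h = self.body(x)
            nu = self.mean_head(h)
            s = torch.exp(self.log_std_head(h))             # guarantees s > 0
            return nu, s

        def sample(self, x):
            # Reparametrized sample z = nu(x|phi) + s(x|phi) * e with e ~ N(0, I).
            nu, s = self(x)
            return nu + s * torch.randn_like(s)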

    Osvaldo Simeone ML4Engineers 102 / 149

  • Amortized VI

    Using this choice for the amortized variational posterior, to apply the reparametrization trick with S = 1, we draw one sample x ∼ p(x) and one auxiliary sample e ∼ N(0, I). Furthermore, at the given parameter vector ϕ, we compute z = ν(x|ϕ) + s(x|ϕ) ⊙ e.

    We then need to compute the gradient estimate ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ))], which is given as

    ∇_ϕ(ν(x|ϕ) + s(x|ϕ) ⊙ e) · ∇_z(− log p(x|z)) + ∇_ϕ KL(q(z|x, ϕ)||p(z)).

    The first term can be directly computed as

    (∇_ϕ ν(x|ϕ) + ∇_ϕ(s(x|ϕ) ⊙ e)) · ∇_z(− log p(x|z)),

    where the Jacobians ∇_ϕ ν(x|ϕ) and ∇_ϕ s(x|ϕ), as well as the log-loss gradient ∇_z(− log p(x|z)), can be directly computed, typically via automatic differentiation, depending on the given implementation.

    Using the gradients seen above for the KL divergence, the second term can be evaluated as

    ∇_ϕ KL(q(z|x, ϕ)||p(z)) = ∇_ϕ ν(x|ϕ) · ν(x|ϕ) + ∇_ϕ s(x|ϕ) · (s(x|ϕ) − (s(x|ϕ))^(−1)).

    Osvaldo Simeone ML4Engineers 103 / 149

  • Reparametrized Amortized VI

    initialize ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) ∼ p(x) and one auxiliary sample e^(i) ∼ N(0, I);
    I compute z^(i) = ν(x^(i)|ϕ^(i)) + s(x^(i)|ϕ^(i)) ⊙ e^(i) and estimate the gradient ∇_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] ≈ ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] with

      ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] = (∇_ϕ ν(x^(i)|ϕ^(i)) + ∇_ϕ(s(x^(i)|ϕ^(i)) ⊙ e^(i))) · ∇_z(− log p(x^(i)|z^(i)))
                                              + ∇_ϕ ν(x^(i)|ϕ^(i)) · ν(x^(i)|ϕ^(i))
                                              + ∇_ϕ s(x^(i)|ϕ^(i)) · (s(x^(i)|ϕ^(i)) − (s(x^(i)|ϕ^(i)))^(−1));

    I obtain the next iterate as

      ϕ^(i+1) ← ϕ^(i) − γ^(i) ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))];

    I if stopping criterion satisfied, then return ϕ^(i+1); otherwise, continue.
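
    The following PyTorch sketch implements one such doubly stochastic update, letting automatic differentiation handle the Jacobian products and using the closed-form KL divergence to the prior N(0, I); the posterior module (e.g., the AmortizedGaussianPosterior sketched earlier) and the callable log_lik(x, z), which returns log p(x|z), are assumptions.

    import torch

    def reparam_amortized_vi_step(x, posterior, log_lik, optimizer):
        # One S = 1 update of phi based on the reparametrization trick, for a
        # single observation x drawn from p(x) (e.g., a data point).
        nu, s = posterior(x)
        z = nu + s * torch.randn_like(s)                 # reparametrized sample
        # KL( N(nu, diag(s^2)) || N(0, I) ), summed over the latent dimensions
        kl = 0.5 * (s ** 2 + nu ** 2 - 1.0 - 2.0 * torch.log(s)).sum()
        free_energy = -log_lik(x, z) + kl                # single-sample estimate
        optimizer.zero_grad()
        free_energy.backward()                           # autograd replaces the manual Jacobians
        optimizer.step()                                 # phi <- phi - gamma * gradient estimate
        return free_energy.item()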

    Osvaldo Simeone ML4Engineers 104 / 149

  • Reparametrized Amortized VI

    Note that the expression above for the gradient estimate ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] assumes the general case in which the parameters ϕ affect both the mean and the standard deviation vectors.

    If disjoint parts of the vector ϕ enter the mean and standard deviation vectors, the gradient with respect to each part of the vector includes only the relevant two terms in the sum presented in the previous slide.

    Osvaldo Simeone ML4Engineers 105 / 149

  • Black-Box Amortized VI

    When the variational distribution does not have a reparametrizable structure, one can use black-box VI, i.e., the REINFORCE gradient, in lieu of the reparametrization trick, yielding the algorithm below.

    initialize ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) ∼ p(x) and one sample for the latent variable as z^(i) ∼ q(z|x^(i), ϕ^(i));
    I obtain the next iterate as

      ϕ^(i+1) ← ϕ^(i) + γ^(i) (c^(i) − log(q(z^(i)|x^(i), ϕ^(i)) / p(x^(i), z^(i)))) · ∇_ϕ log q(z^(i)|x^(i), ϕ^(i));

    I if stopping criterion satisfied, then return ϕ^(i+1); otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^(i) = (1/L) ∑_{j=1}^L log(q(z^(i−j)|x^(i−j), ϕ^(i−j)) / p(x^(i−j), z^(i−j))).
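
    A PyTorch sketch of one black-box update is given below; it assumes a Gaussian variational posterior module as above, a callable log_joint(x, z) returning log p(x, z), and a fixed baseline c (which can be replaced by the running average just described).

    import torch

    def blackbox_amortized_vi_step(x, posterior, log_joint, optimizer, baseline=0.0):
        # One REINFORCE update of the variational parameters phi.
        nu, s = posterior(x)
        with torch.no_grad():
            z = nu + s * torch.randn_like(s)             # z ~ q(z|x, phi), no reparametrization
        log_q = torch.distributions.Normal(nu, s).log_prob(z).sum()
        with torch.no_grad():
            signal = log_q - log_joint(x, z) - baseline  # log(q/p) - c
        surrogate = signal * log_q                       # gradient equals the REINFORCE estimator
        optimizer.zero_grad()
        surrogate.backward()
        optimizer.step()
        return signal.item()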

    Osvaldo Simeone ML4Engineers 106 / 149

  • Amortized VI

    An important observation is that the algorithms summarized above are doubly stochastic: the estimate of the gradient requires sampling both the data and the latent (or auxiliary) variables.

    They can be directly generalized by using multiple samples per iteration rather than a single one.

    Osvaldo Simeone ML4Engineers 107 / 149

  • Variational EM

    Osvaldo Simeone ML4Engineers 108 / 149

  • Variational EM

    As we have seen in Chapter 7, optimal soft prediction plays a key role in the design of training algorithms for models with latent variables.

    We will now discuss how VI can be leveraged to make the implementation of such algorithms more efficient and scalable.

    Consider a generative model p(x, z|θ) with observed vector x, latent vector z, and model parameter θ. As we have seen, this model may be used in different settings:

    I for unsupervised learning of directed and undirected generative models;
    I and for supervised learning of mixture models (in which case x represents the output and there is an implicit conditioning on the input).

    Given a data set D = {x_n}_{n=1}^N, we specifically focus on the ML problem

    min_θ −(1/N) ∑_{n=1}^N log p(x_n|θ).

    Osvaldo Simeone ML4Engineers 109 / 149

  • Variational EM

    As we have seen in Chapter 7, this is equivalent to the problem

    min_θ (1/N) ∑_{n=1}^N min_{q_n(z_n|x_n)} fe(x_n, q_n(z_n|x_n)|θ),

    where the free energy is given as

    fe(x, q(z|x)|θ) = E_{z∼q(z|x)}[− log p(x, z|θ)] − E_{z∼q(z|x)}[− log q(z|x)],

    in which the first term is the supervised log-loss ℓ_s(x, q(z|x)|θ) and the second term equals the entropy H(q(z|x)).

    Based on this equivalence, the EM algorithm alternates between optimizing over the variational posteriors q_n(z_n|x_n) for all n = 1, ..., N in the E step and over the model parameter θ in the M step.

    Osvaldo Simeone ML4Engineers 110 / 149

  • Variational EM

    The EM algorithm assumes that both steps can be carried out exactly:

    I given the current iterate θ_old, the E step returns the optimal solution q_n(z_n|x_n) = p(z_n|x_n, θ_old), with p(z_n|x_n, θ_old) being the posterior computed from the joint p(x_n, z_n|θ_old);
    I and the M step exactly solves the problem

      min_θ { (1/N) ∑_{n=1}^N ℓ_s(x_n, p(z_n|x_n, θ_old)|θ) }.

    Variational EM (VEM) replaces the E step with the approximate posterior obtained via amortized VI, while the M step is also typically approximated via SGD.

    Osvaldo Simeone ML4Engineers 111 / 149

  • Variational EM

    Specifically, VEM introduces a class Q of parametric posterior models q(z|x, ϕ), and it tackles the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N fe(x_n, q(z_n|x_n, ϕ)|θ).

    Note that the variational parameter ϕ is shared across all data points: the variational model is amortized across the entire data set.

    As another remark, one could consider factorized, and possibly parametrized, posteriors, but we will not detail this option here.

    Depending on the form of the variational posterior, the optimization over ϕ can be done using either black-box VI, i.e., the REINFORCE gradient, or the reparametrization trick.

    Furthermore, there are several ways to carry out the optimization over the model parameters θ and the variational parameters ϕ.

    Osvaldo Simeone ML4Engineers 112 / 149

  • Variational EM

    [Figure: plate diagram over n = 1, ..., N showing the per-observation latent variables z_n and the observations x_n, the model parameter θ defining p(x_n, z_n|θ), and the variational parameter ϕ defining the amortized posterior q(z_n|x_n, ϕ).]

    Osvaldo Simeone ML4Engineers 113 / 149

  • Variational EM

    Below, we detail a standard approach based on SGD, in which optimization steps for the E and M steps are carried out simultaneously.

    The gradients over both the model and the variational parameters are estimated using doubly stochastic estimators based on samples of the observed and latent variables. In particular, the gradient over the variational parameters is obtained via black-box amortized VI.

    Osvaldo Simeone ML4Engineers 114 / 149

  • Black-Box VEM

    initialize θ^(1), ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) from the data set D and one sample for the latent variable as z^(i) ∼ q(z|x^(i), ϕ^(i));
    I obtain the next iterate for the variational parameters using the REINFORCE gradient (black-box VI) as

      ϕ^(i+1) ← ϕ^(i) + γ^(i) (c^(i) − log(q(z^(i)|x^(i), ϕ^(i)) / p(x^(i), z^(i)|θ^(i)))) · ∇_ϕ log q(z^(i)|x^(i), ϕ^(i));

    I obtain the next iterate for the model parameters as

      θ^(i+1) ← θ^(i) + ξ^(i) ∇_θ log p(x^(i), z^(i)|θ^(i));

    I if stopping criterion satisfied, then return θ^(i+1) and ϕ^(i+1); otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^(i) = (1/L) ∑_{j=1}^L log(q(z^(i−j)|x^(i−j), ϕ^(i−j)) / p(x^(i−j), z^(i−j)|θ^(i−j))).
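
    The following PyTorch sketch combines the two updates in a single doubly stochastic step; the posterior module and the callable log_joint(x, z), which returns log p(x, z|θ) and depends on the model parameters held by opt_theta, are assumptions in the spirit of the earlier sketches.

    import torch

    def blackbox_vem_step(x, posterior, log_joint, opt_phi, opt_theta, baseline=0.0):
        # One black-box VEM iteration: a REINFORCE step on the variational
        # parameters phi and an SGD step on the model parameters theta.
        nu, s = posterior(x)
        with torch.no_grad():
            z = nu + s * torch.randn_like(s)            # z ~ q(z|x, phi)
        log_q = torch.distributions.Normal(nu, s).log_prob(z).sum()
        log_p = log_joint(x, z)                         # tracks theta
        with torch.no_grad():
            signal = log_q - log_p - baseline           # log(q/p) - c
        # phi: REINFORCE surrogate; theta: ascent on log p(x, z|theta)
        loss = signal * log_q - log_p
        opt_phi.zero_grad()
        opt_theta.zero_grad()
        loss.backward()
        opt_phi.step()
        opt_theta.step()
        return signal.item()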

    Osvaldo Simeone ML4Engineers 115 / 149

  • Example

    Consider again the beta-exponential model, but let us assume now that the likelihood contains a trainable parameter θ. Mathematically, we have the following joint distribution p(x, z|θ):

    z ∼ p(z) = Beta(z|a, b)
    x|z ∼ p(x|z, θ) = Exp(x|θz) = (1/(θz)) exp(−x/(θz)) 1(x ≥ 0),

    where the prior parameters (a, b) are fixed.

    Unlike the example above, in which we focused on the approximation of the posterior p(z|x), we are now interested in training the generative model p(x, z|θ) = p(z)p(x|z, θ) via ML, based on a data set D = {x_n}_{n=1}^N of observations.
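
    As an aid to implementing black-box VEM for this example, a SciPy sketch of the joint log-density log p(x, z|θ) is given below; the exponential likelihood is parametrized through its mean θz (matching the density written above, an assumption), and the prior parameters a, b are placeholders for the fixed values used in the experiment.

    import numpy as np
    from scipy import stats

    def log_joint(x, z, theta, a=2.0, b=2.0):
        # log p(x, z|theta) = log Beta(z|a, b) + log Exp(x|theta * z),
        # with the exponential distribution parametrized by its mean theta * z.
        return (stats.beta.logpdf(z, a, b)
                + stats.expon.logpdf(x, scale=theta * z))

    # illustrative evaluation at a single point
    print(log_joint(x=1.3, z=0.4, theta=2.0))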

    Osvaldo Simeone ML4Engineers 116 / 149

  • Example

    To this end, we apply black-box VEM by assuming a beta variational posterior.

    We first consider a non-amortized solution in which, at each iteration i, we run several (here 100) iterations of black-box VI in order to obtain a variational distribution q(z|x^(i), ϕ^(i)) = Beta(z|ϕ_1^(i), ϕ_2^(i)) that approximates the posterior p(z|x^(i), θ^(i)).

    This solution requires running a separate black-box VI process at each iteration.

    The figure shows one random run of the algorithm.

    Osvaldo Simeone ML4Engineers 117 / 149

  • Example

    The panels on the right show the evolution of the variational parameters for two iterations i.

    [Figure: one panel showing the evolution of the algorithm's iterates over 1000 iterations, and two panels showing the trajectories of the variational parameters (ϕ_1^(i), ϕ_2^(i)) at two different iterations i.]

    Osvaldo Simeone ML4Engineers 118 / 149

  • Example

    We then consider the (amortized) black-box VEM algorithm detailed above. For this purpose, we set the amortized posterior q(z|x, ϕ) = Beta(z|ϕ_1(1 + x), ϕ_2), where ϕ = [ϕ_1, ϕ_2]^T ∈ R².

    Note that here the variational parameters evolve with the model parameters over the same number of iterations.

    The figure shows one random run of the algorithm.

    Osvaldo Simeone ML4Engineers 119 / 149

  • Variational Autoencoders

    Variational autoencoders (VAEs) are an instance of VEM that uses the reparametrization trick, instead of black-box VI, in order to estimate the gradient over the variational parameters.

    Accordingly, VAEs set the amortized variational posterior as

    q(z|x, ϕ) = N(z | ν(x|ϕ), Diag(s(x|ϕ)²)),

    as discussed above, as well as the prior p(z) = N(z|0, I).

    The resulting algorithm can be derived as above, but with the reparametrization trick in lieu of the REINFORCE gradient.
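
    A minimal PyTorch sketch of the resulting per-example free energy (the negative ELBO) is given below; the layer sizes and the Bernoulli decoder, which assumes observations x with entries in [0, 1], are illustrative choices. A training step simply backpropagates free_energy(x) and applies an SGD (or Adam) update, so that the encoder parameters ϕ and the decoder parameters θ are learned jointly.

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        # Gaussian amortized encoder q(z|x, phi) and Bernoulli decoder p(x|z, theta).
        def __init__(self, x_dim, z_dim, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.enc_mean = nn.Linear(hidden, z_dim)
            self.enc_log_std = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

        def free_energy(self, x):
            h = self.enc(x)
            nu, s = self.enc_mean(h), torch.exp(self.enc_log_std(h))
            z = nu + s * torch.randn_like(s)              # reparametrization trick
            logits = self.dec(z)
            rec = nn.functional.binary_cross_entropy_with_logits(
                logits, x, reduction='sum')               # -log p(x|z, theta)
            kl = 0.5 * (s ** 2 + nu ** 2 - 1.0 - 2.0 * torch.log(s)).sum()
            return rec + kl                               # free energy (negative ELBO)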

    Osvaldo Simeone ML4Engineers 120 / 149

  • Variational EM

    To conclude this section, we highlight a limitation of VEM, often referred to as posterior mode collapse.

    Assuming models of the form p(x, z|θ) = p(z)p(x|z, θ), VEM addresses the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N fe(x_n, q(z_n|x_n, ϕ)|θ),

    where the free energy is defined as

    fe(x, q(z|x, ϕ)|θ) = E_{z∼q(z|x,ϕ)}[− log p(x|z, θ)] + KL(q(z|x, ϕ)||p(z)).

    Suppose now that the model p(x|z, θ) has a large capacity, so that there exists a model parameter θ such that p(x|z, θ) ≈ p_D(x) irrespective of the value of the latent vector z.

    Osvaldo Simeone ML4Engineers 121 / 149

  • Variational EM

    Then, the optimal variational posterior q(z|x, ϕ) equals the prior p(z), assuming that the variational model class is large enough. As a result, the variational posterior is not useful for inference.

    The point here is that, when applying the VEM algorithm, one is not guaranteed to obtain a useful variational posterior.

    Some solutions add a mutual information term I(x; z), evaluated using the variational posterior, in order to avoid the case outlined above in which x and z are independent.

    Osvaldo Simeone ML4Engineers 122 / 149

  • Generalized Bayesian Inference, Generalized VI, and Generalized VEM

    Osvaldo Simeone ML4Engineers 123 / 149

  • Generalized Bayesian Inference

    As we have discussed throughout this chapter, Bayesian inference can be formulated as the optimization problem

    min_{q(z|x)} E_{z∼q(z|x)}[− log p(x|z)] + KL(q(z|x)||p(z))

    for a fixed observation x, where the first term is the average log-loss in the reconstruction of x from z, and the second term is the divergence between the variational posterior and the prior.

    When the log-loss is substituted by any other loss function ℓ(x, z), and the KL divergence by any other regularizing measure C(q(z|x)), constrained to be a convex function of q(z|x), we obtain generalized Bayesian inference

    min_{q(z|x)} E_{z∼q(z|x)}[ℓ(x, z)] + α C(q(z|x)),

    where α > 0 is a "temperature" parameter, the first term is the average loss in the reconstruction of x from z, and C(q(z|x)) serves as a regularizing "complexity" measure for the variational posterior.

    We will refer to the objective of the problem above as the generalized free energy.

    Osvaldo Simeone ML4Engineers 124 / 149

  • Generalized Bayesian Inference

    Examples of "complexity" measures C(q(z|x)) include

    I the KL divergence KL(q(z|x)||p(z)) with respect to some prior distribution p(z);
    I the negative entropy −H(q(z|x));
    I the quadratic total variation distance (1/2)||q(z|x) − p(z)||² = (1/2) ∑_z |q(z|x) − p(z)|² with respect to some prior distribution p(z) (with the sum replaced by an integral for continuous latent rvs);
    I or other divergence measures between q(z|x) and the prior p(z) (see next chapter).

    Osvaldo Simeone ML4Engineers 125 / 149

  • Generalized Bayesian Inference

    When the KL divergence C(q(z|x)) = KL(q(z|x)||p(z)) is used, the optimal solution of the generalized Bayesian inference problem can be derived as follows.

    Let us write the generalized free energy as

    E_{z∼q(z|x)}[ℓ(x, z)] + α KL(q(z|x)||p(z))
      = E_{z∼q(z|x)}[ℓ(x, z)] + α E_{z∼q(z|x)}[log(q(z|x)/p(z))]
      = α E_{z∼q(z|x)}[log( q(z|x) / (p(z) exp(−(1/α) ℓ(x, z))) )]
      = α KL(q(z|x) || p(z) exp(−α^(−1) ℓ(x, z))),

    where we have used the convention of defining the KL divergence even when the second term is not normalized.

    Osvaldo Simeone ML4Engineers 126 / 149

  • Generalized Bayesian Inference

    It follows that the generalized Bayesian inference problem can be reformulated as the minimization

    min_{q(z|x)} KL(q(z|x) || p(z) exp(−α^(−1) ℓ(x, z))).

    In the Appendix, we show that the solution of the more general problem

    min_{q(z)} KL(q(z)||p̃(z)),

    where p̃(z) is an unnormalized distribution over z, is given by the normalized distribution

    q*(z) = p̃(z)/Z,

    where the constant Z ensures normalization.

    Osvaldo Simeone ML4Engineers 127 / 149

  • Generalized Bayesian Inference

    It follows that the optimal solution of the generalized Bayesian problem with KL divergence-based regularization is given as

    q*(z|x) ∝ p(z) exp(−(1/α) ℓ(x, z)).

    This distribution is known as the Gibbs posterior, and it corresponds to the standard posterior if the loss function is the log-loss, i.e., if ℓ(x, z) = − log p(x|z), and if α = 1.

    The Gibbs posterior "tilts" the prior p(z) with a term that depends on the loss ℓ(x, z).
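
    For a one-dimensional latent variable, the Gibbs posterior can be evaluated by numerical normalization on a grid, as in the following NumPy sketch; the prior, loss function, and temperature used here are illustrative.

    import numpy as np

    def gibbs_posterior(z_grid, loss, prior, alpha=1.0):
        # q*(z|x) proportional to p(z) * exp(-loss(x, z) / alpha), normalized
        # numerically over a uniform grid of z values.
        log_tilted = np.log(prior) - loss / alpha
        log_tilted -= log_tilted.max()           # for numerical stability
        tilted = np.exp(log_tilted)
        dz = z_grid[1] - z_grid[0]
        return tilted / (tilted.sum() * dz)      # integrates (approximately) to 1

    # illustrative example: standard Gaussian prior and a quadratic loss
    z_grid = np.linspace(-1.0, 3.0, 400)
    prior = np.exp(-0.5 * z_grid ** 2) / np.sqrt(2 * np.pi)
    loss = (z_grid - 2.0) ** 2                   # assumed loss l(x, z) for a fixed x
    q_star = gibbs_posterior(z_grid, loss, prior, alpha=0.5)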

    Osvaldo Simeone ML4Engineers 128 / 149

  • Generalized Bayesian Inference

    Examples of generalized posteriors, i.e., of optimal solutions to the problem of minimizing the generalized free energy, can be found in the table below. The derivation follows by leveraging connections to Fenchel duality, as detailed in the Appendix. (The parameter τ satisfies ∑_z (p(z) − (1/α) ℓ(x, z) − τ)_+ = 1, with (x)_+ = max(x, 0).)

    C(q(z|x))                     gen. posterior q*(z|x)
    KL(q(z|x)||p(z))              ∝ p(z) exp(−(1/α) ℓ(x, z))
    −H(q(z|x))                    ∝ exp(−(1/α) ℓ(x, z))
    (1/2)||q(z|x) − p(z)||²       (p(z) − (1/α) ℓ(x, z) − τ)_+

    Osvaldo Simeone ML4Engineers 129 / 149

  • Example

    The figure below shows the generalized posterior for the prior p(z) = N(z|0, 1) and the loss function plotted in the figure.

    [Figure: three panels showing the loss function ℓ and the generalized posteriors q* over z for the complexity measures C = −H, C = KL, and C = 0.5||·||², at temperatures α = 10, α = 0.5 × 10⁻², and α = 10⁻³.]

    Osvaldo Simeone ML4Engineers 130 / 149

  • Example

    When α = 10, the generalized posterior puts more weight on minimizing the complexity term C(q(z|x)).

    I As a result, for C(q(z|x)) = −H(q), the generalized posterior is close to a uniform distribution, whereas it approximates the prior p(z) for C(q(z|x)) = KL(q||p) and C(q(z|x)) = 0.5||q − p||².

    For smaller values of the temperature α, the emphasis shifts to minimizing the average loss, and the generalized posterior increasingly concentrates on the minimizer of the loss function ℓ(x, z).

    Osvaldo Simeone ML4Engineers 131 / 149

  • Generalized VI

    When optimizing over a restricted class Q of posteriors, the generalized Bayesian inference problem is referred to as generalized VI (GVI):

    min_{q(z|x)∈Q} E_{z∼q(z|x)}[ℓ(x, z)] + α C(q(z|x)).

    All the techniques discussed above, with both factorized and parametrized variational posteriors, can be applied.

    For instance, parametrized GVI amounts to the problem

    min_ϕ E_{z∼q(z|x,ϕ)}[ℓ(x, z)] + α C(q(z|x, ϕ)).

    Osvaldo Simeone ML4Engineers 132 / 149

  • Generalized VEM

    In an analogous way, generalized VEM (GVEM) can also be directly defined as the problem

    min_{θ,ϕ} E_{(x,z)∼p(x)q(z|x,ϕ)}[ℓ(x, z|θ)] + α E_{x∼p(x)}[C(q(z|x, ϕ))],

    where the loss function now depends on the model parameters θ.

    The marginal p(x) is generally unknown and is replaced with the empirical distribution, yielding the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N { E_{z_n∼q(z|x_n,ϕ)}[ℓ(x_n, z_n|θ)] + α C(q(z|x_n, ϕ)) }.

    We will see applications of GVI and GVEM in the next chapter.

    Osvaldo Simeone ML4Engineers 133 / 149

  • Summary

    Osvaldo Simeone ML4Engineers 134 / 149

  • Summary

    Most of the algorithms studied in this chapter tackle special cases of the generalized variational EM (VEM) problem. Given a data set D = {x_n}_{n=1}^N, this amounts to the optimization over the model parameters θ and the variational parameters ϕ of the generalized free energy criterion, as in

    min_{ϕ,θ} (1/N) ∑_{n=1}^N E_{z_n∼q(z|x_n,ϕ)}[ℓ(x_n, z_n|θ)] + (α/N) ∑_{n=1}^N KL(q(z|x_n, ϕ)||p(z|θ)),

    where the first term is the average loss in the reconstruction of x_n from z_n, the second term measures the "complexity" of the variational posterior, and

    I q(z|x, ϕ) represents a class of variational distributions, in which the variational parameter ϕ is amortized across all values of the observation x;
    I ℓ(x, z|θ) is a loss function, such as the log-loss ℓ(x, z|θ) = − log p(x|z, θ), which depends on the model parameter θ;
    I p(z) is a reference prior distribution;
    I and α is a "temperature" parameter.

    Osvaldo Simeone ML4Engineers 135 / 149

  • Summary

    When the model parameters θ are fixed, we obtain the VI problem.

    For the VI problem, if no constraint is imposed on the form of the variational posterior, the optimal solution can be efficiently found only in special cases, such as when the joint distribution is in the conjugate exponential family.

    Otherwise, the problem can be efficiently tackled either by black-box VI or by the reparametrization trick.

    I The latter requires a special structure for the variational posterior, while the former can be applied more generally.
    I Both methods apply a doubly stochastic estimate of the gradient.

    When the model parameters θ are to be trained, the problem can be addressed by the variational EM (VEM) algorithm, in which SGD updates are applied to both the model and the variational parameters.

    Osvaldo Simeone ML4Engineers 136 / 149

  • Appendix

    Osvaldo Simeone ML4Engineers 137 / 149

  • Dirichlet-Categorical Model

    The Dirichlet-categorical model is another example of a conjugate exponential-family distribution.

    The likelihood is given by the categorical distribution

    p(x|z) = ∏_{k=0}^{C−1} z_k^{1(x=k)},

    and the conjugate prior is the Dirichlet distribution

    p(z_0, ..., z_{C−1}|α_0, ..., α_{C−1}) = Dir(z_0, ..., z_{C−1}|α_0, ..., α_{C−1}) ∝ ∏_{k=0}^{C−1} z_k^{α_k − 1},

    with parameters α = (α_0, ..., α_{C−1}).

    The Dirichlet distribution has mean and mode given as

    E_{z∼Dir(z_0,...,z_{C−1}|α_0,...,α_{C−1})}[z_k] = α_k / ∑_{j=0}^{C−1} α_j

    mode_{z∼Dir(z_0,...,z_{C−1}|α_0,...,α_{C−1})}[z_k] = (α_k − 1) / (∑_{j=0}^{C−1} α_j − C).

    Osvaldo Simeone ML4Engineers 138 / 149

  • Dirichlet-Categorical Model

    The posterior distribution given L i.i.d. observations x_1, ..., x_L ∼ p(x|z) is the Dirichlet distribution

    p(z|x_1, ..., x_L) = Dir(z_0, ..., z_{C−1}|α_0 + L[0], ..., α_{C−1} + L[C − 1])
                       ∝ ∏_{k=0}^{C−1} z_k^{α_k + L[k] − 1},

    where L[k] denotes the number of observations among x_1, ..., x_L that are equal to k.
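
    Computationally, the update only requires the per-class counts L[k], as in the following NumPy sketch with illustrative prior parameters and observations.

    import numpy as np

    def dirichlet_posterior(alpha, observations, num_classes):
        # Posterior Dirichlet parameters alpha_k + L[k], where L[k] counts the
        # number of observations equal to class k.
        counts = np.bincount(observations, minlength=num_classes)
        return np.asarray(alpha, dtype=float) + counts

    alpha = [1.0, 1.0, 1.0]                       # illustrative symmetric prior
    x = np.array([0, 2, 2, 1, 2])                 # observed class labels
    alpha_post = dirichlet_posterior(alpha, x, num_classes=3)   # -> [2., 2., 4.]
    posterior_mean = alpha_post / alpha_post.sum()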

    Osvaldo Simeone ML4Engineers 139 / 149

  • Derivation of Update Equation for Mean-Field VI

    We need to show that we have

    q*_m(z_m|x) ∝ exp( E_{z_{−m}}[log p(x, z_m, z_{−m})] ),

    where z_{−m} ∼ q(z_{−m}|x) := ∏_{m′≠m} q_{m′}^{(i−1)}(z_{m′}|x).

    Under the mean-field factorization, the entropy H(q(z|x)) equals the sum

    H(q(z|x)) = ∑_{m′=1}^M H(q_{m′}(z_{m′}|x)).

    Furthermore, by using the law of iterated expectations, we can write the supervised log-loss as

    ℓ_s(x, q(z|x)) = E_{z∼q(z|x)}[− log p(x, z)] = E_{z_m}[ E_{z_{−m}}[− log p(x, z_m, z_{−m})] ],

    where z_m ∼ q_m(z_m|x).

    Osvaldo Simeone ML4Engineers 140 / 149

  • Derivation of Update for Mean-Field VI

    The problem of interest, namely

    min_{q_m(z_m|x)} fe(x, q(z|x) = q_m(z_m|x) · q_{−m}^{(i−1)}(z_{−m}|x)),

    can therefore be equivalently expressed as

    min_{q_m(z_m|x)} E_{z_m∼q_m(z_m|x)}[ E_{z_{−m}}[− log p(x, z_m, z_{−m})] ] − H(q_m(z_m|x)).

    This corresponds to the free energy of a model with loss function given by ℓ(x, z_m) = E_{z_{−m}}[− log p(x, z_m, z_{−m})], and hence the optimal solution q*_m(z_m|x) follows directly from the solution of the generalized VI problem.

    Osvaldo Simeone ML4Engineers 141 / 149

  • Proof of the REINFORCE Gradient Formula

    By direct calculation, we have the following equalities:

    ∇_ϕ E_{z∼q(z|x,ϕ)}[g(z|ϕ)] = ∑_z ∇_ϕ ( q(z|x, ϕ) g(z|ϕ) )
      = ∑_z (∇_ϕ q(z|x, ϕ)) g(z|ϕ) + ∑_z q(z|x, ϕ) (∇_ϕ g(z|ϕ))
      = ∑_z q(z|x, ϕ) (∇_ϕ log q(z|x, ϕ)) g(z|ϕ) + E_{z∼q(z|x,ϕ)}[∇_ϕ g(z|ϕ)]
      = E_{z∼q(z|x,ϕ)}[g(z|ϕ) · ∇_ϕ log q(z|x, ϕ)],

    where the sum is replaced by an integral for continuous distributions, and we have used the identities ∇_ϕ log q(z|x, ϕ) = (∇_ϕ q(z|x, ϕ))/q(z|x, ϕ) and E_{z∼q(z|x,ϕ)}[∇_ϕ log q(z|x, ϕ)] = 0, with the latter equality following from the zero mean of the score vector (see Chapter 9).

    Osvaldo Simeone ML4Engineers 142 / 149

  • Variance of the REINFORCE Gradient

    Define ∇̂_ϕ(z|c) := (g(z) − c) · ∇_ϕ log q(z|ϕ) as the REINFORCE estimator, with z ∼ q(z|ϕ).

    The estimator is unbiased, i.e., E_{z∼q(z|ϕ)}[∇̂_ϕ(z|c)] = ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] =: ∇_ϕ.

    Its variance is given as

    Var(∇̂_ϕ(z|c)) = E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c) − ∇_ϕ||²] = E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c)||²] − ||∇_ϕ||²,

    where

    E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c)||²] = E_{z∼q(z|ϕ)}[(g(z) − c)² ||∇_ϕ log q(z|ϕ)||²].

    Osvaldo Simeone ML4Engineers 143 / 149

  • Variance of the REINFORCE Gradient

    What is the optimal value of c?

    The function Var(∇̂_ϕ(z|c)) is convex in c, so differentiating with respect to c yields

    ∂_c Var(∇̂_ϕ(z|c)) = −2 E_{z∼q(z|ϕ)}[(g(z) − c) ||∇_ϕ log q(z|ϕ)||²],

    and the first-order optimality condition leads to the optimal solution

    c* = E_{z∼q(z|ϕ)}[g(z) ||∇_ϕ log q(z|ϕ)||²] / E_{z∼q(z|ϕ)}[||∇_ϕ log q(z|ϕ)||²].

    If we approximate the expectations using the past iterates for z, this yields the selection

    c^(i) = ( (1/L) ∑_{j=1}^L g(z^(i−j)) ||∇_ϕ log q(z^(i−j)|ϕ^(i−j))||² ) / ( (1/L) ∑_{j=1}^L ||∇_ϕ log q(z^(i−j)|ϕ^(i−j))||² ).
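
    A NumPy sketch of this selection is given below; the buffers of past values g(z^(i−j)) and of squared score norms ||∇_ϕ log q||² are illustrative placeholders for quantities stored over the last L black-box VI iterations.

    import numpy as np

    def reinforce_baseline(past_g, past_score_sq_norms):
        # Variance-minimizing baseline c estimated from the last L iterates.
        past_g = np.asarray(past_g)
        past_score_sq_norms = np.asarray(past_score_sq_norms)
        return (past_g * past_score_sq_norms).sum() / past_score_sq_norms.sum()

    # illustrative buffers from the last L = 4 iterations
    g_values = [2.3, 1.8, 2.0, 2.6]
    score_sq_norms = [0.5, 1.2, 0.8, 0.9]
    c = reinforce_baseline(g_values, score_sq_norms)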

    Osvaldo Simeone ML4Engineers 144 / 149