
  • Machine Learning for Engineers: Chapter 10. Variational Inference and Variational Expectation Maximization

    Osvaldo Simeone

    King’s College London

    February 7, 2021

    Osvaldo Simeone ML4Engineers 1 / 149

  • This Chapter

    In the previous chapters, we have often noted the important role played by soft prediction, also known as Bayesian inference:

    I In Chapter 6, we have seen that, once a generative model p(x, t|θ) is trained, optimal prediction requires the computation of the optimal soft predictor, i.e., of the posterior p(t|x, θ) of the target t given the input x;

    I In Chapter 7, we have discussed that, in order to train a model p(x|θ) = E_{z∼p(z|θ)}[p(x|z, θ)] with latent variables z for unsupervised learning, the EM algorithm requires the computation of the posterior p(z|x, θ) at each iteration;

    I and, similarly, in order to train a mixture model p(t|x, θ) = E_{z∼p(z|θ)}[p(t|x, z, θ)] for supervised learning, the EM algorithm requires the computation of the posterior p(z|x, t, θ) at each iteration.

    Osvaldo Simeone ML4Engineers 2 / 149

  • This Chapter

    What can be done when the calculation of the posterior is too complex?

    In this chapter, we first review cases in which the posterior can be computed in closed form, and then cover approximate Bayesian inference methods based on variational inference (VI).

    We will also study the application of variational inference to the training of latent-variable models, introducing the variational EM (VEM) algorithm, as well as variational autoencoders as a special case of VEM.

    We will finally discuss generalized VI and VEM algorithms.

    Osvaldo Simeone ML4Engineers 3 / 149

  • Overview

    Exact Bayesian inference

    Variational inference

    Mean-field variational inference

    Black-box variational inference

    Reparametrization-based variational inference

    Other variational inference methods

    Non-parametric variational inference

    Amortized variational inference

    Variational EM

    Generalized Bayesian inference, generalized VI, and generalized VEM

    Osvaldo Simeone ML4Engineers 4 / 149

  • Exact Bayesian Inference

    Osvaldo Simeone ML4Engineers 5 / 149

  • Bayesian Inference

    Consider a joint distribution p(x, z) over an observable variable x and a latent variable z. Bayesian inference amounts to the computation of the posterior distribution

    p(z|x) = p(x, z) / p(x).

    This requires marginalizing over the latent variables in order to obtain the marginal distribution p(x) = E_{z∼p(z)}[p(x|z)]. This computation is only feasible

    I if the latent vector is low-dimensional, so that the expectation can be evaluated numerically;

    I or if the joint distribution has a special structure that enables an explicit analytical evaluation of the expectation.

    Osvaldo Simeone ML4Engineers 6 / 149

  • Example

    Consider the joint distribution p(x, z) = p(z)p(x|z) defined as

    z ∼ p(z) = N(z|0, I),
    x|z = z ∼ p(x|z) = N(x|µ(z), I),

    where the mean vector µ(z) is a function of the latent variable z.

    Computing the posterior p(z|x) can be done in closed form if µ(z) = Az for some matrix A. We have encountered this example in Chapter 3, where we have seen that the posterior is also Gaussian. We refer to this example as the Gaussian-Gaussian model.

    If µ(z) is a non-linear function of z, e.g., defined by a neural network, the problem of computing p(z|x) for any given value of x is of prohibitive complexity, unless the dimension of z is small so that the integration in the expectation p(x) = E_{z∼p(z)}[p(x|z)] can be well approximated using numerical integration or Monte Carlo techniques.

    Osvaldo Simeone ML4Engineers 7 / 149
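Since the latent vector in this example is low-dimensional, the marginal p(x) = E_{z∼p(z)}[p(x|z)] can indeed be approximated by plain Monte Carlo integration. The short Python sketch below illustrates this for a two-dimensional latent vector and a hypothetical non-linear mean function µ(z) = tanh(Az); the matrix A, the observation x, and the sample size are illustrative assumptions, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical nonlinear Gaussian model: z ~ N(0, I_2), x|z ~ N(mu(z), I_2)
# with mu(z) = tanh(A z). A and x are illustrative choices.
A = np.array([[1.0, -0.5], [0.3, 0.8]])
x = np.array([0.2, -0.4])

def likelihood(z):
    """N(x | mu(z), I) density in two dimensions."""
    mu = np.tanh(A @ z)
    return np.exp(-0.5 * np.sum((x - mu) ** 2)) / (2.0 * np.pi)

# Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x|z)], feasible because z is 2D.
S = 100_000
zs = rng.standard_normal((S, 2))
p_x = np.mean([likelihood(z) for z in zs])
print("Monte Carlo estimate of p(x):", p_x)
```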

  • Conjugate Exponential Family

    The Gaussian-Gaussian model is a special case of a general class of structured joint distributions that admit an efficient computation of the posterior: the conjugate exponential family.

    For joint distributions in the conjugate exponential family, the posterior p(z|x) is in the same class of distributions as the prior p(z). As an example, for the Gaussian-Gaussian model, the posterior is Gaussian like the prior.

    Osvaldo Simeone ML4Engineers 8 / 149

  • Conjugate Exponential Family

    Consider a likelihood p(x|z) that, for every fixed value of z, takes the form of the exponential-family distribution

    p(x|z) = exp( η_l(z)^T s_l(x) + M_l(x) − A_l(z) ),

    with D × 1 sufficient statistics vector s_l(x) and log-base measure function M_l(x). The subscript l indicates that these quantities are related to the likelihood.

    The dependence of rv x on rv z is through the D × 1 natural parameter vector η_l(z), which is a general function of z.

    The log-partition function A_l(η_l(z)) is denoted as A_l(z) for simplicity of notation.

    Osvaldo Simeone ML4Engineers 9 / 149

  • Conjugate Exponential Family

    Let us now assume that the prior distribution p(z) also belongs to a, generally different, class of distributions in the exponential family. Accordingly, we write

    p(z) = exp( η_p^T s_p(z) + M_p(z) − A_p(η_p) ),

    where the subscript p identifies natural parameters, sufficient statistics, log-base measure function, and log-partition function for the prior.

    We assume that the prior p(z) is a standard distribution for which the log-partition function A_p(η_p) is available in closed form, and hence evaluating p(z) does not require any numerical integration.

    As we discussed in Chapter 9, this class of distributions includes many standard choices for both discrete and continuous distributions.

    Osvaldo Simeone ML4Engineers 10 / 149

  • Conjugate Exponential Family

    Now, if we compute the posterior, we obtain

    p(z|x) ∝ p(z) × p(x|z)   (prior × likelihood)
          ∝ exp( s_l(x)^T η_l(z) + η_p^T s_p(z) + M_p(z) − A_l(z) ),

    where we have highlighted only the terms dependent on z.

    For a fixed x, this distribution is generally not easy to normalize, since the sufficient statistics (η_l(z), s_p(z)) and log-base measure M_p(z) − A_l(z) do not correspond to one of the standard distributions in the exponential family.

    Osvaldo Simeone ML4Engineers 11 / 149

  • Conjugate Exponential Family

    This is, however, not the case if we choose the sufficient statistics of the prior as the (D + 1) × 1 vector

    s_p(z) = [ η_l(z)^T, −A_l(z) ]^T.

    This choice of prior is said to be conjugate with respect to the likelihood p(x|z) defined above.

    Osvaldo Simeone ML4Engineers 12 / 149

  • Conjugate Exponential Family

    Plugging this choice into the expression for the posterior, we have

    p(z|x) ∝ exp( s_l(x)^T η_l(z) + η_p^T s_p(z) + M_p(z) − A_l(z) )
          ∝ exp( η_post(x)^T s_p(z) + M_p(z) ),

    where

    η_post(x) = η_p + [ s_l(x)^T, 1 ]^T.

    The posterior is hence in the same class of distributions as the prior, which is characterized by the pair (s_p(z), M_p(z)) of sufficient statistics and log-base measure function.

    Osvaldo Simeone ML4Engineers 13 / 149

  • Conjugate Exponential Family

    Computing the posterior only requires obtaining the natural parameters η_post(x) by adding to the natural parameters η_p of the prior a (D + 1) × 1 vector that consists of the sufficient statistics of the likelihood, s_l(x), and of 1 as the last component. It is emphasized that, since the prior is easy to normalize, so is the posterior.

    For every likelihood in the exponential family, one can always find a conjugate prior.

    The Gaussian-Gaussian model is one example. We will now see another.

    Osvaldo Simeone ML4Engineers 14 / 149

  • Beta-Bernoulli Model

    The beta-Bernoulli model consists of a Bernoulli likelihood with a beta prior, which produces a beta posterior.

    The likelihood p(x|z) = Bern(x|z) is Bernoulli, with the latent variable z identifying the mean parameter, i.e., the probability of the event x = 1 conditioned on z = z: we have p(x = 1|z) = z and p(x = 0|z) = 1 − z. Using the notation introduced above, we have

    p(x|z) = Bern(x|z) = exp( log( z/(1 − z) ) · x − ( −log(1 − z) ) ),

    where η_l(z) = log(z/(1 − z)), s_l(x) = x, and A_l(z) = −log(1 − z).

    Osvaldo Simeone ML4Engineers 15 / 149

  • Beta-Bernoulli Model

    Therefore, the conjugate prior has sufficient statistics

    s_p(z) = [ log( z/(1 − z) ), log(1 − z) ]^T,

    and, writing η_p = [η_p,1, η_p,2]^T for the vector of natural parameters of the prior, the conjugate prior is given by

    p(z) ∝ exp( η_p,1 log( z/(1 − z) ) + η_p,2 log(1 − z) )
         = z^{η_p,1} (1 − z)^{−η_p,1 + η_p,2}
         = z^{a−1} (1 − z)^{b−1}
         =: Beta(z|a, b),

    which corresponds to a beta distribution with parameters a := η_p,1 + 1 and b := −η_p,1 + η_p,2 + 1.

    Osvaldo Simeone ML4Engineers 16 / 149

  • Beta-Bernoulli Model

    The parameters (a, b) define the shape of the Beta prior, as seen in the figure. Note that the support of the Beta distribution is the interval [0, 1].

    [Figure: Beta densities Beta(z|a, b) on the interval [0, 1] for different choices of the parameters (a, b).]

    Osvaldo Simeone ML4Engineers 17 / 149

  • Beta-Bernoulli Model

    The average and the mode, i.e., the value with the maximum density, of the beta distribution are given as

    E_{z∼Beta(a,b)}[z] = a / (a + b),
    mode_{z∼Beta(a,b)}[z] = (a − 1) / (a + b − 2)   when a, b > 1.

    Therefore, increasing a skews the distribution towards larger values of z, while increasing b skews the distribution towards smaller values of z.

    Furthermore, the variance of the prior, Var[z] = ab/((a + b)^2 (a + b + 1)), decreases with a + b.

    Note that the beta prior is a distribution over the space of probabilities, i.e., over the interval [0, 1].

    Osvaldo Simeone ML4Engineers 18 / 149

  • Beta-Bernoulli Model

    Due to conjugacy, the posterior is also a beta distribution, with natural parameters

    η_post(x) = η_p + [ s_l(x), 1 ]^T = η_p + [ x, 1 ]^T.

    Using the relationship between the natural parameters and the parameters (a, b) of a beta distribution, we can hence write the posterior as p(z|x) = Beta(z|a_post(x), b_post(x)), with

    a_post(x) = η_p,1 + x + 1 = a + x,
    b_post(x) = −η_p,1 − x + η_p,2 + 1 + 1 = b + (1 − x).

    Osvaldo Simeone ML4Engineers 19 / 149

  • Beta-Bernoulli Model

    In summary, we have

    p(z|x) = Beta(z|a + 1(x = 1), b + 1(x = 0)),

    which implies that we simply add 1 to a if x = 1 and add 1 to b if x = 0.

    This skews the distribution towards larger values of z if x = 1, and, vice versa, towards smaller values of z if x = 0.

    Osvaldo Simeone ML4Engineers 20 / 149

  • Beta-Bernoulli Model

    Suppose that we make two conditionally independent observations x1 and x2 from this model, so that the joint distribution is

    p(x1, x2, z) = Beta(z|a, b) · Bern(x1|z) · Bern(x2|z).

    We can write the posterior as

    p(z|x1, x2) ∝ ( Beta(z|a, b) · Bern(x1|z) ) · Bern(x2|z)
               ∝ Beta(z|a + 1(x1 = 1), b + 1(x1 = 0)) · Bern(x2|z),

    where the first factor equals p(z|x1).

    This implies that we can treat p(z|x1) as the prior as we make another independent observation x2 ∼ p(x2|z). It follows that we have the posterior

    p(z|x1, x2) = Beta( z | a + Σ_{j=1}^{2} 1(xj = 1), b + Σ_{j=1}^{2} 1(xj = 0) ).

    Osvaldo Simeone ML4Engineers 21 / 149

  • Beta-Bernoulli Model

    By repeating this approach recursively L times, it follows that the posterior p(z|x1, ..., xL) for the joint distribution

    p(x1, ..., xL, z) = Beta(z|a, b) ∏_{j=1}^{L} Bern(xj|z)

    of L conditionally independent observations x1, ..., xL and latent variable z is given by

    p(z|x1, ..., xL) = Beta( z | a + Σ_{j=1}^{L} 1(xj = 1), b + Σ_{j=1}^{L} 1(xj = 0) ) = Beta( z | a + L[1], b + L[0] ),

    where L[1] := Σ_{j=1}^{L} 1(xj = 1) and L[0] := Σ_{j=1}^{L} 1(xj = 0).

    We can then interpret a and b as “pseudocounts”, i.e., as prior observations of 1’s and 0’s made before we measure x1, ..., xL.

    Osvaldo Simeone ML4Engineers 22 / 149
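The following Python sketch illustrates the pseudocount update numerically: it computes the conjugate posterior Beta(a + L[1], b + L[0]) and checks it against a brute-force normalization of prior × likelihood on a grid. The prior parameters and the observation sequence are illustrative assumptions.

```python
import numpy as np
from scipy import stats

# Conjugate beta-Bernoulli update: after L observations, the posterior is
# Beta(a + L[1], b + L[0]). Prior parameters and data below are illustrative.
a, b = 2.0, 3.0
x = np.array([1, 0, 1, 1, 0, 1, 1])

a_post = a + np.sum(x == 1)          # a + L[1]
b_post = b + np.sum(x == 0)          # b + L[0]

# Sanity check: normalize prior x likelihood on a grid and compare with Beta(a_post, b_post).
z = np.linspace(1e-3, 1 - 1e-3, 2000)
unnorm = stats.beta(a, b).pdf(z) * z ** np.sum(x == 1) * (1 - z) ** np.sum(x == 0)
numeric = unnorm / (unnorm.sum() * (z[1] - z[0]))            # simple Riemann normalization
print(np.max(np.abs(numeric - stats.beta(a_post, b_post).pdf(z))))   # close to 0
```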

  • Example

    On an online shopping platform, there are two sellers offering a product at the same price:

    I The first has 30 positive reviews and 0 negative reviews, while the second has 90 positive reviews and 10 negative reviews. Which one to choose?

    Osvaldo Simeone ML4Engineers 23 / 149

  • Example

    Let us introduce a latent variable z ∼ Beta(z|a, b) that describes the probability of a vendor receiving a positive review.

    Then, we model the reviews x1, ..., xL as independent binary observations, as seen above, with xi = 1 representing a positive review.

    The posterior p(z|x) is hence given as p(z|x) = Beta(z|a + L[1], b + L[0]), where L[1] counts the number of positive reviews and L[0] counts the number of negative reviews.

    We have L[1] = 30 with L = 30 for the first seller, and L[1] = 90 with L = 100 for the second.

    Osvaldo Simeone ML4Engineers 24 / 149

  • Example

    When the prior is sufficiently strong, i.e., with a + b large enough and hence the variance small enough, the posterior for the second vendor has a larger mode. This should suggest caution in choosing the first vendor based on the available evidence: unless the prior is weak, and hence the posterior depends mostly on the data likelihood, the second vendor appears to be preferable.

    [Figure: posterior densities Beta(z|a + L[1], b + L[0]) for the two sellers, plotted on the interval z ∈ [0.5, 1].]
    Osvaldo Simeone ML4Engineers 25 / 149
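A minimal Python sketch of this comparison is given below; the specific prior strengths a = b are illustrative assumptions. Consistently with the discussion above, a weak prior gives the first seller the larger posterior mode, while stronger priors favor the second seller.

```python
def beta_mode(a, b):
    """Mode of Beta(a, b), valid for a, b > 1."""
    return (a - 1.0) / (a + b - 2.0)

# Posterior modes for the two sellers under priors of increasing strength
# (the specific values a = b are illustrative).
for a in (2.0, 10.0, 50.0):
    b = a
    mode1 = beta_mode(a + 30, b + 0)     # seller 1: 30 positive, 0 negative reviews
    mode2 = beta_mode(a + 90, b + 10)    # seller 2: 90 positive, 10 negative reviews
    print(f"a = b = {a:>4}: mode seller 1 = {mode1:.3f}, mode seller 2 = {mode2:.3f}")
```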

  • Conjugate Exponential Family

    As mentioned, for all distributions in the exponential family, there exists a conjugate prior on the model parameters. The table below provides some examples. The Dirichlet-categorical example is detailed in the Appendix. We used the notation Θ′ = Θ0 + LΘ for the last row.

    likelihood p(x|z) = ∏_{j=1}^{L} p(xj|z)  | conjugate prior p(z) | posterior p(z|x)
    Bern(xj|z), with z = E[xj]               | Beta(a, b)           | Beta(a + L[1], b + L[0])
    Cat(xj|z), with zk = E[x_{k,j}^{OH}]     | Dirichlet(α)         | Dirichlet( α + Σ_{j=1}^{L} x_j^{OH} )
    N(xj|z, Θ^{−1}), with fixed Θ            | N(µ0, Θ0^{−1})       | N( (Θ′)^{−1}( Θ0 µ0 + Θ Σ_{j=1}^{L} xj ), (Θ′)^{−1} )

    Osvaldo Simeone ML4Engineers 26 / 149

  • Conjugate Exponential Family

    To conclude this section, it should be mentioned that there are also conjugate models with a likelihood not in the exponential family.

    An example is the uniform likelihood distribution p(x|z) = U(x|[0, z]) with Pareto prior p(z). The posterior p(z|x) is also a Pareto distribution. Recall that the uniform distribution is not in the exponential family.

    Osvaldo Simeone ML4Engineers 27 / 149

  • Variational Inference

    Osvaldo Simeone ML4Engineers 28 / 149

  • Bayesian Inference as Optimization

    As we have seen in Chapter 7, the posterior distribution can be obtained as the solution of the minimization of the free energy, i.e.,

    min_{q(z|x)} fe(x, q(z|x)),

    where the optimization is over all possible conditional distributions q(z|x) and the free energy is defined as

    fe(x, q(z|x)) = E_{z∼q(z|x)}[ −log p(x, z) ] − E_{z∼q(z|x)}[ −log q(z|x) ],

    where the first term is the supervised log-loss ℓ_s(x, q(z|x)) and the second term is the entropy H(q(z|x)) of the variational posterior.

    Osvaldo Simeone ML4Engineers 29 / 149

  • Variational Inference

    Since the conditional distribution q(z|x) is subject to optimization, we will refer to it as the variational posterior in this chapter.

    The problem of minimizing the free energy is clearly just as complex as that of computing the posterior p(z|x), since the posterior is the only solution to this problem.

    Variational inference (VI) obtains an approximation of the posterior p(z|x) by

    I restricting the space Q of variational distributions q(z|x) over which the optimization of the free energy is carried out;

    I and adopting a specific, typically local, optimization algorithm to tackle the resulting problem

    min_{q(z|x)∈Q} fe(x, q(z|x)).

    Osvaldo Simeone ML4Engineers 30 / 149

  • Variational Inference

    The space of feasible distributions Q can be restricted by imposing:
    I a factorization of the variational posterior;
    I a parametrization of the variational posterior;
    I or both constraints.

    Standard algorithms used for the minimization of the free energy include
    I SGD;
    I coordinate descent;
    I and combinations of both methods.

    Osvaldo Simeone ML4Engineers 31 / 149

  • Mean-Field Variational Inference

    Osvaldo Simeone ML4Engineers 32 / 149

  • Mean-Field VI

    We first discuss VI based on the factorization of the variational posterior.

    Denote as z = [z1, ..., zM]^T the M × 1 latent vector. The simplest factorization of the variational posterior is known as mean field, and it assumes that the variational posterior can be written as the product

    q(z|x) = ∏_{m=1}^{M} q_m(zm|x),

    where factor q_m(zm|x) is the variational posterior distribution for latent rv zm.

    Mean-field VI makes the strong assumption that the latent variables are independent given x = x.

    This assumption can drastically reduce the complexity of Bayesian inference, at the cost of causing an irreducible bias when the true posterior distribution p(z|x) does not factorize as assumed.

    Osvaldo Simeone ML4Engineers 33 / 149

  • Mean-Field VI

    To see why mean-field VI can reduce complexity, consider the case in which the latent rvs zm ∈ {0, 1} are binary. The unconstrained variational posterior would have 2^M − 1 parameters to optimize, i.e., all the probabilities q(z|x) for any fixed value x. The subtraction of 1 accounts for the constraint that the probabilities must sum to 1.

    For this case, mean-field VI restricts the optimization to the subset Q of distributions of the form

    q(z|x) = ∏_{m=1}^{M} Bern(zm|µm),

    which depends on the M parameters µ = [µ1, ..., µM]^T ∈ [0, 1]^M, yielding an exponential reduction in complexity.

    Note that the parameters µ generally depend on x, although this is not explicitly indicated by the notation for simplicity.

    Osvaldo Simeone ML4Engineers 34 / 149

  • Mean-Field VI

    In mean-field VI, the optimization of the free energy is typically done by means of coordinate descent.

    At each iteration i = 1, 2, ..., we have the current iterate

    q^{(i−1)}(z|x) = ∏_{m=1}^{M} q_m^{(i−1)}(zm|x),

    where q_m^{(i−1)}(zm|x) represents the current iterate for the mth factor.

    Then, we pick one of the factors, say m ∈ {1, ..., M}. A typical approach is to select all factors successively, one by one, across M iterations.

    With this choice, one set of M iterations, including updates to all M factors, is typically referred to as a training epoch.

    Osvaldo Simeone ML4Engineers 35 / 149

  • Mean-Field VI

    At each iteration, we solve the problem of minimizing the free energy over q_m(zm|x) when all the other factors are fixed, i.e.,

    min_{q_m(zm|x)} fe( x, q(z|x) = q_m(zm|x) · q_{−m}^{(i−1)}(z_{−m}|x) ),

    where we have denoted q_{−m}^{(i−1)}(z_{−m}|x) := ∏_{m′≠m} q_{m′}^{(i−1)}(z_{m′}|x).

    Denoting the optimal solution as q_m^*(zm|x), we set q_m^{(i)}(zm|x) = q_m^*(zm|x) and q_{m′}^{(i)}(z_{m′}|x) = q_{m′}^{(i−1)}(z_{m′}|x) for all m′ ≠ m, and we move on to the next iteration.

    Osvaldo Simeone ML4Engineers 36 / 149

  • Mean-Field VI

    To fully specify mean-field VI, we finally need to describe how to obtain q_m^*(zm|x). As we show in the Appendix, this can be computed as

    q_m^*(zm|x) ∝ exp( E_{z_{−m}}[ log p(x, zm, z_{−m}) ] ),

    where z_{−m} ∼ q_{−m}^{(i−1)}(z_{−m}|x) and we have made explicit the dependence of the supervised log-loss −log p(x, z) on both zm and z_{−m}, with the latter including all latent variable values except for zm.

    So, computing q_m^*(zm|x) requires averaging the supervised log-loss over all other variables z_{−m} and then normalizing.

    The latter step is easy if rv zm is low-dimensional or if it takes a discrete and small number of values.

    In contrast, the average over z_{−m} is generally problematic, unless the log-loss −log p(x, zm, z_{−m}) can be written as a sum of terms, each dependent on a small subset of variables.

    Osvaldo Simeone ML4Engineers 37 / 149

  • Example

    In the Ising model, the latent variables {zm}_{m=1}^{M} are bipolar, i.e., zm ∈ {−1, +1}, for m = 1, ..., M; and there are M observations {xm}_{m=1}^{M}, also bipolar, i.e., xm ∈ {−1, +1}. Each observation xm is associated with the corresponding latent variable zm.

    The correlation among the latent variables is described by a graph that has one node for each rv zm and an edge between any two “correlated” rvs.

    This is an example of a probabilistic graphical model formalism known as Markov networks.

    Osvaldo Simeone ML4Engineers 38 / 149

  • Example

    As an example, consider a setting in which the observed variables xm ∈ {−1, +1} correspond to the M pixel values of a black-and-white noisy image, with −1 corresponding to a white pixel and +1 to a black pixel. The latent variable zm represents the noiseless value of the mth pixel.

    Adjacent pixels are expected to be correlated. We can model this by adding an edge between a pixel and the four immediately adjacent pixels in the four directions up, down, left, and right.

    Osvaldo Simeone ML4Engineers 39 / 149

  • Example

    Define as E ⊆ {1, ..., M} × {1, ..., M} the set of edges (m, m′) in the graph between latent variables zm and zm′. In the Ising model, the joint distribution factorizes as

    p(x, z) ∝ exp( η1 Σ_{(m,m′)∈E} zm zm′ + η2 Σ_{m=1}^{M} zm xm )
            = ∏_{(m,m′)∈E} exp(η1 zm zm′) · ∏_{m=1}^{M} exp(η2 zm xm).

    From this definition, a large (natural) parameter η1 > 0 yields a large probability when zm and zm′ with (m, m′) ∈ E are equal; and, similarly, a large η2 > 0 favors configurations in which zm = xm, that is, with low “observation noise”.

    Parameters η1 and η2 are assumed to be fixed and are not subject to training.

    It can be seen that the true posterior p(z|x) cannot be written in the product form assumed by mean-field VI. Therefore, mean-field VI can only obtain an approximation of the true posterior.

    Osvaldo Simeone ML4Engineers 40 / 149

  • Example

    At each iteration, mean-field VI obtains

    q_m^*(zm|x) ∝ exp( η1 zm Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm zm ),

    where we define µ_{m′}^{(i−1)} := E_{z_{m′}∼q_{m′}^{(i−1)}(z_{m′}|x)}[z_{m′}].

    Note that we have

    µ_{m′}^{(i−1)} = q_{m′}^{(i−1)}(z_{m′} = 1|x) − q_{m′}^{(i−1)}(z_{m′} = −1|x) = 2 q_{m′}^{(i−1)}(z_{m′} = 1|x) − 1.

    Imposing normalization, so that the condition q_m^*(zm = 1|x) + q_m^*(zm = −1|x) = 1 holds, we obtain the normalization constant as

    exp( η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm ) + exp( −η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} − η2 xm ),

    which yields

    q_m^*(zm = 1|x) = σ( 2( η1 Σ_{m′:(m,m′)∈E} µ_{m′}^{(i−1)} + η2 xm ) ).

    Osvaldo Simeone ML4Engineers 41 / 149
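The update just derived can be implemented in a few lines. The Python sketch below runs mean-field VI on a small image under the Ising model; the image size, number of epochs, and value of η2 are illustrative assumptions (η1 = 0.15 matches the numerical example that follows).

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def mean_field_ising(x, eta1, eta2, num_epochs=20):
    """Mean-field VI for the Ising denoising model (illustrative sketch).

    x: 2D array of +/-1 observed pixels; returns mu[m] = E_q[z_m] for each pixel.
    """
    H, W = x.shape
    mu = np.zeros((H, W))                # initial means mu_m = 0, i.e., q_m(z_m = 1|x) = 0.5
    for _ in range(num_epochs):
        for r in range(H):               # one epoch = one coordinate update per latent variable
            for c in range(W):
                nbr = 0.0                # sum of current means over the up/down/left/right neighbours
                if r > 0:     nbr += mu[r - 1, c]
                if r < H - 1: nbr += mu[r + 1, c]
                if c > 0:     nbr += mu[r, c - 1]
                if c < W - 1: nbr += mu[r, c + 1]
                q1 = sigmoid(2.0 * (eta1 * nbr + eta2 * x[r, c]))   # q_m(z_m = 1|x)
                mu[r, c] = 2.0 * q1 - 1.0
    return mu

# usage on a random noisy 4x4 image (hypothetical parameters)
rng = np.random.default_rng(0)
x = rng.choice([-1, 1], size=(4, 4))
mu = mean_field_ising(x, eta1=0.15, eta2=1.0)
```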

  • Example

    For a numerical example, consider a 4 × 4 binary image z observed as a noisy matrix x, where the joint distribution of x and z is given by the Ising model.

    Note that, according to this model, the observed matrix x is such that each pixel of the original matrix z is flipped independently with probability σ(−2η2), i.e., we have p(x|z) = ∏_m p(xm|zm) with p(xm ≠ zm|zm) = σ(−2η2).

    In this small example, it is easy to generate an image x distributed according to the model, as well as to compute the exact posterior p(z|x) by enumeration of all possible images z.

    The KL divergence KL(p(z|x)||q(z|x)) between the true posterior and the mean-field VI approximation obtained at the end of each epoch – with one epoch making one iteration across all variables – is shown in the next figure for η1 = 0.15 and various values of η2.

    Osvaldo Simeone ML4Engineers 42 / 149

  • Example

    As η2 increases, the posterior distribution tends to become deterministic, since x is an increasingly accurate measurement of z. As a result, the final mean-field approximation is more faithful to the real posterior, since a product distribution can capture a deterministic pmf. For smaller values of η2, however, the bias due to the mean-field assumption yields a large floor on the achievable KL divergence.

    [Figure: KL(p(z|x)||q(z|x)) versus training epoch, on a logarithmic scale, for η1 = 0.15 and various values of η2.]

    Osvaldo Simeone ML4Engineers 43 / 149

  • VI Based on the Factorization of the Variational Posterior

    Mean-field VI is a message passing algorithm in the sense that, at each iteration, the scheduled node zm passes its updated variational posterior q_m^*(zm|x) to the other nodes for further processing.

    Mean-field VI can be generalized by considering factorizations over the latent variables in which the scopes of the factors, i.e., the sets of latent rvs they depend on, possibly overlap.

    This yields more complex message passing algorithms, and is a vast area of research.

    Osvaldo Simeone ML4Engineers 44 / 149

  • Parametric Variational Inference

    Osvaldo Simeone ML4Engineers 45 / 149

  • Parametric Variational Inference

    In parametric VI, the optimization domain Q consists of variational posteriors q(z|x, ϕ) that are parametrized by a vector ϕ. Vector ϕ is referred to as the variational parameter vector.

    With parametric VI, we are therefore interested in minimizing the free energy

    min_ϕ fe(x, q(z|x, ϕ))

    with respect to the variational parameters ϕ.

    Note that the optimization is implicitly constrained within a given feasibility domain. For instance, if ϕ is a vector of probabilities for binary variables, as in the example above, all entries should be in the interval [0, 1].

    It is also important to emphasize that the optimization is done here separately for each value x. Therefore, the parameter ϕ that solves the VI problem above is a function of x, although this is not made explicit by the notation for simplicity.

    We will discuss later how the optimization can be “amortized” across all values x.

    Osvaldo Simeone ML4Engineers 46 / 149

  • Parametric Variational Inference

    A standard solution for parametric VI is to use GD. To implement GD, we need to compute the gradient of the free energy, ∇_ϕ fe(x, q(z|x, ϕ)).

    The complexity of this computation depends on the specific choice of the joint distribution p(x, z) and of the variational posterior family q(z|x, ϕ).

    We now first review a general-purpose method known as black-box VI, which applies broadly to most choices of p(x, z) and of the variational posterior family q(z|x, ϕ); we will then study a potentially more efficient solution that assumes that the variational posterior q(z|x, ϕ) has a specific “reparametrizable” form.

    Osvaldo Simeone ML4Engineers 47 / 149

  • Black-Box Variational Inference

    Osvaldo Simeone ML4Engineers 48 / 149

  • Stochastic Optimization Over Averaging Distribution

    To start, let us consider a related problem, which we will connect back to VI later on.

    The problem of interest is the stochastic optimization of an average with respect to the averaging distribution. In mathematical terms, the problem is defined as the minimization

    min_ϕ E_{z∼q(z|ϕ)}[g(z)]

    for some scalar function g(z) of a vector z.

    The minimization is hence over the parameter vector ϕ of the distribution over which we compute the average.

    Note that, as we will detail later, the problem of minimizing the free energy over the variational parameters has a similar form.

    We will be specifically interested in GD solutions, which require computing the gradient

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Osvaldo Simeone ML4Engineers 49 / 149

  • Example

    As a simple example, consider the problem above with g(z) = (1/2)(z1^2 + 5 z2^2) and q(z|ϕ) = N(z|ϕ, I2), where ϕ ∈ R^2 and z = [z1, z2]^T.

    In this case, we can directly write the problem as

    min_ϕ { E_{z∼q(z|ϕ)}[g(z)] = (1/2) E_{z∼q(z|ϕ)}[z1^2 + 5 z2^2] = (1/2)(ϕ1^2 + 5 ϕ2^2 + 6) }.

    The gradient can also be directly computed as

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] = [ϕ1, 5 ϕ2]^T,

    and the optimal solution is given as ϕ1 = ϕ2 = 0.

    Osvaldo Simeone ML4Engineers 50 / 149

  • Stochastic Optimization Over Averaging Distribution

    In general, it is not possible to compute the objective E_{z∼q(z|ϕ)}[g(z)], or its gradient ∇_ϕ E_{z∼q(z|ϕ)}[g(z)], in closed form. We assume that we can, however, evaluate g(z) for any fixed given z at a reasonable computational cost.

    As an example, consider the case in which g(z) is the output of a neural network with input given by z: computing g(z) is feasible, but averaging over all possible values of the input as in E_{z∼q(z|ϕ)}[g(z)] is generally not possible.

    How can we obtain an estimate of ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] by evaluating g(z) on just a few values of z?

    Osvaldo Simeone ML4Engineers 51 / 149

  • REINFORCE Gradient

    To address this question, we first note that we cannot just draw one, or more, samples z ∼ q(z|ϕ) and directly approximate the expectation E_{z∼q(z|ϕ)}[g(z)] with an empirical average.

    In fact, the parameters ϕ over which we wish to differentiate determine the sampling distribution q(z|ϕ). Therefore, the gradient needs to capture by how much the distribution q(z|ϕ) – and, through it, the generated samples – varies with ϕ.

    A solution is given by the REINFORCE gradient, which is an important element in reinforcement learning algorithms.

    The connection to reinforcement learning is as follows: in reinforcement learning, an agent takes an action z ∼ q(z|ϕ) by following policy q(z|ϕ). The goal is to optimize the policy parameters ϕ such that the average loss E_{z∼q(z|ϕ)}[g(z)] is minimized.

    Osvaldo Simeone ML4Engineers 52 / 149

  • REINFORCE Gradient

    The derivation of the REINFORCE gradient starts with the following equality:

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] = E_{z∼q(z|ϕ)}[ g(z) · ∇_ϕ log q(z|ϕ) ],

    where g(z) is the loss and ∇_ϕ log q(z|ϕ) is the score vector for q(z|ϕ).

    The proof of this equality follows directly from the trivial identity

    ∇_ϕ log q(z|ϕ) = ∇_ϕ q(z|ϕ) / q(z|ϕ)

    and is detailed in the Appendix.

    In words, this formula says that the gradient is the weighted average, over q(z|ϕ), of the score vector ∇_ϕ log q(z|ϕ), with weights given by the loss function g(z).

    Osvaldo Simeone ML4Engineers 53 / 149

  • REINFORCE Gradient

    To interpret this formula, recall that the score vector ∇_ϕ log q(z|ϕ) points in the direction that maximally increases the probability of the value z in the space of parameters ϕ (see Chapter 6).

    With this interpretation in mind, the REINFORCE gradient, when applied to update ϕ, “pushes” the resulting updated distribution q(z|ϕ) towards values z that have a large value g(z). The key advantage of this formula is that it is expressed as an expectation over q(z|ϕ), and hence it can be estimated by drawing samples from q(z|ϕ).

    Osvaldo Simeone ML4Engineers 54 / 149

  • REINFORCE Gradient

    Accordingly, the REINFORCE gradient algorithm is defined by the following procedure:

    I draw S i.i.d. samples zs ∼ q(z|ϕ) for s = 1, ..., S;
    I estimate the gradient as ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] ≃ ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] with

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (1/S) Σ_{s=1}^{S} (g(zs) − cs) · ∇_ϕ log q(zs|ϕ),

    where cs is an arbitrary constant, not dependent on zs, known as the baseline.

    Osvaldo Simeone ML4Engineers 55 / 149

  • REINFORCE Gradient

    Importantly, the REINFORCE gradient only requires that the function g(z) and the score vector ∇_ϕ log q(z|ϕ) be computable for a few, here S, values of z.

    In particular, it does not require the derivatives of the function g(z).

    The REINFORCE estimator can be easily seen to be unbiased for any constant c by using the fact that the score vector has zero mean, i.e., E_{z∼q(z|ϕ)}[∇_ϕ log q(z|ϕ)] = 0 (see Chapter 9). In particular, we have

    E_{z∼q(z|ϕ)}[ (g(z) − c) · ∇_ϕ log q(z|ϕ) ] = ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    The baseline cs, which may depend on the samples zs′ with s′ ≠ s, is useful to reduce the variance of the REINFORCE estimator.

    In practice, as we will see below, the baseline cs is typically selected as an average of the values g(z) observed at prior iterations.

    It is also common to set S = 1.

    Osvaldo Simeone ML4Engineers 56 / 149

  • Example

    Continuing the example above with g(z) = (1/2)(z1^2 + 5 z2^2) and q(z|ϕ) = N(z|ϕ, I2), we can obtain an estimate of the gradient, with S = 1, as:

    I draw one sample z ∼ q(z|ϕ);
    I estimate the gradient as

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (g(z) − c) · ∇_ϕ log q(z|ϕ) = (g(z) − c) · (z − ϕ).

    Note that we have used the general formula for the score vector of an exponential-family distribution (see Chapter 9).

    So, the gradient estimate points in the direction of the randomly selected z (relative to ϕ) to an extent that is proportional to the value g(z) − c.

    Osvaldo Simeone ML4Engineers 57 / 149

  • REINFORCE Gradient

    Let us summarize what we have learned so far by returning to the original problem

    min_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Setting S = 1 for simplicity, a GD method that leverages the REINFORCE gradient algorithm would work as follows:

    initialize ϕ^{(1)} (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample z^{(i)} ∼ q(z|ϕ^{(i)});
    I given a learning rate γ^{(i)} > 0, obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} (c^{(i)} − g(z^{(i)})) · ∇_ϕ log q(z^{(i)}|ϕ^{(i)});

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Note that this is a form of SGD in which the randomness is not over the selection of the data but over the sampling of the latent rvs.

    Osvaldo Simeone ML4Engineers 58 / 149
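A minimal Python sketch of this SGD procedure for the running quadratic example is given below. The learning rate, number of iterations, and the exponential-moving-average baseline (used here in place of the plain average of past losses discussed below) are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)

def g(z):
    return 0.5 * (z[0] ** 2 + 5.0 * z[1] ** 2)   # loss from the running example

# SGD on min_phi E_{z ~ N(phi, I)}[g(z)] using the REINFORCE gradient with S = 1.
phi = np.array([0.5, -1.0])
gamma, baseline = 0.02, g(np.array([0.5, -1.0]))
for i in range(5000):
    z = phi + rng.standard_normal(2)             # z ~ N(phi, I)
    score = z - phi                              # score vector of N(z|phi, I)
    phi = phi + gamma * (baseline - g(z)) * score
    baseline = 0.9 * baseline + 0.1 * g(z)       # running estimate of the average loss
print("final phi:", phi)                         # drifts towards the minimizer [0, 0]
```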

  • REINFORCE Gradient

    The REINFORCE gradient algorithm can be thought of as a perturbation-based optimization scheme:

    I Given the current parameter vector ϕ^{(i)}, one or more samples z^{(i)} ∼ q(z|ϕ^{(i)}) are generated to “explore” values of the latent variables z to which the model q(z|ϕ^{(i)}) assigns a sufficiently large probability;

    I samples that yield a small value of the objective g(z^{(i)}) − c^{(i)} are “reinforced”, in the sense that the updated probability distribution q(z|ϕ^{(i+1)}) assigns a large probability to such values of z.

    Being a perturbation-based scheme, it generally suffers from the curse of dimensionality, since exploration of the space of the variables z generally requires a larger number of samples as the dimension of z increases.

    This translates into a large variance for the REINFORCE gradient as the dimension of the vector z increases. To see this, note that, by the chain rule, the score vector can be written as the sum of terms ∇_ϕ log q(z^{(i)}|ϕ^{(i)}) = Σ_{m=1}^{M} ∇_ϕ log q(z_m^{(i)}|ϕ^{(i)}) for an M-dimensional vector z^{(i)} = [z_1^{(i)}, ..., z_M^{(i)}]^T.

    Osvaldo Simeone ML4Engineers 59 / 149

  • REINFORCE Gradient

    How to select the baseline c^{(i)}?

    From the update formula

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} (c^{(i)} − g(z^{(i)})) · ∇_ϕ log q(z^{(i)}|ϕ^{(i)}),

    we see that the parameter ϕ is updated so as to increase or decrease the probability of the sampled value z^{(i)} depending on the sign of the difference (c^{(i)} − g(z^{(i)})).

    Intuitively, we would like this difference to be positive – so that we increase the probability of z^{(i)} – when g(z^{(i)}) is smaller than the loss g(z^{(i−1)}) obtained at the previous iterate, or than some average (1/L) Σ_{j=1}^{L} g(z^{(i−j)}) over L prior iterates.

    This suggests setting c^{(i)} = (1/L) Σ_{j=1}^{L} g(z^{(i−j)}).

    In the Appendix, we derive an optimized choice for c^{(i)}, which leads to a different formula.

    Osvaldo Simeone ML4Engineers 60 / 149

  • Example

    Continuing the example above, the figure shows the evolution without a baseline (c^{(i)} = 0) and with a baseline with L = 1, over 1000 iterations, with γ^{(i)} = 0.2, S = 1, and σ = 0.1, for one realization of the updates. The initialization is ϕ^{(0)} = [0.5, −1]^T. Note the reduced “variance” of the updates with the baseline.

    Osvaldo Simeone ML4Engineers 61 / 149

  • Black-Box VI

    How does the REINFORCE gradient for stochastic optimization over an averaging distribution relate to VI?

    Fix a value x. The key observation is that the objective of black-box VI, that is, the free energy, can be written as

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) ] = E_{z∼q(z|x,ϕ)}[ g(z|ϕ) ],

    where we defined g(z|ϕ) := log( q(z|x, ϕ) / p(x, z) ).

    Therefore, the problem of minimizing the free energy is exactly in the form of an optimization over an averaging distribution, with the caveat that the function being averaged also depends on the parameter ϕ. Note again that this optimization is done separately for each value of x.

    As shown in the Appendix, this new aspect does not change the REINFORCE gradient formula, which reads

    ∇_ϕ fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) · ∇_ϕ log q(z|x, ϕ) ].

    Osvaldo Simeone ML4Engineers 62 / 149

  • Black-Box VI

    We can therefore use the REINFORCE gradient to estimate the gradient ∇_ϕ fe(x, q(z|x, ϕ)).

    This leads to the following procedure for parametric VI, which applies to any choice of joint distribution p(x, z) and variational distribution class q(z|x, ϕ), as long as the score vector ∇_ϕ log q(z|x, ϕ) is computable.

    Recall again that x is fixed here.

    Osvaldo Simeone ML4Engineers 63 / 149

  • Black-Box VI

    initialize ϕ^{(1)} (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample z^{(i)} ∼ q(z|x, ϕ^{(i)});
    I obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} + γ^{(i)} ( c^{(i)} − log( q(z^{(i)}|x, ϕ^{(i)}) / p(x, z^{(i)}) ) ) · ∇_ϕ log q(z^{(i)}|x, ϕ^{(i)});

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^{(i)} = (1/L) Σ_{j=1}^{L} log( q(z^{(i−j)}|x, ϕ^{(i−j)}) / p(x, z^{(i−j)}) ).

    Osvaldo Simeone ML4Engineers 64 / 149

  • Example

    Consider the following joint distribution p(x, z) = p(z)p(x|z):

    z ∼ p(z) = Beta(z|a, b),
    x|z = z ∼ p(x|z) = Exp(x|z) = (1/z) exp(−x/z) 1(x ≥ 0).

    Note that this model is not conjugate, although both the prior and the likelihood are distributions in the exponential family, and that the true posterior is given as

    p(z|x) = Beta(z|a, b) × Exp(x|z) / ∫ Beta(z′|a, b) × Exp(x|z′) dz′.

    Recall also that a beta distribution Beta(z|a, b) is supported on the interval [0, 1], and that a ≥ 1 determines how much the distribution is concentrated towards 1, while b ≥ 1 plays the same role for the other end of the support.

    Osvaldo Simeone ML4Engineers 65 / 149

  • Example

    Let us apply black-box VI by assuming a beta posterior Beta(z|ϕ1, ϕ2) for some fixed value x. We can obtain the natural parameters and mean parameters from the relevant tables concerning exponential-family distributions.

    Furthermore, by the general formula for the score vector for the exponential family, we have

    ∇_ϕ log Beta(z|ϕ1, ϕ2) = s(z) − µ = [ log z, log(1 − z) ]^T − [ ψ(ϕ1) − ψ(ϕ1 + ϕ2), ψ(ϕ2) − ψ(ϕ1 + ϕ2) ]^T,

    with digamma function ψ (psi in MATLAB).

    Osvaldo Simeone ML4Engineers 66 / 149
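Putting the last two slides together, the following Python sketch runs black-box VI for the beta-exponential model with a beta variational posterior. The initialization of (ϕ1, ϕ2) is an assumption; the step size γ = 0.03, S = 30, and zero baseline mirror the settings quoted for the a = b = 1 case on the next slide.

```python
import numpy as np
from scipy import stats
from scipy.special import digamma

rng = np.random.default_rng(0)

# Beta-exponential model of the example: z ~ Beta(a, b), x|z ~ Exp with mean z.
a, b, x = 1.0, 1.0, 1.0

def log_joint(z):
    return stats.beta(a, b).logpdf(z) - np.log(z) - x / z

phi = np.array([2.0, 2.0])          # initial variational parameters (assumed)
gamma, S = 0.03, 30                 # settings quoted for the a = b = 1 curve, with c = 0
for i in range(2000):
    z = rng.beta(phi[0], phi[1], size=S)
    weights = stats.beta(phi[0], phi[1]).logpdf(z) - log_joint(z)   # log(q/p)
    mu = np.array([digamma(phi[0]), digamma(phi[1])]) - digamma(phi.sum())
    scores = np.vstack([np.log(z), np.log(1.0 - z)]) - mu[:, None]  # score vectors
    grad = (scores * weights).mean(axis=1)                          # REINFORCE estimate
    phi = np.maximum(phi - gamma * grad, 1e-2)                      # descent step, keep phi > 0
print("fitted Beta(phi1, phi2):", phi)
```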

  • Example

    The quality of the approximation obtained by black-box VI depends on how well a beta distribution can describe the true posterior distribution (in the figure, we set x = 1, 2000 iterations, S = 30, c^{(i)} = 0, and γ^{(i)} = 0.03 for a = b = 1, γ^{(i)} = 0.1 for a = b = 10).

    Osvaldo Simeone ML4Engineers 67 / 149

  • Reparametrization-Based Variational Inference

    Osvaldo Simeone ML4Engineers 68 / 149

  • Reparametrization Trick

    To introduce the idea of reparametrization, let us start again by considering the minimization over the averaging distribution

    min_ϕ E_{z∼q(z|ϕ)}[g(z)]

    for some function g(z).

    As we have seen, we are interested in computing the gradient

    ∇_ϕ E_{z∼q(z|ϕ)}[g(z)].

    Osvaldo Simeone ML4Engineers 69 / 149

  • Reparametrization Trick

    The REINFORCE gradient only requires that the score vector ∇_ϕ log q(z|ϕ) be computable.

    In contrast, the reparametrization trick makes the following assumption on q(z|ϕ):

    I the rv z ∼ q(z|ϕ) can be generated by
    F generating an auxiliary variable e ∼ q(e) with a distribution q(e) that does not depend on ϕ;
    F and setting z = r(e|ϕ) for some function r(e|ϕ) that is differentiable in ϕ for every fixed value of e.

    Osvaldo Simeone ML4Engineers 70 / 149

  • Reparametrization Trick

    A typical example, but not the only one, is the Gaussian distribution q(z|ϕ) = N(z|ν, Diag(σ^2)), for which the vector of parameters is the 2M × 1 vector ϕ = (ν, σ) containing the M × 1 mean vector ν and the vector of standard deviations σ. The notation σ^2 represents the M × 1 vector of variances.

    In fact, we can generate a sample z ∼ N(z|ν, Diag(σ^2)) as follows:

    I generate an auxiliary variable

    e ∼ q(e) = N(e|0, I)

    I and set

    z = r(e|ϕ) = ν + σ ⊙ e = ν + Diag(σ)e,

    where ⊙ denotes the element-wise product.

    Note that the function r(e|ϕ) is linear, and hence differentiable, in ϕ = (ν, σ).

    Osvaldo Simeone ML4Engineers 71 / 149

  • Reparametrization Trick

    We note that any scalar variable z ∼ q(z|ϕ) can be generated as follows:

    I generate an auxiliary variable e ∼ U(e|[0, 1]);
    I and set z = F^{−1}(e|ϕ), where F(z|ϕ) = Pr[z ≤ z] is the cumulative distribution function of the rv z ∼ q(z|ϕ).

    The function F^{−1}(e|ϕ) may, however, not be differentiable, e.g., for discrete variables.

    Furthermore, it may not be the most convenient representation, e.g., for the Gaussian example above.

    Osvaldo Simeone ML4Engineers 72 / 149

  • Reparametrization Trick

    Assuming that the variational posterior satisfies the reparametrization property, we can now write the optimization problem of interest as

    min_ϕ E_{e∼q(e)}[ g(r(e|ϕ)) ].

    The key point is that now the averaging distribution does not depend on the parameter vector ϕ.

    Note that, by the chain rule of differentiation, we have the gradient

    ∇_ϕ g(r(e|ϕ)) = ∇_ϕ r(e|ϕ) · ∇_z g(z)|_{z=r(e|ϕ)},

    where ∇_ϕ r(e|ϕ) is the Q × M Jacobian of the function r(e|ϕ) for fixed e, with Q being the dimension of ϕ.

    Osvaldo Simeone ML4Engineers 73 / 149

  • Reparametrization Trick

    Therefore, we can obtain an unbiased estimate of the gradient directly by following this procedure, known as the reparametrization trick:

    I draw S i.i.d. samples es ∼ q(e) for s = 1, ..., S;
    I estimate the gradient as ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] ≃ ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] with

    ∇̂_ϕ E_{z∼q(z|ϕ)}[g(z)] = (1/S) Σ_{s=1}^{S} ∇_ϕ r(es|ϕ) · ∇_z g(zs)|_{zs = r(es|ϕ)}.

    Unlike the REINFORCE gradient, the reparametrization-based gradient depends on the derivative ∇_z g(z) of the function to be optimized, and it generally has a lower variance than the REINFORCE gradient.

    Osvaldo Simeone ML4Engineers 74 / 149

  • Example

    Let us apply the reparametrization trick to the Gaussian distribution q(z|ϕ) = N(z|ν, Diag(σ^2)).

    We parametrize σ as σ = exp(ϱ), applied as an element-wise function, and optimize over the vector ϱ in order to avoid gradient steps yielding negative values for the standard deviations.

    The parameter vector is hence given as ϕ = (ν, ϱ).

    Since we have r(e|ϕ) = ν + σ ⊙ e = ν + exp(ϱ) ⊙ e, the relevant Jacobians can be computed as

    ∇_ν r(e|ϕ) = I,
    ∇_ϱ r(e|ϕ) = Diag(exp(ϱ) ⊙ e).

    Osvaldo Simeone ML4Engineers 75 / 149

  • Example

    The resulting reparametrization-based estimate ∇̂_ϕ E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = [ ∇̂_ν E_{z∼N(z|ν,Diag(σ^2))}[g(z)]^T, ∇̂_ϱ E_{z∼N(z|ν,Diag(σ^2))}[g(z)]^T ]^T, stated for S = 1 for simplicity of notation, is obtained as:

    I draw a sample e ∼ N(e|0, I);
    I compute z = ν + σ ⊙ e, and set

    ∇̂_ν E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = ∇_z g(z),
    ∇̂_ϱ E_{z∼N(z|ν,Diag(σ^2))}[g(z)] = Diag(exp(ϱ) ⊙ e) ∇_z g(z) = exp(ϱ) ⊙ e ⊙ ∇_z g(z).

    Osvaldo Simeone ML4Engineers 76 / 149

  • Example

    Continuing the example above with g(z) = z1^2 + 5 z2^2 and a Gaussian variational posterior, the figure shows the evolution of the loss g(ν), evaluated at the mean of the distribution q(z|ϕ), for both the REINFORCE gradient (with baseline, σ = 1 and γ^{(i)} = 0.3) and the reparametrization trick (with γ^{(i)} = 0.05), over 100 iterations and for 10 realizations.

    The reparametrization gradient has a much reduced variance, as is clear from the faster, and less noisy, convergence in the figure.

    [Figure: loss g(ν) versus iteration for the REINFORCE gradient (top) and for the reparametrization trick (bottom), over 10 realizations.]

    Osvaldo Simeone ML4Engineers 77 / 149
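A minimal Python sketch of reparametrization-based SGD for this example is given below; the learning rate, iteration count, and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def g(z):
    return z[0] ** 2 + 5.0 * z[1] ** 2           # loss from the comparison example

def grad_g(z):
    return np.array([2.0 * z[0], 10.0 * z[1]])   # its gradient, needed by the reparametrization trick

# SGD on min_{nu, rho} E_{z ~ N(nu, Diag(exp(2 rho)))}[g(z)] with z = nu + exp(rho) * e,
# e ~ N(0, I), and S = 1.
nu, rho = np.array([0.5, -1.0]), np.zeros(2)
gamma = 0.05
for i in range(100):
    e = rng.standard_normal(2)
    z = nu + np.exp(rho) * e
    nu = nu - gamma * grad_g(z)                        # Jacobian w.r.t. nu is the identity
    rho = rho - gamma * np.exp(rho) * e * grad_g(z)    # Jacobian w.r.t. rho is Diag(exp(rho) * e)
print("mean after training:", nu)                      # drifts towards [0, 0]
```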

  • Reparametrization-Based VI

    How can the reparametrization-based gradient for stochastic optimization over an averaging distribution be applied to VI?

    As we have observed, the free energy can be written as

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ log( q(z|x, ϕ) / p(x, z) ) ] = E_{z∼q(z|x,ϕ)}[ g(z|ϕ) ],

    and hence the function being averaged also depends on ϕ. Note again that the value of x is fixed here.

    We need to address this dependence in order to apply the reparametrization trick.

    Osvaldo Simeone ML4Engineers 78 / 149

  • Reparametrization-Based VI

    A common approach is to express the free energy as (see the Appendix of Chapter 7)

    fe(x, q(z|x, ϕ)) = E_{z∼q(z|x,ϕ)}[ −log p(x|z) ] + KL( q(z|x, ϕ) || p(z) ),

    where the first term is the average log-loss of the soft predictor p(x|z) and the second term is the KL divergence between the variational posterior and the prior.

    We can then directly use the reparametrization trick for the first term, since the function being averaged, −log p(x|z), does not depend on ϕ.

    For the second term, if we choose the variational posterior q(z|x, ϕ) and the prior p(z) to be from the same class in the exponential family, the KL divergence can be computed in closed form (see Chapter 9), and so can its gradient.

    As an example, if we choose a Gaussian prior and a Gaussian variational posterior, we can estimate the gradient of the first term using the reparametrization trick, while the gradient of the second term can be computed in closed form.

    Osvaldo Simeone ML4Engineers 79 / 149

  • Reparametrization-Based VI

    To elaborate on the gradient of the second term, let us select the prior to be from a class of distributions in the exponential family with a fixed mean parameter µ_p, i.e.,

    p(z) = ExpFam(z|µ_p),

    and the variational posterior to be in the same class, with

    q(z|x, ϕ) = ExpFam(z|µ(ϕ)),

    where the mean parameter vector µ(ϕ) is a function of the variational parameter ϕ.

    We assume a minimal parametrization, and the corresponding natural parameters are defined as η_p and η(ϕ).

    Osvaldo Simeone ML4Engineers 80 / 149

  • Reparametrization-Based VI

    Using the analytical form of the KL divergence, the gradient of the KL divergence KL(q(z|µ)||p(z|µ_p)) with respect to µ can be computed as (see Appendix)

    ∇_µ KL( q(z|µ) || p(z|µ_p) ) = η − η_p,

    and hence we have

    ∇_ϕ KL( q(z|µ(ϕ)) || p(z|µ_p) ) = ∇_ϕ µ(ϕ) · ( η(ϕ) − η_p ).

    Note that, if the variational parameters ϕ are the natural parameters of the distribution q(z|x, ϕ), then, as discussed in the previous chapter, natural gradient descent would require the application of the gradient with respect to µ in the first equation. This can simplify the implementation.

    Osvaldo Simeone ML4Engineers 81 / 149

  • Example

    For the standard case of Gaussian distributions, recalling that for N(ν, Σ) the mean parameters are given as µ = [ν, Σ + νν^T], we have the gradients

    ∇_ν KL( N(ν, Σ) || N(ν_p, Σ_p) ) = Σ_p^{−1}(ν − ν_p),
    ∇_Σ KL( N(ν, Σ) || N(ν_p, Σ_p) ) = (1/2)( Σ_p^{−1} − Σ^{−1} ),

    and hence, as a special case, we have

    ∇_ν KL( N(ν, Diag(σ^2)) || N(0, I) ) = ν,
    ∇_σ KL( N(ν, Diag(σ^2)) || N(0, I) ) = σ − σ^{−1},

    where the inverse in σ^{−1} is applied element-wise.

    Based on the derivations above, with q(z|x, ϕ) = N(z|ν, Diag(σ^2)), ϕ = (ν, ϱ), and σ = exp(ϱ), as well as p(z) = N(z|0, I), we have the following reparametrization-based VI algorithm.

    Osvaldo Simeone ML4Engineers 82 / 149

  • Reparametrization-Based VI

    initialize ϕ^{(1)} = (ν^{(1)}, ϱ^{(1)}) (e.g., randomly)
    for i = 1, 2, ...

    I draw one sample e^{(i)} ∼ N(0, I);
    I compute z^{(i)} = ν^{(i)} + exp(ϱ^{(i)}) ⊙ e^{(i)}, and estimate the gradients as

    ∇̂_ν fe(x, q(z|x, ϕ^{(i)})) = ∇_z(−log p(x|z^{(i)})) + ν^{(i)},
    ∇̂_ϱ fe(x, q(z|x, ϕ^{(i)})) = exp(ϱ^{(i)}) ⊙ e^{(i)} ⊙ ∇_z(−log p(x|z^{(i)})) + exp(ϱ^{(i)}) ⊙ ( σ^{(i)} − (σ^{(i)})^{−1} );

    I obtain the next iterate as

    ϕ^{(i+1)} ← ϕ^{(i)} − γ^{(i)} ∇̂_ϕ fe(x, q(z|x, ϕ^{(i)})),

    where ∇̂_ϕ fe(x, q(z|x, ϕ^{(i)})) = [ ∇̂_ν fe(x, q(z|x, ϕ^{(i)}))^T, ∇̂_ϱ fe(x, q(z|x, ϕ^{(i)}))^T ]^T;

    I if the stopping criterion is satisfied, then return ϕ^{(i+1)}; otherwise, continue.

    Osvaldo Simeone ML4Engineers 83 / 149
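The following Python sketch applies this algorithm to the Gaussian example considered next, for which ∇_z(−log p(x|z)) = β(z − x). The learning rate, number of iterations, and initialization are illustrative assumptions that loosely mirror the example's settings.

```python
import numpy as np

rng = np.random.default_rng(2)

# Model of the next example: z ~ N(0, 1), x|z ~ N(z, 1/beta),
# so that grad_z(-log p(x|z)) = beta * (z - x).
beta, x = 0.5, 1.0

nu, rho = -2.0, np.log(0.1)      # initialization (nu, rho); sigma starts at 0.1 (assumed)
gamma = 0.01
for i in range(2000):
    e = rng.standard_normal()
    sigma = np.exp(rho)
    z = nu + sigma * e
    grad_logloss = beta * (z - x)                                     # grad of -log p(x|z) at z
    grad_nu = grad_logloss + nu                                       # reparametrization + KL terms
    grad_rho = sigma * e * grad_logloss + sigma * (sigma - 1.0 / sigma)
    nu = nu - gamma * grad_nu
    rho = rho - gamma * grad_rho
print("VI posterior:   mean", nu, "std", np.exp(rho))
print("true posterior: mean", beta * x / (1 + beta), "std", (1 / (1 + beta)) ** 0.5)
```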

  • Example

    Let us consider an example for which the posterior distribution can be computed exactly, in order to validate reparametrization-based VI.

    Specifically, consider the joint distribution given by z ∼ N(0, 1) and x|z = z ∼ N(z, β^{−1}) for some precision β.

    From Chapter 3, we know that the posterior distribution is given as

    p(z|x) = N( z | βx/(1 + β), 1/(1 + β) ).

    Let us now assume a variational posterior q(z|ϕ) = N(z|ν, σ^2). We plot the KL divergence

    KL( q(z|ϕ^{(i)}) || p(z|x) ) = (1/2)[ (1 + β)(σ^{(i)})^2 − log( (1 + β)(σ^{(i)})^2 ) + (1 + β)( βx/(1 + β) − ν^{(i)} )^2 − 1 ].

    Osvaldo Simeone ML4Engineers 84 / 149

  • Example

    The dashed arrow points in the direction of increasing values of the iteration index i, showing an increasingly accurate estimate of the correct posterior (β = 0.5, x = 1, ϕ^{(0)} = [−2, 0.1]^T, γ^{(i)} = 0.01).

    [Figure: variational posterior iterates q(z|ϕ^{(i)}) approaching the true posterior (top), and KL(q(z|ϕ^{(i)})||p(z|x)) versus iteration (bottom).]

    Osvaldo Simeone ML4Engineers 85 / 149

  • Other Parametric Approximation Methods

    Osvaldo Simeone ML4Engineers 86 / 149

  • Factorized and Parametric Variational Posteriors

    We have seen above that VI can be applied on the space of factorized distributions, in which case optimization is typically done via coordinate descent; or by assuming parametric variational posteriors, in which case optimization is typically done by SGD.

    The two approaches can be combined: the free energy can be optimized over the space Q of variational posteriors that factorize as

    q(z|x, ϕ) = ∏_{m=1}^{M} q_m(zm|x, ϕm),

    where each factor q_m(zm|x, ϕm) is a parametric function.

    More general factorizations over the latent variables, in which the scopes of the factors possibly overlap, are also naturally possible.

    Minimization of the free energy can be carried out by coordinate-wise SGD, in which the optimization over each parameter vector ϕm is done iteratively via SGD.

    The derivation follows from a direct combination of the mean-field VI and parametric VI methods seen above.

    Osvaldo Simeone ML4Engineers 87 / 149

  • Laplace Approximation

    Variational inference is not the only way to obtain a parametric approximation of a posterior distribution.

    A simpler approach is known as Laplace approximation.

    The idea is to fit a Gaussian distribution around the maximum a posteriori (MAP) solution z_MAP = arg max_z p(x, z). Note that this corresponds to the point at which the posterior is maximized.

    Osvaldo Simeone ML4Engineers 88 / 149

  • Laplace Approximation

    The Laplace approximation chooses a Gaussian distribution q(z|x) = N(z|z_MAP, Θ^{−1}), where the precision matrix Θ is chosen to match the Hessian (with respect to z) of the log-loss −log p(z|x) at z = z_MAP.

    The Hessian of the negative log-posterior −log p(z|x) with respect to z equals the Hessian of the negative log-joint distribution −log p(x, z). This is because the multiplicative normalization term needed to obtain the posterior from the joint does not depend on the value of z. This property is important since one does not have access to p(z|x), which we are trying to approximate, but only to −log p(x, z).

    To elaborate, note that the Hessian of −log q(z|x) = −log N(z|z_MAP, Θ^{−1}) with respect to z is equal to Θ. Accordingly, the Laplace approximation computes the Hessian

    ∇^2_z(−log p(z|x)) = ∇^2_z(−log p(x, z))

    with respect to z, and then sets Θ = ∇^2_z(−log p(x, z))|_{z=z_MAP}.

    The Laplace approximation can be quite accurate for posterior distributions that have a single mode.

    Osvaldo Simeone ML4Engineers 89 / 149
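As an illustration, the Python sketch below computes the Laplace approximation for the beta-exponential model used earlier, obtaining z_MAP numerically and approximating the Hessian by a finite difference; the values of (a, b) and x are illustrative assumptions.

```python
import numpy as np
from scipy import stats
from scipy.optimize import minimize_scalar

# Laplace approximation for the beta-exponential model of the earlier example
# (a = b = 10 and x = 1 are illustrative choices).
a, b, x = 10.0, 10.0, 1.0

def neg_log_joint(z):
    # -log p(x, z) = -log Beta(z|a, b) - log Exp(x|z), up to constants
    return -stats.beta(a, b).logpdf(z) + np.log(z) + x / z

# 1) MAP solution: maximize p(x, z) over z in (0, 1).
z_map = minimize_scalar(neg_log_joint, bounds=(1e-4, 1 - 1e-4), method="bounded").x

# 2) Precision = second derivative of -log p(x, z) at z_MAP (numerical here).
h = 1e-4
theta = (neg_log_joint(z_map + h) - 2 * neg_log_joint(z_map) + neg_log_joint(z_map - h)) / h**2

print(f"Laplace approximation: N(z | {z_map:.3f}, {1/theta:.4f})")
```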

  • Example

    The figure below shows the Laplace approximation for the beta-exponential example introduced above. The approximation is more accurate the closer the posterior distribution is to a Gaussian distribution.

    [Figure: true posterior and Laplace approximation for the beta-exponential example, for two choices of the prior parameters (a, b).]

    Osvaldo Simeone ML4Engineers 90 / 149

  • Non-Parametric VI

    Osvaldo Simeone ML4Engineers 91 / 149

  • Non-Parametric VI

    In parametric VI, the variational posterior q(z|x, ϕ), for any fixed x, is parametrized by a vector ϕ of parameters. An example is given by the mean and variance vectors of Gaussian variational posteriors.

    When the true posterior p(z|x) cannot be well approximated by any of the distributions in the set of variational posteriors q(z|x, ϕ), the solution of the parametric VI problem is impaired by an irreducible bias.

    For instance, consider the case in which the true posterior is multi-modal, while the variational distribution is chosen as Gaussian, and is hence unimodal. The variational posterior cannot capture more than one mode of the posterior.

    As we discuss next, an alternative approach is to parametrize the variational distribution with a flexible number of “particles” in the space of the latent variables through a kernel density estimator (KDE).

    Osvaldo Simeone ML4Engineers 92 / 149

  • Non-Parametric VI

    In Chapter 7, we have seen that a distribution q(z) can be approximated with a number S of “particles” {zs}_{s=1}^{S} using the kernel density estimator (KDE)

    q(z) ≃ (1/S) Σ_{s=1}^{S} κ_h(z − zs),

    where κ_h(z) is a (non-negative) kernel function with “bandwidth” h, such as κ_h(z) = N(z|0, h).

    This suggests parametrizing the variational posterior with a set ϕ = {zs}_{s=1}^{S} of particles, as

    q(z|x, ϕ) = (1/S) Σ_{s=1}^{S} κ_h(z − zs).

    The particles depend implicitly on the value x, as for parametric VI. Unlike parametric VI, the parameters ϕ = {zs}_{s=1}^{S} do not impose a specific structure on the variational posterior, apart from the smoothness properties implied by the choice of a particular kernel.

    Osvaldo Simeone ML4Engineers 93 / 149
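A minimal Python sketch of this particle-based parametrization is given below; the particle locations, bandwidth, and evaluation grid are illustrative assumptions.

```python
import numpy as np

def kde_posterior(z_grid, particles, h=0.1):
    """Particle-based variational posterior q(z|x, phi) as a Gaussian KDE (sketch)."""
    # One Gaussian kernel N(z|z_s, h) per particle, averaged over the S particles.
    diffs = z_grid[:, None] - particles[None, :]
    kernels = np.exp(-0.5 * diffs**2 / h) / np.sqrt(2.0 * np.pi * h)
    return kernels.mean(axis=1)

# usage with hypothetical particles
particles = np.array([-1.2, -0.9, 0.8, 1.1, 1.3])
z_grid = np.linspace(-3, 3, 7)
print(kde_posterior(z_grid, particles))
```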

  • Non-Parametric VI

    The outlined “non-parametric parametrization” of the variational posterior can flexibly represent any posterior distribution, as long as the number of particles is large enough and the posterior distribution is sufficiently smooth.

    As with all non-parametric techniques, this approach suffers from the curse of dimensionality: as the dimensionality of the latent space increases, the number of required particles generally increases exponentially.

    That said, for sufficiently small dimensions of the latent vector z, the flexibility of the non-parametric model can yield significantly more accurate approximations of the posterior that are not limited by the bias of parametric models.

    Osvaldo Simeone ML4Engineers 94 / 149

  • Stein Variational Gradient Descent (SVGD)

    Non-parametric VI is hence formulated as the problem of minimizing the free energy over the particles ϕ = {zs}_{s=1}^{S}, i.e.,

    min_ϕ fe(x, q(z|x, ϕ)).

    Stein Variational Gradient Descent (SVGD) is an iterative method that updates a set of particles in the direction that locally minimizes the free energy.

    Given a set of particles ϕ^{(i−1)} = {z_s^{(i−1)}}_{s=1}^{S} at iteration i − 1, SVGD applies a smooth transformation r(·) to all particles as

    z_s^{(i)} ← z_s^{(i−1)} + γ^{(i)} r(z_s^{(i−1)}), for s = 1, ..., S,

    with learning rate γ^{(i)}.

    The transformation r(·) is optimized to maximize the rate of local descent of the free energy within a specific functional class known as a reproducing kernel Hilbert space (RKHS).

    Osvaldo Simeone ML4Engineers 95 / 149

  • Stein Variational Gradient Descent (SVGD)

    An RKHS is determined by a positive semi-definite kernel function κ(z, z′). In particular, an RKHS contains functions that can be expressed as linear combinations of functions κ(z, ·) for a number of values z.

    As discussed in Chapter 4 (see Appendix), a positive semi-definite kernel function κ(z, z′) can be thought of as defining the degree of “similarity” between two vectors z and z′.

    A typical example of a positive semi-definite kernel function is the “Gaussian” kernel κ(z, z′) = exp(−||z − z′||^2). Note that the kernel need not be normalized.

    Note also that the positive semi-definite kernel function κ(·, ·) is conceptually different from the non-negative kernel function κ_h(·) used for the KDE.

    By restricting the optimization to a specific subset of an RKHS and approximating the resulting expectation, it can be proven that the optimal update operates as summarized in the following algorithm.

    Osvaldo Simeone ML4Engineers 96 / 149

  • Stein Variational Gradient Descent (SVGD)

    initialize particles ϕ^(1) = {z_s^(1)}_{s=1}^S (e.g., randomly)
    for i = 1, 2, ...
    I for all particles s = 1, ..., S, obtain the next iterate as

      z_s^(i+1) ← z_s^(i) + γ^(i) r_s^(i)

      with

      r_s^(i) = ∑_{s′=1}^S { κ(z_s^(i), z_{s′}^(i)) ∇_{z_{s′}} log p(x, z_{s′}^(i)) + ∇_{z_{s′}} κ(z_s^(i), z_{s′}^(i)) },

      where the first term weights the negative supervised loss gradient by the kernel, and the second term acts as a repulsive "force";

    I if stopping criterion satisfied, then return ϕ^(i+1) = {z_s^(i+1)}_{s=1}^S; otherwise, continue.
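
    The following NumPy sketch implements one SVGD iteration for a scalar latent variable with the Gaussian kernel κ(z, z′) = exp(−(z − z′)²); the target grad_log_joint (the score of a two-peak unnormalized joint), the number of particles, and the learning rate are illustrative assumptions.

    import numpy as np

    def svgd_update(particles, grad_log_joint, lr=5e-2):
        # One SVGD iteration for scalar particles with the Gaussian kernel
        # kappa(z, z') = exp(-(z - z')**2). grad_log_joint(z) must return
        # d/dz log p(x, z), i.e., the negative supervised-loss gradient.
        z = particles[:, None]                        # (S, 1), indexed by s
        zp = particles[None, :]                       # (1, S), indexed by s'
        kappa = np.exp(-(z - zp) ** 2)                # kappa(z_s, z_s')
        grad_kappa = 2.0 * (z - zp) * kappa           # d kappa / d z_s' (repulsive force)
        score = grad_log_joint(particles)[None, :]    # d log p(x, z_s') / d z_s'
        r = (kappa * score + grad_kappa).sum(axis=1)  # driving term + repulsion over s'
        return particles + lr * r

    # illustrative two-peak target: log p(x, z) = log(N(z|1.5, 1) + N(z|-1.5, 1)) + const.
    def grad_log_joint(z):
        p1 = np.exp(-0.5 * (z - 1.5) ** 2)
        p2 = np.exp(-0.5 * (z + 1.5) ** 2)
        return (-(z - 1.5) * p1 - (z + 1.5) * p2) / (p1 + p2)

    particles = np.random.randn(20)
    for _ in range(500):
        particles = svgd_update(particles, grad_log_joint)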

    Osvaldo Simeone ML4Engineers 97 / 149

  • Stein Variational Gradient Descent (SVGD)

    As a result, each particle is updated in a direction that depends on all other particles:

    I The first term moves each particle z_s in the direction that minimizes the supervised loss, weighting the contribution of each other particle z_{s′} by the similarity κ(z_s, z_{s′});

    I and the second term ensures that particles are repulsed if they are "too similar".

    To understand the operation of the repulsive force due to the second term, consider the use of the kernel κ(z_s, z_{s′}) = exp(−||z_s − z_{s′}||²), for which we have

    ∇_{z_{s′}} κ(z_s, z_{s′}) = 2(z_s − z_{s′}) κ(z_s, z_{s′}).

    This contribution to the update hence drives particle z_s away from particle z_{s′}, i.e., in the direction (z_s − z_{s′}), to a degree that is proportional to the similarity κ(z_s, z_{s′}).

    Osvaldo Simeone ML4Engineers 98 / 149

  • Example

    In the figure, the blue line is the true posterior (obtained by normalizing p(x, z)), and the red line represents the KDE, with h = 0.1, obtained from the S = 20 particles, shown as a function of the iteration number i.

    The particles first move towards the first peak of the posterior, driven by the first term in the update, and are then repulsed by the second term so as to cover the support of the distribution.

    [Figure: six panels, at successive iterations i, showing the true posterior (blue) and the particle-based KDE (red) over the range z ∈ [−4, 4].]

    Osvaldo Simeone ML4Engineers 99 / 149

  • Amortized Variational Inference

    Osvaldo Simeone ML4Engineers 100 / 149

  • Amortized VI

    Up to now, we have studied the problem of computing an approximation of the posterior p(z|x) for a given value x. This yields computationally intensive solutions, whereby the optimization of the variational posterior q(z|x, ϕ) needs to be repeated for all values of x of interest.

    Amortized VI aims at "amortizing" the complexity of VI by sharing the parameters ϕ of the variational posterior q(z|x, ϕ) for all values x. This is done by minimizing the average free energy

    min_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ))],

    where the average is taken over the marginal distribution p(x) of the input.

    The model q(z |x , ϕ) can be any parametrized conditional probability.

    Osvaldo Simeone ML4Engineers 101 / 149

  • Amortized VI

    As an example, a typical choice for the amortized variational posterior is the Gaussian distribution defined as

    q(z|x, ϕ) = N(z | ν(x|ϕ), Diag(s(x|ϕ)²)),

    with mean vector ν(x|ϕ) = [ν_1(x|ϕ), ..., ν_M(x|ϕ)]^T and diagonal covariance matrix Diag(s(x|ϕ)²) with diagonal entries s_1(x|ϕ)², ..., s_M(x|ϕ)², where the 2M × 1 vector (ν(x|ϕ), log(s(x|ϕ))) is the output of a neural network with weights ϕ and input x.

    Note that this parametrization assumes standard deviations s_m(x|ϕ) = exp(log(s_m(x|ϕ))), which ensures a positive value for all possible (finite) outputs log(s(x|ϕ)) of the neural network.

    A key advantage of this parametrization is that, assuming a Gaussian prior p(z) = N(z|0, I), it enables the use of the reparametrization trick, as detailed next.
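
    A minimal PyTorch sketch of such a network is given below; the hidden-layer size and the names AmortizedGaussianPosterior and sample are illustrative choices rather than a prescribed architecture.

    import torch
    import torch.nn as nn

    class AmortizedGaussianPosterior(nn.Module):
        # Neural network producing the 2M outputs (nu(x|phi), log s(x|phi))
        # of the Gaussian amortized variational posterior q(z|x, phi).
        def __init__(self, x_dim, z_dim, hidden=32):
            super().__init__()
            self.body = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.mean_head = nn.Linear(hidden, z_dim)       # nu(x|phi)
            self.log_std_head = nn.Linear(hidden, z_dim)    # log s(x|phi)

        def forward(self, x):
            h = self.body(x)
            nu = self.mean_head(h)
            s = torch.exp(self.log_std_head(h))             # guarantees s > 0
            return nu, s

        def sample(self, x):
            # Reparametrized sample z = nu(x|phi) + s(x|phi) * e with e ~ N(0, I).
            nu, s = self(x)
            return nu + s * torch.randn_like(s)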

    Osvaldo Simeone ML4Engineers 102 / 149

  • Amortized VI

    Using this choice for the amortized variational posterior, to apply the reparametrization trick with S = 1, we draw one sample x ∼ p(x) and one auxiliary sample e ∼ N(0, I). Furthermore, at the given parameter vector ϕ, we compute z = ν(x|ϕ) + s(x|ϕ) ⊙ e.

    We then need to compute the gradient estimate ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ))], which is given as

    ∇_ϕ(ν(x|ϕ) + s(x|ϕ) ⊙ e) · ∇_z(− log p(x|z)) + ∇_ϕ KL(q(z|x, ϕ)||p(z)).

    The first term can be directly computed as

    (∇_ϕ ν(x|ϕ) + ∇_ϕ(s(x|ϕ) ⊙ e)) · ∇_z(− log p(x|z)),

    where the Jacobians ∇_ϕ ν(x|ϕ) and ∇_ϕ s(x|ϕ), as well as the log-loss gradient ∇_z(− log p(x|z)), can be directly computed, typically via automatic differentiation, depending on the given implementation.

    Using the gradients seen above for the KL divergence, the second term can be evaluated as

    ∇_ϕ KL(q(z|x, ϕ)||p(z)) = ∇_ϕ ν(x|ϕ) · ν(x|ϕ) + ∇_ϕ s(x|ϕ) · (s(x|ϕ) − (s(x|ϕ))^(−1)).

    Osvaldo Simeone ML4Engineers 103 / 149

  • Reparametrized Amortized VI

    initialize ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) ∼ p(x) and one auxiliary sample e^(i) ∼ N(0, I);
    I compute z^(i) = ν(x^(i)|ϕ^(i)) + s(x^(i)|ϕ^(i)) ⊙ e^(i) and estimate the gradient ∇_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] ≈ ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] with

      ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] = (∇_ϕ ν(x^(i)|ϕ^(i)) + ∇_ϕ(s(x^(i)|ϕ^(i)) ⊙ e^(i))) · ∇_z(− log p(x^(i)|z^(i)))
                                              + ∇_ϕ ν(x^(i)|ϕ^(i)) · ν(x^(i)|ϕ^(i))
                                              + ∇_ϕ s(x^(i)|ϕ^(i)) · (s(x^(i)|ϕ^(i)) − (s(x^(i)|ϕ^(i)))^(−1));

    I obtain the next iterate as

      ϕ^(i+1) ← ϕ^(i) − γ^(i) ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))];

    I if stopping criterion satisfied, then return ϕ^(i+1); otherwise, continue.
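
    The following PyTorch sketch implements one such doubly stochastic update, letting automatic differentiation handle the Jacobian products and using the closed-form KL divergence to the prior N(0, I); the posterior module (e.g., the AmortizedGaussianPosterior sketched earlier) and the callable log_lik(x, z), which returns log p(x|z), are assumptions.

    import torch

    def reparam_amortized_vi_step(x, posterior, log_lik, optimizer):
        # One S = 1 update of phi based on the reparametrization trick, for a
        # single observation x drawn from p(x) (e.g., a data point).
        nu, s = posterior(x)
        z = nu + s * torch.randn_like(s)                 # reparametrized sample
        # KL( N(nu, diag(s^2)) || N(0, I) ), summed over the latent dimensions
        kl = 0.5 * (s ** 2 + nu ** 2 - 1.0 - 2.0 * torch.log(s)).sum()
        free_energy = -log_lik(x, z) + kl                # single-sample estimate
        optimizer.zero_grad()
        free_energy.backward()                           # autograd replaces the manual Jacobians
        optimizer.step()                                 # phi <- phi - gamma * gradient estimate
        return free_energy.item()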

    Osvaldo Simeone ML4Engineers 104 / 149

  • Reparametrized Amortized VI

    Note that the expression above for the gradient estimate ∇̂_ϕ E_{x∼p(x)}[fe(x, q(z|x, ϕ^(i)))] assumes the general case in which the parameters ϕ affect both the mean and the standard deviation vectors.

    If disjoint parts of the vector ϕ enter the mean and standard deviation vectors, the gradient with respect to each part of the vector includes only the relevant two terms in the sum presented in the previous slide.

    Osvaldo Simeone ML4Engineers 105 / 149

  • Black-Box Amortized VI

    When the variational distribution does not have a reparametrizable structure, one can use black-box VI, i.e., the REINFORCE gradient, in lieu of the reparametrization trick, yielding the algorithm below.

    initialize ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) ∼ p(x) and one sample for the latent variable as z^(i) ∼ q(z|x^(i), ϕ^(i));
    I obtain the next iterate as

      ϕ^(i+1) ← ϕ^(i) + γ^(i) (c^(i) − log(q(z^(i)|x^(i), ϕ^(i)) / p(x^(i), z^(i)))) · ∇_ϕ log q(z^(i)|x^(i), ϕ^(i));

    I if stopping criterion satisfied, then return ϕ^(i+1); otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^(i) = (1/L) ∑_{j=1}^L log(q(z^(i−j)|x^(i−j), ϕ^(i−j)) / p(x^(i−j), z^(i−j))).
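
    A PyTorch sketch of one black-box update is given below; it assumes a Gaussian variational posterior module as above, a callable log_joint(x, z) returning log p(x, z), and a fixed baseline c (which can be replaced by the running average just described).

    import torch

    def blackbox_amortized_vi_step(x, posterior, log_joint, optimizer, baseline=0.0):
        # One REINFORCE update of the variational parameters phi.
        nu, s = posterior(x)
        with torch.no_grad():
            z = nu + s * torch.randn_like(s)             # z ~ q(z|x, phi), no reparametrization
        log_q = torch.distributions.Normal(nu, s).log_prob(z).sum()
        with torch.no_grad():
            signal = log_q - log_joint(x, z) - baseline  # log(q/p) - c
        surrogate = signal * log_q                       # gradient equals the REINFORCE estimator
        optimizer.zero_grad()
        surrogate.backward()
        optimizer.step()
        return signal.item()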

    Osvaldo Simeone ML4Engineers 106 / 149

  • Amortized VI

    An important observation is that the algorithms summarized above are doubly stochastic: the estimate of the gradient requires sampling both the data and the latent (or auxiliary) variables.

    They can be directly generalized by using multiple samples per iteration rather than a single one.

    Osvaldo Simeone ML4Engineers 107 / 149

  • Variational EM

    Osvaldo Simeone ML4Engineers 108 / 149

  • Variational EM

    As we have seen in Chapter 7, optimal soft prediction plays a key role in the design of training algorithms for models with latent variables.

    We will now discuss how VI can be leveraged to make the implementation of such algorithms more efficient and scalable.

    Consider a generative model p(x, z|θ) with observed vector x, latent vector z, and model parameter θ. As we have seen, this model may be used in different settings:

    I for unsupervised learning of directed and undirected generative models;
    I and for supervised learning of mixture models (in which case x represents the output and there is an implicit conditioning on the input).

    Given a data set D = {x_n}_{n=1}^N, we specifically focus on the ML problem

    min_θ −(1/N) ∑_{n=1}^N log p(x_n|θ).

    Osvaldo Simeone ML4Engineers 109 / 149

  • Variational EM

    As we have seen in Chapter 7, this is equivalent to the problem

    min_θ (1/N) ∑_{n=1}^N min_{q_n(z_n|x_n)} fe(x_n, q_n(z_n|x_n)|θ),

    where the free energy is given as

    fe(x, q(z|x)|θ) = E_{z∼q(z|x)}[− log p(x, z|θ)] − E_{z∼q(z|x)}[− log q(z|x)],

    in which the first term is the supervised log-loss ℓ_s(x, q(z|x)|θ) and the second term equals the entropy H(q(z|x)).

    Based on this equivalence, the EM algorithm alternates between optimizing over the variational posteriors q_n(z_n|x_n) for all n = 1, ..., N in the E step and over the model parameter θ in the M step.

    Osvaldo Simeone ML4Engineers 110 / 149

  • Variational EM

    The EM algorithm assumes that both steps can be carried out exactly:

    I given the current iterate θ_old, the E step returns the optimal solution q_n(z_n|x_n) = p(z_n|x_n, θ_old), with p(z_n|x_n, θ_old) being the posterior computed from the joint p(x_n, z_n|θ_old);
    I and the M step exactly solves the problem

      min_θ { (1/N) ∑_{n=1}^N ℓ_s(x_n, p(z_n|x_n, θ_old)|θ) }.

    Variational EM (VEM) replaces the E step with the approximate posterior obtained via amortized VI, while the M step is also typically approximated via SGD.

    Osvaldo Simeone ML4Engineers 111 / 149

  • Variational EM

    Specifically, VEM introduces a class Q of parametric posterior models q(z|x, ϕ), and it tackles the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N fe(x_n, q(z_n|x_n, ϕ)|θ).

    Note that the variational parameter ϕ is shared across all data points: the variational model is amortized across the entire data set.

    As another remark, one could consider factorized, and possibly parametrized, posteriors, but we will not detail this option here.

    Depending on the form of the variational posterior, the optimization over ϕ can be done using either black-box VI, i.e., the REINFORCE gradient, or the reparametrization trick.

    Furthermore, there are several ways to carry out the optimization over the model parameters θ and the variational parameters ϕ.

    Osvaldo Simeone ML4Engineers 112 / 149

  • Variational EM

    [Figure: plate diagram over n = 1, ..., N showing the per-observation latent variables z_n and the observations x_n, the model parameter θ defining p(x_n, z_n|θ), and the variational parameter ϕ defining the amortized posterior q(z_n|x_n, ϕ).]

    Osvaldo Simeone ML4Engineers 113 / 149

  • Variational EM

    Below, we detail a standard approach based on SGD, in which optimization steps for the E and M steps are carried out simultaneously.

    The gradients over both the model and the variational parameters are estimated using doubly stochastic estimators based on samples of the observed and latent variables. In particular, the gradient over the variational parameters is obtained via black-box amortized VI.

    Osvaldo Simeone ML4Engineers 114 / 149

  • Black-Box VEM

    initialize θ^(1), ϕ^(1) (e.g., randomly)
    for i = 1, 2, ...
    I draw one sample x^(i) from the data set D and one sample for the latent variable as z^(i) ∼ q(z|x^(i), ϕ^(i));
    I obtain the next iterate for the variational parameters using the REINFORCE gradient (black-box VI) as

      ϕ^(i+1) ← ϕ^(i) + γ^(i) (c^(i) − log(q(z^(i)|x^(i), ϕ^(i)) / p(x^(i), z^(i)|θ^(i)))) · ∇_ϕ log q(z^(i)|x^(i), ϕ^(i));

    I obtain the next iterate for the model parameters as

      θ^(i+1) ← θ^(i) + ξ^(i) ∇_θ log p(x^(i), z^(i)|θ^(i));

    I if stopping criterion satisfied, then return θ^(i+1) and ϕ^(i+1); otherwise, continue.

    Following the discussion above, the baseline can be set as

    c^(i) = (1/L) ∑_{j=1}^L log(q(z^(i−j)|x^(i−j), ϕ^(i−j)) / p(x^(i−j), z^(i−j)|θ^(i−j))).
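
    The following PyTorch sketch combines the two updates in a single doubly stochastic step; the posterior module and the callable log_joint(x, z), which returns log p(x, z|θ) and depends on the model parameters held by opt_theta, are assumptions in the spirit of the earlier sketches.

    import torch

    def blackbox_vem_step(x, posterior, log_joint, opt_phi, opt_theta, baseline=0.0):
        # One black-box VEM iteration: a REINFORCE step on the variational
        # parameters phi and an SGD step on the model parameters theta.
        nu, s = posterior(x)
        with torch.no_grad():
            z = nu + s * torch.randn_like(s)            # z ~ q(z|x, phi)
        log_q = torch.distributions.Normal(nu, s).log_prob(z).sum()
        log_p = log_joint(x, z)                         # tracks theta
        with torch.no_grad():
            signal = log_q - log_p - baseline           # log(q/p) - c
        # phi: REINFORCE surrogate; theta: ascent on log p(x, z|theta)
        loss = signal * log_q - log_p
        opt_phi.zero_grad()
        opt_theta.zero_grad()
        loss.backward()
        opt_phi.step()
        opt_theta.step()
        return signal.item()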

    Osvaldo Simeone ML4Engineers 115 / 149

  • Example

    Consider again the beta-exponential model, but let us assume now that the likelihood contains a trainable parameter θ. Mathematically, we have the following joint distribution p(x, z|θ):

    z ∼ p(z) = Beta(z|a, b)
    x|z ∼ p(x|z, θ) = Exp(x|θz) = (1/(θz)) exp(−x/(θz)) 1(x ≥ 0),

    where the prior parameters (a, b) are fixed.

    Unlike the example above, in which we focused on the approximation of the posterior p(z|x), we are now interested in training the generative model p(x, z|θ) = p(z)p(x|z, θ) via ML, based on a data set D = {x_n}_{n=1}^N of observations.
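
    As an aid to implementing black-box VEM for this example, a SciPy sketch of the joint log-density log p(x, z|θ) is given below; the exponential likelihood is parametrized through its mean θz (matching the density written above, an assumption), and the prior parameters a, b are placeholders for the fixed values used in the experiment.

    import numpy as np
    from scipy import stats

    def log_joint(x, z, theta, a=2.0, b=2.0):
        # log p(x, z|theta) = log Beta(z|a, b) + log Exp(x|theta * z),
        # with the exponential distribution parametrized by its mean theta * z.
        return (stats.beta.logpdf(z, a, b)
                + stats.expon.logpdf(x, scale=theta * z))

    # illustrative evaluation at a single point
    print(log_joint(x=1.3, z=0.4, theta=2.0))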

    Osvaldo Simeone ML4Engineers 116 / 149

  • Example

    To this end, we apply black-box VEM by assuming a beta variational posterior.

    We first consider a non-amortized solution in which, at each iteration i, we run several (here 100) iterations of black-box VI in order to obtain a variational distribution q(z|x^(i), ϕ^(i)) = Beta(z|ϕ_1^(i), ϕ_2^(i)) that approximates the posterior p(z|x^(i), θ^(i)).

    This solution requires running a separate black-box VI process at each iteration.

    The figure shows one random run of the algorithm.

    Osvaldo Simeone ML4Engineers 117 / 149

  • Example

    The panels on the right show the evolution of the variational parameters for two iterations i.

    [Figure: one panel showing the evolution of the algorithm's iterates over 1000 iterations, and two panels showing the trajectories of the variational parameters (ϕ_1^(i), ϕ_2^(i)) at two different iterations i.]

    Osvaldo Simeone ML4Engineers 118 / 149

  • Example

    We then consider the (amortized) black-box VEM algorithm detailed above. For this purpose, we set the amortized posterior q(z|x, ϕ) = Beta(z|ϕ_1(1 + x), ϕ_2), where ϕ = [ϕ_1, ϕ_2]^T ∈ R².

    Note that here the variational parameters evolve with the model parameters over the same number of iterations.

    The figure shows one random run of the algorithm.

    Osvaldo Simeone ML4Engineers 119 / 149

  • Variational Autoencoders

    Variational autoencoders (VAEs) are an instance of VEM that uses the reparametrization trick, instead of black-box VI, in order to estimate the gradient over the variational parameters.

    Accordingly, VAEs set the amortized variational posterior as

    q(z|x, ϕ) = N(z | ν(x|ϕ), Diag(s(x|ϕ)²)),

    as discussed above, as well as the prior p(z) = N(z|0, I).

    The resulting algorithm can be derived as above, but with the reparametrization trick in lieu of the REINFORCE gradient.
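
    A minimal PyTorch sketch of the resulting per-example free energy (the negative ELBO) is given below; the layer sizes and the Bernoulli decoder, which assumes observations x with entries in [0, 1], are illustrative choices. A training step simply backpropagates free_energy(x) and applies an SGD (or Adam) update, so that the encoder parameters ϕ and the decoder parameters θ are learned jointly.

    import torch
    import torch.nn as nn

    class VAE(nn.Module):
        # Gaussian amortized encoder q(z|x, phi) and Bernoulli decoder p(x|z, theta).
        def __init__(self, x_dim, z_dim, hidden=128):
            super().__init__()
            self.enc = nn.Sequential(nn.Linear(x_dim, hidden), nn.ReLU())
            self.enc_mean = nn.Linear(hidden, z_dim)
            self.enc_log_std = nn.Linear(hidden, z_dim)
            self.dec = nn.Sequential(nn.Linear(z_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, x_dim))

        def free_energy(self, x):
            h = self.enc(x)
            nu, s = self.enc_mean(h), torch.exp(self.enc_log_std(h))
            z = nu + s * torch.randn_like(s)              # reparametrization trick
            logits = self.dec(z)
            rec = nn.functional.binary_cross_entropy_with_logits(
                logits, x, reduction='sum')               # -log p(x|z, theta)
            kl = 0.5 * (s ** 2 + nu ** 2 - 1.0 - 2.0 * torch.log(s)).sum()
            return rec + kl                               # free energy (negative ELBO)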

    Osvaldo Simeone ML4Engineers 120 / 149

  • Variational EM

    To conclude this section, we highlight a limitation of VEM, often referred to as posterior mode collapse.

    Assuming models of the form p(x, z|θ) = p(z)p(x|z, θ), VEM addresses the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N fe(x_n, q(z_n|x_n, ϕ)|θ),

    where the free energy is defined as

    fe(x, q(z|x, ϕ)|θ) = E_{z∼q(z|x,ϕ)}[− log p(x|z, θ)] + KL(q(z|x, ϕ)||p(z)).

    Suppose now that the model p(x|z, θ) has a large capacity, so that there exists a model parameter θ such that p(x|z, θ) ≈ p_D(x) irrespective of the value of the latent vector z.

    Osvaldo Simeone ML4Engineers 121 / 149

  • Variational EM

    Then, the optimal variational posterior q(z|x, ϕ) equals the prior p(z), assuming that the variational model class is large enough. As a result, the variational posterior is not useful for inference.

    The point here is that, when applying the VEM algorithm, one is not guaranteed to obtain a useful variational posterior.

    Some solutions add a mutual information term I(x; z), evaluated using the variational posterior, in order to avoid the case outlined above in which x and z are independent.

    Osvaldo Simeone ML4Engineers 122 / 149

  • Generalized Bayesian Inference, Generalized VI, and Generalized VEM

    Osvaldo Simeone ML4Engineers 123 / 149

  • Generalized Bayesian Inference

    As we have discussed throughout this chapter, Bayesian inference can be formulated as the optimization problem

    min_{q(z|x)} E_{z∼q(z|x)}[− log p(x|z)] + KL(q(z|x)||p(z))

    for a fixed observation x, where the first term is the average log-loss in the reconstruction of x from z, and the second term is the divergence between the variational posterior and the prior.

    When the log-loss is substituted by any other loss function ℓ(x, z), and the KL divergence by any other regularizing measure C(q(z|x)), constrained to be a convex function of q(z|x), we obtain generalized Bayesian inference

    min_{q(z|x)} E_{z∼q(z|x)}[ℓ(x, z)] + α C(q(z|x)),

    where α > 0 is a "temperature" parameter, the first term is the average loss in the reconstruction of x from z, and C(q(z|x)) serves as a regularizing "complexity" measure for the variational posterior.

    We will refer to the objective of the problem above as the generalized free energy.

    Osvaldo Simeone ML4Engineers 124 / 149

  • Generalized Bayesian Inference

    Examples of "complexity" measures C(q(z|x)) include

    I the KL divergence KL(q(z|x)||p(z)) with respect to some prior distribution p(z);
    I the negative entropy −H(q(z|x));
    I the quadratic total variation distance (1/2)||q(z|x) − p(z)||² = (1/2) ∑_z |q(z|x) − p(z)|² with respect to some prior distribution p(z) (with the sum replaced by an integral for continuous latent rvs);
    I or other divergence measures between q(z|x) and the prior p(z) (see next chapter).

    Osvaldo Simeone ML4Engineers 125 / 149

  • Generalized Bayesian Inference

    When the KL divergence C(q(z|x)) = KL(q(z|x)||p(z)) is used, the optimal solution of the generalized Bayesian inference problem can be derived as follows.

    Let us write the generalized free energy as

    E_{z∼q(z|x)}[ℓ(x, z)] + α KL(q(z|x)||p(z))
      = E_{z∼q(z|x)}[ℓ(x, z)] + α E_{z∼q(z|x)}[log(q(z|x)/p(z))]
      = α E_{z∼q(z|x)}[log( q(z|x) / (p(z) exp(−(1/α) ℓ(x, z))) )]
      = α KL(q(z|x) || p(z) exp(−α^(−1) ℓ(x, z))),

    where we have used the convention of defining the KL divergence even when the second term is not normalized.

    Osvaldo Simeone ML4Engineers 126 / 149

  • Generalized Bayesian Inference

    It follows that the generalized Bayesian inference problem can be reformulated as the minimization

    min_{q(z|x)} KL(q(z|x) || p(z) exp(−α^(−1) ℓ(x, z))).

    In the Appendix, we show that the solution of the more general problem

    min_{q(z)} KL(q(z)||p̃(z)),

    where p̃(z) is an unnormalized distribution over z, is given by the normalized distribution

    q*(z) = p̃(z)/Z,

    where the constant Z ensures normalization.

    Osvaldo Simeone ML4Engineers 127 / 149

  • Generalized Bayesian Inference

    It follows that the optimal solution of the generalized Bayesian problem with KL divergence-based regularization is given as

    q*(z|x) ∝ p(z) exp(−(1/α) ℓ(x, z)).

    This distribution is known as the Gibbs posterior, and it corresponds to the standard posterior if the loss function is the log-loss, i.e., if ℓ(x, z) = − log p(x|z), and if α = 1.

    The Gibbs posterior "tilts" the prior p(z) with a term that depends on the loss ℓ(x, z).
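
    For a one-dimensional latent variable, the Gibbs posterior can be evaluated by numerical normalization on a grid, as in the following NumPy sketch; the prior, loss function, and temperature used here are illustrative.

    import numpy as np

    def gibbs_posterior(z_grid, loss, prior, alpha=1.0):
        # q*(z|x) proportional to p(z) * exp(-loss(x, z) / alpha), normalized
        # numerically over a uniform grid of z values.
        log_tilted = np.log(prior) - loss / alpha
        log_tilted -= log_tilted.max()           # for numerical stability
        tilted = np.exp(log_tilted)
        dz = z_grid[1] - z_grid[0]
        return tilted / (tilted.sum() * dz)      # integrates (approximately) to 1

    # illustrative example: standard Gaussian prior and a quadratic loss
    z_grid = np.linspace(-1.0, 3.0, 400)
    prior = np.exp(-0.5 * z_grid ** 2) / np.sqrt(2 * np.pi)
    loss = (z_grid - 2.0) ** 2                   # assumed loss l(x, z) for a fixed x
    q_star = gibbs_posterior(z_grid, loss, prior, alpha=0.5)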

    Osvaldo Simeone ML4Engineers 128 / 149

  • Generalized Bayesian Inference

    Examples of generalized posteriors, i.e., of optimal solutions to the problem of minimizing the generalized free energy, can be found in the table below. The derivation follows by leveraging connections to Fenchel duality, as detailed in the Appendix. (The parameter τ satisfies ∑_z (p(z) − (1/α) ℓ(x, z) − τ)_+ = 1, with (x)_+ = max(x, 0).)

    C(q(z|x))                     gen. posterior q*(z|x)
    KL(q(z|x)||p(z))              ∝ p(z) exp(−(1/α) ℓ(x, z))
    −H(q(z|x))                    ∝ exp(−(1/α) ℓ(x, z))
    (1/2)||q(z|x) − p(z)||²       (p(z) − (1/α) ℓ(x, z) − τ)_+

    Osvaldo Simeone ML4Engineers 129 / 149

  • Example

    The figure below shows the generalized posterior for the prior p(z) = N(z|0, 1) and the loss function plotted in the figure.

    [Figure: three panels showing the loss function ℓ and the generalized posteriors q* over z for the complexity measures C = −H, C = KL, and C = 0.5||·||², at temperatures α = 10, α = 0.5 × 10⁻², and α = 10⁻³.]

    Osvaldo Simeone ML4Engineers 130 / 149

  • Example

    When α = 10, the generalized posterior puts more weight on minimizing the complexity term C(q(z|x)).

    I As a result, for C(q(z|x)) = −H(q), the generalized posterior is close to a uniform distribution, whereas it approximates the prior p(z) for C(q(z|x)) = KL(q||p) and C(q(z|x)) = 0.5||q − p||².

    For smaller values of the temperature α, the emphasis shifts to minimizing the average loss, and the generalized posterior increasingly concentrates on the minimizer of the loss function ℓ(x, z).

    Osvaldo Simeone ML4Engineers 131 / 149

  • Generalized VI

    When optimizing over a restricted class Q of posteriors, the generalized Bayesian inference problem is referred to as generalized VI (GVI):

    min_{q(z|x)∈Q} E_{z∼q(z|x)}[ℓ(x, z)] + α C(q(z|x)).

    All the techniques discussed above, with both factorized and parametrized variational posteriors, can be applied.

    For instance, parametrized GVI amounts to the problem

    min_ϕ E_{z∼q(z|x,ϕ)}[ℓ(x, z)] + α C(q(z|x, ϕ)).

    Osvaldo Simeone ML4Engineers 132 / 149

  • Generalized VEM

    In an analogous way, generalized VEM (GVEM) can also be directly defined as the problem

    min_{θ,ϕ} E_{(x,z)∼p(x)q(z|x,ϕ)}[ℓ(x, z|θ)] + α E_{x∼p(x)}[C(q(z|x, ϕ))],

    where the loss function now depends on the model parameters θ.

    The marginal p(x) is generally unknown and is replaced with the empirical distribution, yielding the problem

    min_{θ,ϕ} (1/N) ∑_{n=1}^N { E_{z_n∼q(z|x_n,ϕ)}[ℓ(x_n, z_n|θ)] + α C(q(z|x_n, ϕ)) }.

    We will see applications of GVI and GVEM in the next chapter.

    Osvaldo Simeone ML4Engineers 133 / 149

  • Summary

    Osvaldo Simeone ML4Engineers 134 / 149

  • Summary

    Most of the algorithms studied in this chapter tackle special cases of the generalized variational EM (VEM) problem. Given a data set D = {x_n}_{n=1}^N, this amounts to the optimization over the model parameters θ and the variational parameters ϕ of the generalized free energy criterion, as in

    min_{ϕ,θ} (1/N) ∑_{n=1}^N E_{z_n∼q(z|x_n,ϕ)}[ℓ(x_n, z_n|θ)] + (α/N) ∑_{n=1}^N KL(q(z|x_n, ϕ)||p(z|θ)),

    where the first term is the average loss in the reconstruction of x_n from z_n, the second term measures the "complexity" of the variational posterior, and

    I q(z|x, ϕ) represents a class of variational distributions, in which the variational parameter ϕ is amortized across all values of the observation x;
    I ℓ(x, z|θ) is a loss function, such as the log-loss ℓ(x, z|θ) = − log p(x|z, θ), which depends on the model parameter θ;
    I p(z) is a reference prior distribution;
    I and α is a "temperature" parameter.

    Osvaldo Simeone ML4Engineers 135 / 149

  • Summary

    When the model parameters θ are fixed, we obtain the VI problem.

    For the VI problem, if no constraint is imposed on the form of the variational posterior, the optimal solution can be efficiently found only in special cases, such as when the joint distribution is in the conjugate exponential family.

    Otherwise, the problem can be efficiently tackled either by black-box VI or by the reparametrization trick.

    I The latter requires a special structure for the variational posterior, while the former can be applied more generally.
    I Both methods apply a doubly stochastic estimate of the gradient.

    When the model parameters θ are to be trained, the problem can be addressed by the variational EM (VEM) algorithm, in which SGD updates are applied to both the model and the variational parameters.

    Osvaldo Simeone ML4Engineers 136 / 149

  • Appendix

    Osvaldo Simeone ML4Engineers 137 / 149

  • Dirichlet-Categorical Model

    The Dirichlet-categorical model is another example of a conjugate exponential-family distribution.

    The likelihood is given by the categorical distribution

    p(x|z) = ∏_{k=0}^{C−1} z_k^{1(x=k)},

    and the conjugate prior is the Dirichlet distribution

    p(z_0, ..., z_{C−1}|α_0, ..., α_{C−1}) = Dir(z_0, ..., z_{C−1}|α_0, ..., α_{C−1}) ∝ ∏_{k=0}^{C−1} z_k^{α_k − 1},

    with parameters α = (α_0, ..., α_{C−1}).

    The Dirichlet distribution has mean and mode given as

    E_{z∼Dir(z_0,...,z_{C−1}|α_0,...,α_{C−1})}[z_k] = α_k / ∑_{j=0}^{C−1} α_j

    mode_{z∼Dir(z_0,...,z_{C−1}|α_0,...,α_{C−1})}[z_k] = (α_k − 1) / (∑_{j=0}^{C−1} α_j − C).

    Osvaldo Simeone ML4Engineers 138 / 149

  • Dirichlet-Categorical Model

    The posterior distribution given L i.i.d. observations x_1, ..., x_L ∼ p(x|z) is the Dirichlet distribution

    p(z|x_1, ..., x_L) = Dir(z_0, ..., z_{C−1}|α_0 + L[0], ..., α_{C−1} + L[C − 1])
                       ∝ ∏_{k=0}^{C−1} z_k^{α_k + L[k] − 1},

    where L[k] denotes the number of observations among x_1, ..., x_L that are equal to k.
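
    Computationally, the update only requires the per-class counts L[k], as in the following NumPy sketch with illustrative prior parameters and observations.

    import numpy as np

    def dirichlet_posterior(alpha, observations, num_classes):
        # Posterior Dirichlet parameters alpha_k + L[k], where L[k] counts the
        # number of observations equal to class k.
        counts = np.bincount(observations, minlength=num_classes)
        return np.asarray(alpha, dtype=float) + counts

    alpha = [1.0, 1.0, 1.0]                       # illustrative symmetric prior
    x = np.array([0, 2, 2, 1, 2])                 # observed class labels
    alpha_post = dirichlet_posterior(alpha, x, num_classes=3)   # -> [2., 2., 4.]
    posterior_mean = alpha_post / alpha_post.sum()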

    Osvaldo Simeone ML4Engineers 139 / 149

  • Derivation of Update Equation for Mean-Field VI

    We need to show that we have

    q*_m(z_m|x) ∝ exp( E_{z_{−m}}[log p(x, z_m, z_{−m})] ),

    where z_{−m} ∼ q(z_{−m}|x) := ∏_{m′≠m} q_{m′}^{(i−1)}(z_{m′}|x).

    Under the mean-field factorization, the entropy H(q(z|x)) equals the sum

    H(q(z|x)) = ∑_{m′=1}^M H(q_{m′}(z_{m′}|x)).

    Furthermore, by using the law of iterated expectations, we can write the supervised log-loss as

    ℓ_s(x, q(z|x)) = E_{z∼q(z|x)}[− log p(x, z)] = E_{z_m}[ E_{z_{−m}}[− log p(x, z_m, z_{−m})] ],

    where z_m ∼ q_m(z_m|x).

    Osvaldo Simeone ML4Engineers 140 / 149

  • Derivation of Update for Mean-Field VI

    The problem of interest, namely

    min_{q_m(z_m|x)} fe(x, q(z|x) = q_m(z_m|x) · q_{−m}^{(i−1)}(z_{−m}|x)),

    can therefore be equivalently expressed as

    min_{q_m(z_m|x)} E_{z_m∼q_m(z_m|x)}[ E_{z_{−m}}[− log p(x, z_m, z_{−m})] ] − H(q_m(z_m|x)).

    This corresponds to the free energy of a model with loss function given by ℓ(x, z_m) = E_{z_{−m}}[− log p(x, z_m, z_{−m})], and hence the optimal solution q*_m(z_m|x) follows directly from the solution of the generalized VI problem.

    Osvaldo Simeone ML4Engineers 141 / 149

  • Proof of the REINFORCE Gradient Formula

    By direct calculation, we have the following equalities:

    ∇_ϕ E_{z∼q(z|x,ϕ)}[g(z|ϕ)] = ∑_z ∇_ϕ ( q(z|x, ϕ) g(z|ϕ) )
      = ∑_z (∇_ϕ q(z|x, ϕ)) g(z|ϕ) + ∑_z q(z|x, ϕ) (∇_ϕ g(z|ϕ))
      = ∑_z q(z|x, ϕ) (∇_ϕ log q(z|x, ϕ)) g(z|ϕ) + E_{z∼q(z|x,ϕ)}[∇_ϕ g(z|ϕ)]
      = E_{z∼q(z|x,ϕ)}[g(z|ϕ) · ∇_ϕ log q(z|x, ϕ)],

    where the sum is replaced by an integral for continuous distributions, and we have used the identities ∇_ϕ log q(z|x, ϕ) = (∇_ϕ q(z|x, ϕ))/q(z|x, ϕ) and E_{z∼q(z|x,ϕ)}[∇_ϕ log q(z|x, ϕ)] = 0, with the latter equality following from the zero mean of the score vector (see Chapter 9).

    Osvaldo Simeone ML4Engineers 142 / 149

  • Variance of the REINFORCE Gradient

    Define ∇̂_ϕ(z|c) := (g(z) − c) · ∇_ϕ log q(z|ϕ) as the REINFORCE estimator, with z ∼ q(z|ϕ).

    The estimator is unbiased, i.e., E_{z∼q(z|ϕ)}[∇̂_ϕ(z|c)] = ∇_ϕ E_{z∼q(z|ϕ)}[g(z)] =: ∇_ϕ.

    Its variance is given as

    Var(∇̂_ϕ(z|c)) = E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c) − ∇_ϕ||²] = E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c)||²] − ||∇_ϕ||²,

    where

    E_{z∼q(z|ϕ)}[||∇̂_ϕ(z|c)||²] = E_{z∼q(z|ϕ)}[(g(z) − c)² ||∇_ϕ log q(z|ϕ)||²].

    Osvaldo Simeone ML4Engineers 143 / 149

  • Variance of the REINFORCE Gradient

    What is the optimal value of c?

    The function Var(∇̂_ϕ(z|c)) is convex in c, so differentiating with respect to c yields

    ∂_c Var(∇̂_ϕ(z|c)) = −2 E_{z∼q(z|ϕ)}[(g(z) − c) ||∇_ϕ log q(z|ϕ)||²],

    and the first-order optimality condition leads to the optimal solution

    c* = E_{z∼q(z|ϕ)}[g(z) ||∇_ϕ log q(z|ϕ)||²] / E_{z∼q(z|ϕ)}[||∇_ϕ log q(z|ϕ)||²].

    If we approximate the expectations using the past iterates for z, this yields the selection

    c^(i) = ( (1/L) ∑_{j=1}^L g(z^(i−j)) ||∇_ϕ log q(z^(i−j)|ϕ^(i−j))||² ) / ( (1/L) ∑_{j=1}^L ||∇_ϕ log q(z^(i−j)|ϕ^(i−j))||² ).
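
    A NumPy sketch of this selection is given below; the buffers of past values g(z^(i−j)) and of squared score norms ||∇_ϕ log q||² are illustrative placeholders for quantities stored over the last L black-box VI iterations.

    import numpy as np

    def reinforce_baseline(past_g, past_score_sq_norms):
        # Variance-minimizing baseline c estimated from the last L iterates.
        past_g = np.asarray(past_g)
        past_score_sq_norms = np.asarray(past_score_sq_norms)
        return (past_g * past_score_sq_norms).sum() / past_score_sq_norms.sum()

    # illustrative buffers from the last L = 4 iterations
    g_values = [2.3, 1.8, 2.0, 2.6]
    score_sq_norms = [0.5, 1.2, 0.8, 0.9]
    c = reinforce_baseline(g_values, score_sq_norms)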

    Osvaldo Simeone ML4Engineers 144 / 149