TRANSCRIPT
Gibbs Sampling and Variational Methods
Héctor Corrada Bravo
University of Maryland, College Park, USA. CMSC 644, 2019-04-03
Blei (2012), Comm. ACM
Mixture Models
Documents as mixtures of topics (Hofmann 1999, Blei et al. 2003)
1 / 48
Kovacevic (2014), PLOS One
Mixture Models
More applications: genetics, populations as mixtures of ancestral populations
2 / 48
Glueck, et al. (2017). TVCG
Mixture Models
More applications: clinical subtyping
3 / 48
Schulman and Saria (2016). JMLR
Mixture Models
More applications: clinical prognosis
4 / 48
Mixture Models
We have a set of documents D.
Each document is modeled as a bag of words (BOW) over a dictionary W.
x_{w,d}: the number of times word w ∈ W appears in document d ∈ D.
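As a concrete illustration (a minimal sketch, not from the slides; the toy documents and the scikit-learn CountVectorizer usage are assumptions), a word-document count matrix can be built directly from raw text:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # toy documents
vectorizer = CountVectorizer()
doc_mat = vectorizer.fit_transform(docs).toarray()  # row d, column w holds x_{w,d}
print(vectorizer.get_feature_names_out())  # the dictionary W
print(doc_mat)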
7 / 48
Approximate Inference by Sampling
Ultimately, what we are interested in is learning topics θ.
Perhaps instead of finding parameters that maximize likelihood,
sample from a distribution Pr(θ | D) that gives us topic estimates.
8 / 48
Approximate Inference by Sampling
Ultimately, what we are interested in is learning topics θ.
Perhaps instead of finding parameters that maximize the likelihood Pr(D | θ),
sample from a distribution Pr(θ | D) that gives us topic estimates.
But we have only talked about the likelihood Pr(D | θ); how can we sample parameters?
9 / 48
Approximate Inference by Sampling
Like EM, the trick here is to expand the model with latent data Z^m
and sample from the distribution Pr(θ, Z^m | Z).
10 / 48
Approximate Inference by Sampling
Like EM, the trick here is to expand the model with latent data Z^m
and sample from the distribution Pr(θ, Z^m | Z).
This is challenging, but sampling from Pr(θ | Z^m, Z) and Pr(Z^m | θ, Z) is easier.
11 / 48
Approximate Inference by Sampling
The Gibbs Sampler does exactly that.
Property: after some rounds, samples from the conditional distribution Pr(θ | Z^m, Z)
correspond to samples from the marginal
Pr(θ | Z) = ∑_{Z^m} Pr(θ, Z^m | Z)
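To make the idea concrete, here is a minimal, generic Gibbs sampler sketch (not from the slides; the bivariate normal target and all variable names are assumptions). It alternates draws from the two conditional distributions, and after a burn-in the pairs behave like draws from the joint (and hence from its marginals):

import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, burn_in=500, seed=0):
    # Target: bivariate normal with correlation rho; both conditionals are
    # univariate normals, so each Gibbs step is a simple draw.
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0  # arbitrary starting point
    samples = []
    for i in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # sample x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # sample y | x
        if i >= burn_in:  # keep only samples after (approximately) reaching stationarity
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal()
print(samples.mean(axis=0), samples.std(axis=0))  # near (0, 0) and (1, 1)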
12 / 48
Approximate Inference by Sampling
Quick aside: how do we simulate data for pLSA?
Generate parameters {p_d} and {θ_t}, then generate Δ_{w,d,t}.
13 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first.
14 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first, where
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
and γ_{w,d,t} is as given by the E-step.
15 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first, where
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
and γ_{w,d,t} is as given by the E-step:

import numpy as np

for d in range(num_docs):
    for w in range(num_words):
        # split the x_{w,d} counts of word w in document d across the T topics
        delta[d, w, :] = np.random.multinomial(doc_mat[d, w], gamma[d, w, :])
16 / 48
Approximate Inference by Sampling
Hmm, that's a problem since we need x_{w,d}...
But we know Pr(w, d) = ∑_t p_{t,d} θ_{w,t}, so let's use that to generate each x_{w,d} as
x_{w,d} ∼ Mult_{n_d}(Pr(1, d), …, Pr(W, d))
17 / 48
Approximate Inference by Sampling
Hmm, that's a problem since we need x_{w,d}...
But we know Pr(w, d) = ∑_t p_{t,d} θ_{w,t}, so let's use that to generate each x_{w,d} as
x_{w,d} ∼ Mult_{n_d}(Pr(1, d), …, Pr(W, d)):

for d in range(num_docs):
    word_probs = np.sum(p[:, d] * theta, axis=1)  # Pr(w, d) = sum_t p[t, d] * theta[w, t]
    doc_mat[d, :] = np.random.multinomial(nw[d], word_probs)  # nw[d]: total word count of document d
18 / 48
Approximate Inference by Sampling
Now, how about p_d? How do we generate the parameters of a Multinomial distribution?
19 / 48
Approximate Inference by Sampling
Now, how about p_d? How do we generate the parameters of a Multinomial distribution?
This is where the Dirichlet distribution comes in...
If p_d ∼ Dir(α), then
Pr(p_d) ∝ ∏_{t=1}^{T} p_{t,d}^{α_t − 1}
20 / 48
Approximate Inference by Sampling
Some interesting properties:
E[p_{t,d}] = α_t / ∑_{t'} α_{t'}
So, if we set all α_t = 1 we will tend to have uniform probability over topics (1/T each on average).
If we increase to α_t = 100 it will also have uniform probability on average, but with very little variance (it will almost always be 1/T).
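A quick numerical check of these properties (a sketch, not from the slides; T = 5 and the number of draws are arbitrary choices): both settings have mean 1/T, but the larger α_t concentrates the draws tightly around 1/T.

import numpy as np

T = 5
rng = np.random.default_rng(0)
for a in (1.0, 100.0):
    draws = rng.dirichlet(a * np.ones(T), size=10000)  # 10,000 draws from Dir(a, ..., a)
    print(f"alpha_t = {a}: mean = {draws.mean(axis=0).round(3)}, std = {draws.std(axis=0).round(3)}")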
21 / 48
Approximate Inference by Sampling
So, we can say p_d ∼ Dir(α) and θ_t ∼ Dir(β).
22 / 48
Approximate Inference by Sampling
So, we can say p_d ∼ Dir(α) and θ_t ∼ Dir(β),
and generate data as (with α_t = 1):

for d in range(num_docs):
    # topic proportions for document d, drawn from a symmetric Dirichlet(1)
    p[:, d] = np.random.dirichlet(1. * np.ones(num_topics))
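The topic-word distributions θ_t can be generated the same way (an analogous sketch, not shown on the slide; beta_val and num_words are assumed to be defined):

for t in range(num_topics):
    theta[:, t] = np.random.dirichlet(beta_val * np.ones(num_words))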
23 / 48
Approximate Inference by Sampling
So what we have is a prior over parameters {p_d} and {θ_t}: Pr(p_d | α) and Pr(θ_t | β).
And we can formulate a joint distribution with the missing data Δ_{w,d,t}:
Pr(Δ_{w,d,t}, p_d, θ_t | α, β) = Pr(Δ_{w,d,t} | p_d, θ_t) Pr(p_d | α) Pr(θ_t | β)
24 / 48
Approximate Inference by Sampling
However, what we care about is the posterior distribution Pr(p_d | Δ_{w,d,t}, θ_t, α, β).
What do we do???
25 / 48
Approximate Inference by Sampling
Another neat property of the Dirichlet distribution is that it is conjugate to the Multinomial.
If θ | α ∼ Dir(α) and X | θ ∼ Multinomial(θ), then
θ | X, α ∼ Dir(X + α)
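A quick illustration of what conjugacy buys us (a sketch, not from the slides; the counts and prior are arbitrary): posterior draws are just Dirichlet draws with the observed counts added to the prior.

import numpy as np

alpha = np.ones(3)  # prior Dir(1, 1, 1) over 3 categories
X = np.array([10, 2, 0])  # observed multinomial counts
posterior_draws = np.random.dirichlet(X + alpha, size=1000)
print(posterior_draws.mean(axis=0))  # close to (X + alpha) / sum(X + alpha)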
26 / 48
Approximate Inference by Sampling
That means we can sample p_d from
p_{t,d} ∼ Dir(∑_w Δ_{w,d,t} + α)
and θ_t from
θ_{w,t} ∼ Dir(∑_d Δ_{w,d,t} + β)
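In code, these conditional draws look like the sketch below (assuming, as in the earlier snippets, that delta has shape (num_docs, num_words, num_topics) and that alpha_val and beta_val are scalar prior parameters; the names are illustrative):

for d in range(num_docs):
    # sum Delta over words w, then draw the topic proportions for document d
    p[:, d] = np.random.dirichlet(delta[d].sum(axis=0) + alpha_val)
for t in range(num_topics):
    # sum Delta over documents d, then draw the word distribution for topic t
    theta[:, t] = np.random.dirichlet(delta[:, :, t].sum(axis=0) + beta_val)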
27 / 48
Blei, Ng, Jordan (2003), JMLR
Approximate Inference by Sampling
Coincidentally, we have just specified the Latent Dirichlet Allocation (LDA) method for topic modeling.
This is the most commonly used method for topic modeling.
28 / 48
Approximate Inference by Sampling
We can now specify a full Gibbs Sampler for an LDA mixture model.
Given:
Word-document counts x_{w,d}
Number of topics K
Prior parameters α and β
Do: Learn parameters {p_d} and {θ_t} for K topics
29 / 48
Approximate Inference by Sampling
Step 0: Initialize parameters {p_d} and {θ_t}:
p_d ∼ Dir(α) and θ_t ∼ Dir(β)
30 / 48
Approximate Inference by Sampling
Step 1: Sample Δ_{w,d,t} based on current parameters {p_d} and {θ_t}:
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
31 / 48
Approximate Inference by Sampling
Step 2: Sample parameters from
p_{t,d} ∼ Dir(∑_w Δ_{w,d,t} + α)
and
θ_{w,t} ∼ Dir(∑_d Δ_{w,d,t} + β)
32 / 48
Approximate Inference by Sampling
Step 3: Get samples for a few iterations (e.g., 200) as burn-in; we want to reach a stationary distribution...
33 / 48
Approximate Inference by Sampling
Step 4: Estimate Δ̂_{w,d,t} as the average of the estimates from the last m iterations (e.g., m = 500).
34 / 48
Approximate Inference by Sampling
Step 5: Estimate parameters p_d and θ_t based on the estimated Δ̂_{w,d,t}:
p̂_{t,d} = (∑_w Δ̂_{w,d,t} + α) / (∑_t ∑_w Δ̂_{w,d,t} + α)
θ̂_{w,t} = (∑_d Δ̂_{w,d,t} + β) / (∑_w ∑_d Δ̂_{w,d,t} + β)
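Putting Steps 0-5 together, here is a compact sketch of the whole sampler (an illustrative implementation under the conventions used in the snippets above, not a reference implementation: doc_mat is the num_docs x num_words count matrix x_{w,d}, p is num_topics x num_docs, theta is num_words x num_topics, and alpha, beta are scalar symmetric priors). It uses 200 burn-in iterations and averages Δ over the last 500, matching the example numbers in the steps:

import numpy as np

def lda_gibbs(doc_mat, num_topics, alpha=1.0, beta=1.0, num_iter=700, burn_in=200, seed=0):
    rng = np.random.default_rng(seed)
    num_docs, num_words = doc_mat.shape

    # Step 0: initialize p_d ~ Dir(alpha) and theta_t ~ Dir(beta)
    p = rng.dirichlet(alpha * np.ones(num_topics), size=num_docs).T      # (T, D)
    theta = rng.dirichlet(beta * np.ones(num_words), size=num_topics).T  # (W, T)

    delta_sum = np.zeros((num_docs, num_words, num_topics))
    kept = 0
    for it in range(num_iter):
        # Step 1: sample Delta_{w,d,.} ~ Mult_{x_{w,d}}(gamma_{w,d,1}, ..., gamma_{w,d,T})
        delta = np.zeros((num_docs, num_words, num_topics))
        for d in range(num_docs):
            gamma = theta * p[:, d]                    # (W, T), proportional to gamma_{w,d,t}
            gamma /= gamma.sum(axis=1, keepdims=True)  # normalize over topics
            for w in range(num_words):
                delta[d, w, :] = rng.multinomial(doc_mat[d, w], gamma[w, :])

        # Step 2: sample parameters from their Dirichlet conditionals
        for d in range(num_docs):
            p[:, d] = rng.dirichlet(delta[d].sum(axis=0) + alpha)
        for t in range(num_topics):
            theta[:, t] = rng.dirichlet(delta[:, :, t].sum(axis=0) + beta)

        # Steps 3-4: discard the burn-in, average Delta over the remaining iterations
        if it >= burn_in:
            delta_sum += delta
            kept += 1
    delta_hat = delta_sum / kept

    # Step 5: point estimates of p and theta from the averaged Delta
    p_hat = delta_hat.sum(axis=1).T + alpha     # (T, D): sum over words, add prior
    p_hat /= p_hat.sum(axis=0, keepdims=True)
    theta_hat = delta_hat.sum(axis=0) + beta    # (W, T): sum over documents, add prior
    theta_hat /= theta_hat.sum(axis=0, keepdims=True)
    return p_hat, theta_hat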
35 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
36 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
Two inference procedures:
Exact Inference with Maximum Likelihood using the EM algorithm
Approximate Inference using Gibbs Sampling
37 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
Two inference procedures:
Exact Inference with Maximum Likelihood using the EM algorithm
Approximate Inference using Gibbs Sampling
Next, we will go back to Maximum Likelihood but learn about Approximate Inference using Variational Methods.
38 / 48
Variational Methods
Consider the LDA model again.
Benefits:
Full document generative model
Can process new documents (posterior over topics) and words (prior parameters)
39 / 48
Variational Methods
Consider the LDA model again.
With Gibbs we sampled from Pr(θ, Δ | x, α, β).
What if we want to estimate parameters again? (maximum a posteriori parameters)
40 / 48
Variational Methods
Consider the LDA model again.
The posterior is very difficult to maximize,
harder than pLSA due to the Dirichlet priors.
41 / 48
Variational Methods
Consider the LDA model again.
Let's get inspiration from EM: maximize a lower bound.
But what should the lower bound be?
42 / 48
Variational Methods
Consider the LDA model again.
Make missing data and parameters "independent"!
43 / 48
Variational Methods
Consider the LDA model again.
Find parameters that make the simple model most similar to the original model.
44 / 48
Variational Methods
Can then define an EM-like algorithm:
E-step: define expectation w.r.t. the approximate distribution
M-step: maximize parameters of the approximate distribution
45 / 48
Variational Methods
Net result:
1) Maximum posterior estimates
2) Super simple updates
3) With a stochastic approach (update using a few words at a time), extremely scalable
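For instance (an illustrative usage sketch, not part of the slides; the random count matrix is a stand-in for real data), scikit-learn's LatentDirichletAllocation fits LDA with this kind of online (stochastic) variational inference, following Hoffman et al. (2010):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

doc_mat = np.random.randint(0, 5, size=(200, 1000))  # stand-in word-document counts
lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                batch_size=32, random_state=0)
doc_topics = lda.fit_transform(doc_mat)  # per-document topic proportions
topic_words = lda.components_            # per-topic word weights (unnormalized)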
46 / 48
Hoffman, et al. (2010). NIPS
Variational Methods
47 / 48
Conclusion
Probabilistic mixture models: powerful model class with many applications
Awesome historical algorithmic development
Outstanding software support
48 / 48