TRANSCRIPT
Gibbs Sampling and Variational Methods
Héctor Corrada Bravo
University of Maryland, College Park, USA. CMSC 644, 2019-04-03
Blei (2012), Comm. ACM
Mixture Models
Documents as mixtures of topics (Hofmann 1999, Blei et al. 2003)
1 / 48
Kovacevic (2014), PLOS One
Mixture Models
More applications: genetics, populations as mixtures of ancestral populations
2 / 48
Glueck, et al. (2017). TVCG
Mixture Models
More applications: clinical subtyping
3 / 48
Schulman and Saria (2016). JMLR
Mixture Models
More applications: clinical prognosis
4 / 48
Mixture Models
We have a set of documents D.
Each document is modeled as a bag of words (BOW) over a dictionary W.
x_{w,d}: the number of times word w ∈ W appears in document d ∈ D.
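As a concrete illustration (a minimal sketch, not from the slides; the toy documents and the scikit-learn CountVectorizer usage are assumptions), a word-document count matrix can be built directly from raw text:

from sklearn.feature_extraction.text import CountVectorizer

docs = ["the cat sat on the mat", "the dog ate my homework"]  # toy documents
vectorizer = CountVectorizer()
doc_mat = vectorizer.fit_transform(docs).toarray()  # row d, column w holds x_{w,d}
print(vectorizer.get_feature_names_out())  # the dictionary W
print(doc_mat)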
7 / 48
Approximate Inference by Sampling
Ultimately, what we are interested in is learning topics θ.
Perhaps instead of finding parameters that maximize likelihood,
sample from a distribution Pr(θ | D) that gives us topic estimates.
8 / 48
Approximate Inference by Sampling
Ultimately, what we are interested in is learning topics θ.
Perhaps instead of finding parameters that maximize the likelihood Pr(D | θ),
sample from a distribution Pr(θ | D) that gives us topic estimates.
But we have only talked about the likelihood Pr(D | θ); how can we sample parameters?
9 / 48
Approximate Inference by Sampling
Like EM, the trick here is to expand the model with latent data Z^m
and sample from the distribution Pr(θ, Z^m | Z).
10 / 48
Approximate Inference by Sampling
Like EM, the trick here is to expand the model with latent data Z^m
and sample from the distribution Pr(θ, Z^m | Z).
This is challenging, but sampling from Pr(θ | Z^m, Z) and Pr(Z^m | θ, Z) is easier.
11 / 48
Approximate Inference by Sampling
The Gibbs Sampler does exactly that.
Property: after some rounds, samples from the conditional distribution Pr(θ | Z^m, Z)
correspond to samples from the marginal
Pr(θ | Z) = ∑_{Z^m} Pr(θ, Z^m | Z)
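To make the idea concrete, here is a minimal, generic Gibbs sampler sketch (not from the slides; the bivariate normal target and all variable names are assumptions). It alternates draws from the two conditional distributions, and after a burn-in the pairs behave like draws from the joint (and hence from its marginals):

import numpy as np

def gibbs_bivariate_normal(rho=0.8, n_iter=5000, burn_in=500, seed=0):
    # Target: bivariate normal with correlation rho; both conditionals are
    # univariate normals, so each Gibbs step is a simple draw.
    rng = np.random.default_rng(seed)
    x, y = 0.0, 0.0  # arbitrary starting point
    samples = []
    for i in range(n_iter):
        x = rng.normal(rho * y, np.sqrt(1 - rho ** 2))  # sample x | y
        y = rng.normal(rho * x, np.sqrt(1 - rho ** 2))  # sample y | x
        if i >= burn_in:  # keep only samples after (approximately) reaching stationarity
            samples.append((x, y))
    return np.array(samples)

samples = gibbs_bivariate_normal()
print(samples.mean(axis=0), samples.std(axis=0))  # near (0, 0) and (1, 1)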
12 / 48
Approximate Inference by Sampling
Quick aside: how do we simulate data for pLSA?
Generate parameters {p_d} and {θ_t}, then generate Δ_{w,d,t}.
13 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first.
14 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first, where
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
and γ_{w,d,t} is as given by the E-step.
15 / 48
Approximate Inference by Sampling
Let's go backwards: let's deal with Δ_{w,d,t} first, where
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
and γ_{w,d,t} is as given by the E-step:

import numpy as np

for d in range(num_docs):
    for w in range(num_words):
        # split the x_{w,d} counts of word w in document d across the T topics
        delta[d, w, :] = np.random.multinomial(doc_mat[d, w], gamma[d, w, :])
16 / 48
Approximate Inference by Sampling
Hmm, that's a problem since we need x_{w,d}...
But we know Pr(w, d) = ∑_t p_{t,d} θ_{w,t}, so let's use that to generate each x_{w,d} as
x_{w,d} ∼ Mult_{n_d}(Pr(1, d), …, Pr(W, d))
17 / 48
Approximate Inference by Sampling
Hmm, that's a problem since we need x_{w,d}...
But we know Pr(w, d) = ∑_t p_{t,d} θ_{w,t}, so let's use that to generate each x_{w,d} as
x_{w,d} ∼ Mult_{n_d}(Pr(1, d), …, Pr(W, d)):

for d in range(num_docs):
    word_probs = np.sum(p[:, d] * theta, axis=1)  # Pr(w, d) = sum_t p[t, d] * theta[w, t]
    doc_mat[d, :] = np.random.multinomial(nw[d], word_probs)  # nw[d]: total word count of document d
18 / 48
Approximate Inference by Sampling
Now, how about p_d? How do we generate the parameters of a Multinomial distribution?
19 / 48
Approximate Inference by Sampling
Now, how about p_d? How do we generate the parameters of a Multinomial distribution?
This is where the Dirichlet distribution comes in...
If p_d ∼ Dir(α), then
Pr(p_d) ∝ ∏_{t=1}^{T} p_{t,d}^{α_t − 1}
20 / 48
Approximate Inference by Sampling
Some interesting properties:
E[p_{t,d}] = α_t / ∑_{t'} α_{t'}
So, if we set all α_t = 1 we will tend to have uniform probability over topics (1/T each on average).
If we increase to α_t = 100 it will also have uniform probability on average, but with very little variance (it will almost always be 1/T).
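A quick numerical check of these properties (a sketch, not from the slides; T = 5 and the number of draws are arbitrary choices): both settings have mean 1/T, but the larger α_t concentrates the draws tightly around 1/T.

import numpy as np

T = 5
rng = np.random.default_rng(0)
for a in (1.0, 100.0):
    draws = rng.dirichlet(a * np.ones(T), size=10000)  # 10,000 draws from Dir(a, ..., a)
    print(f"alpha_t = {a}: mean = {draws.mean(axis=0).round(3)}, std = {draws.std(axis=0).round(3)}")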
21 / 48
Approximate Inference by Sampling
So, we can say p_d ∼ Dir(α) and θ_t ∼ Dir(β).
22 / 48
Approximate Inference by Sampling
So, we can say p_d ∼ Dir(α) and θ_t ∼ Dir(β),
and generate data as (with α_t = 1):

for d in range(num_docs):
    # topic proportions for document d, drawn from a symmetric Dirichlet(1)
    p[:, d] = np.random.dirichlet(1. * np.ones(num_topics))
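The topic-word distributions θ_t can be generated the same way (an analogous sketch, not shown on the slide; beta_val and num_words are assumed to be defined):

for t in range(num_topics):
    theta[:, t] = np.random.dirichlet(beta_val * np.ones(num_words))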
23 / 48
Approximate Inference by Sampling
So what we have is a prior over parameters {p_d} and {θ_t}: Pr(p_d | α) and Pr(θ_t | β).
And we can formulate a joint distribution with the missing data Δ_{w,d,t}:
Pr(Δ_{w,d,t}, p_d, θ_t | α, β) = Pr(Δ_{w,d,t} | p_d, θ_t) Pr(p_d | α) Pr(θ_t | β)
24 / 48
Approximate Inference by Sampling
However, what we care about is the posterior distribution Pr(p_d | Δ_{w,d,t}, θ_t, α, β).
What do we do???
25 / 48
Approximate Inference by Sampling
Another neat property of the Dirichlet distribution is that it is conjugate to the Multinomial.
If θ | α ∼ Dir(α) and X | θ ∼ Multinomial(θ), then
θ | X, α ∼ Dir(X + α)
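A quick illustration of what conjugacy buys us (a sketch, not from the slides; the counts and prior are arbitrary): posterior draws are just Dirichlet draws with the observed counts added to the prior.

import numpy as np

alpha = np.ones(3)  # prior Dir(1, 1, 1) over 3 categories
X = np.array([10, 2, 0])  # observed multinomial counts
posterior_draws = np.random.dirichlet(X + alpha, size=1000)
print(posterior_draws.mean(axis=0))  # close to (X + alpha) / sum(X + alpha)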
26 / 48
Approximate Inference by Sampling
That means we can sample p_d from
p_{t,d} ∼ Dir(∑_w Δ_{w,d,t} + α)
and θ_t from
θ_{w,t} ∼ Dir(∑_d Δ_{w,d,t} + β)
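In code, these conditional draws look like the sketch below (assuming, as in the earlier snippets, that delta has shape (num_docs, num_words, num_topics) and that alpha_val and beta_val are scalar prior parameters; the names are illustrative):

for d in range(num_docs):
    # sum Delta over words w, then draw the topic proportions for document d
    p[:, d] = np.random.dirichlet(delta[d].sum(axis=0) + alpha_val)
for t in range(num_topics):
    # sum Delta over documents d, then draw the word distribution for topic t
    theta[:, t] = np.random.dirichlet(delta[:, :, t].sum(axis=0) + beta_val)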
27 / 48
Blei, Ng, Jordan (2003), JMLR
Approximate Inference by Sampling
Coincidentally, we have just specified the Latent Dirichlet Allocation (LDA) method for topic modeling.
This is the most commonly used method for topic modeling.
28 / 48
Approximate Inference by Sampling
We can now specify a full Gibbs Sampler for an LDA mixture model.
Given:
Word-document counts x_{w,d}
Number of topics K
Prior parameters α and β
Do: Learn parameters {p_d} and {θ_t} for K topics
29 / 48
Approximate Inference by Sampling
Step 0: Initialize parameters {p_d} and {θ_t}:
p_d ∼ Dir(α) and θ_t ∼ Dir(β)
30 / 48
Approximate Inference by Sampling
Step 1: Sample Δ_{w,d,t} based on current parameters {p_d} and {θ_t}:
Δ_{w,d,·} ∼ Mult_{x_{w,d}}(γ_{w,d,1}, …, γ_{w,d,T})
31 / 48
Approximate Inference by Sampling
Step 2: Sample parameters from
p_{t,d} ∼ Dir(∑_w Δ_{w,d,t} + α)
and
θ_{w,t} ∼ Dir(∑_d Δ_{w,d,t} + β)
32 / 48
Approximate Inference by Sampling
Step 3: Get samples for a few iterations (e.g., 200) as burn-in; we want to reach a stationary distribution...
33 / 48
Approximate Inference by Sampling
Step 4: Estimate Δ̂_{w,d,t} as the average of the estimates from the last m iterations (e.g., m = 500).
34 / 48
Approximate Inference by Sampling
Step 5: Estimate parameters p_d and θ_t based on the estimated Δ̂_{w,d,t}:
p̂_{t,d} = (∑_w Δ̂_{w,d,t} + α) / (∑_t ∑_w Δ̂_{w,d,t} + α)
θ̂_{w,t} = (∑_d Δ̂_{w,d,t} + β) / (∑_w ∑_d Δ̂_{w,d,t} + β)
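Putting Steps 0-5 together, here is a compact sketch of the whole sampler (an illustrative implementation under the conventions used in the snippets above, not a reference implementation: doc_mat is the num_docs x num_words count matrix x_{w,d}, p is num_topics x num_docs, theta is num_words x num_topics, and alpha, beta are scalar symmetric priors). It uses 200 burn-in iterations and averages Δ over the last 500, matching the example numbers in the steps:

import numpy as np

def lda_gibbs(doc_mat, num_topics, alpha=1.0, beta=1.0, num_iter=700, burn_in=200, seed=0):
    rng = np.random.default_rng(seed)
    num_docs, num_words = doc_mat.shape

    # Step 0: initialize p_d ~ Dir(alpha) and theta_t ~ Dir(beta)
    p = rng.dirichlet(alpha * np.ones(num_topics), size=num_docs).T      # (T, D)
    theta = rng.dirichlet(beta * np.ones(num_words), size=num_topics).T  # (W, T)

    delta_sum = np.zeros((num_docs, num_words, num_topics))
    kept = 0
    for it in range(num_iter):
        # Step 1: sample Delta_{w,d,.} ~ Mult_{x_{w,d}}(gamma_{w,d,1}, ..., gamma_{w,d,T})
        delta = np.zeros((num_docs, num_words, num_topics))
        for d in range(num_docs):
            gamma = theta * p[:, d]                    # (W, T), proportional to gamma_{w,d,t}
            gamma /= gamma.sum(axis=1, keepdims=True)  # normalize over topics
            for w in range(num_words):
                delta[d, w, :] = rng.multinomial(doc_mat[d, w], gamma[w, :])

        # Step 2: sample parameters from their Dirichlet conditionals
        for d in range(num_docs):
            p[:, d] = rng.dirichlet(delta[d].sum(axis=0) + alpha)
        for t in range(num_topics):
            theta[:, t] = rng.dirichlet(delta[:, :, t].sum(axis=0) + beta)

        # Steps 3-4: discard the burn-in, average Delta over the remaining iterations
        if it >= burn_in:
            delta_sum += delta
            kept += 1
    delta_hat = delta_sum / kept

    # Step 5: point estimates of p and theta from the averaged Delta
    p_hat = delta_hat.sum(axis=1).T + alpha     # (T, D): sum over words, add prior
    p_hat /= p_hat.sum(axis=0, keepdims=True)
    theta_hat = delta_hat.sum(axis=0) + beta    # (W, T): sum over documents, add prior
    theta_hat /= theta_hat.sum(axis=0, keepdims=True)
    return p_hat, theta_hat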
35 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
36 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
Two inference procedures:
Exact Inference with Maximum Likelihood using the EM algorithm
Approximate Inference using Gibbs Sampling
37 / 48
Mixture models
We have now seen two different mixture models: soft k-means and topic models.
Two inference procedures:
Exact Inference with Maximum Likelihood using the EM algorithm
Approximate Inference using Gibbs Sampling
Next, we will go back to Maximum Likelihood but learn about Approximate Inference using Variational Methods.
38 / 48
Variational Methods
Consider the LDA model again.
Benefits:
Full document generative model
Can process new documents (posterior over topics) and words (prior parameters)
39 / 48
Variational Methods
Consider the LDA model again.
With Gibbs we sampled from Pr(θ, Δ | x, α, β).
What if we want to estimate parameters again? (maximum a posteriori parameters)
40 / 48
Variational Methods
Consider the LDA model again.
The posterior is very difficult to maximize,
harder than pLSA due to the Dirichlet priors.
41 / 48
Variational Methods
Consider the LDA model again.
Let's get inspiration from EM: maximize a lower bound.
But what should the lower bound be?
42 / 48
Variational Methods
Consider the LDA model again.
Make missing data and parameters "independent"!
43 / 48
Variational Methods
Consider the LDA model again.
Find parameters that make the simple model most similar to the original model.
44 / 48
Variational Methods
Can then define an EM-like algorithm:
E-step: define expectation w.r.t. the approximate distribution
M-step: maximize parameters of the approximate distribution
45 / 48
Variational Methods
Net result:
1) Maximum posterior estimates
2) Super simple updates
3) With a stochastic approach (update using a few words at a time), extremely scalable
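For instance (an illustrative usage sketch, not part of the slides; the random count matrix is a stand-in for real data), scikit-learn's LatentDirichletAllocation fits LDA with this kind of online (stochastic) variational inference, following Hoffman et al. (2010):

import numpy as np
from sklearn.decomposition import LatentDirichletAllocation

doc_mat = np.random.randint(0, 5, size=(200, 1000))  # stand-in word-document counts
lda = LatentDirichletAllocation(n_components=10, learning_method="online",
                                batch_size=32, random_state=0)
doc_topics = lda.fit_transform(doc_mat)  # per-document topic proportions
topic_words = lda.components_            # per-topic word weights (unnormalized)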
46 / 48
Hoffman, et al. (2010). NIPS
Variational Methods
47 / 48
Conclusion
Probabilistic mixture models: powerful model class with many applications
Awesome historical algorithmic development
Outstanding software support
48 / 48