Latent Dirichlet Allocation (statweb.stanford.edu/~kriss1/lda_intro.pdf)
Latent Dirichlet Allocation (Blei et al.)
Kris Sankaran
2016-11-14
Introduction
Agenda
- Generative Mechanism (15 minutes): What is the proposed model, and how does it differ from what existed before?
- Interpretations (10 minutes): What are alternative ways to understand the model?
- Model Inference (15 minutes): How would we fit this model in practice?
- Examples and Conclusion (10 minutes): Why might we fit LDA in practice, and what are its limitations?
Context and Motivation
- Motivated by topic modeling:
  - Building interpretable representations of text data
  - Designing preprocessing steps for classification or information retrieval
- This said, LDA is not necessarily tied to text analysis
- Generative Modeling: Design unified probabilistic models
  - Is explicit about assumptions, feels less ad hoc
  - Gives access to the (large) Bayesian inference literature
  - Can be used as a module in larger probabilistic models
Generative Model
Latent Dirichlet Allocation
- For the nth word in document d,

$$
\begin{aligned}
w_{dn} \mid \beta, z_{dn} &\sim \mathrm{Cat}\left(\beta_{\cdot z_{dn}}\right) \\
z_{dn} \mid \theta_d &\sim \mathrm{Cat}\left(\theta_d\right) \\
\theta_d \mid \alpha &\sim \mathrm{Dir}\left(\alpha\right)
\end{aligned}
$$

- Mnemonics:
  - w_dn ∈ {1, ..., V} is the term used as the nth word in document d
  - z_dn ∈ {1, ..., K} is the topic associated with the nth word in document d
  - θ_d ∈ S^{K−1} are the topic mixture proportions for document d
  - β_{·k} ∈ S^{V−1} are the term mixture proportions for topic k
  - α is the topic shrinkage parameter
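The generative mechanism above can be simulated directly. The following sketch draws θ_d, z_dn, and w_dn in turn; the sizes D, N, K, V and variable names are ours, invented for illustration:

```python
import numpy as np

# Simulate the LDA generative process with made-up sizes:
# D documents, N words each, K topics, vocabulary of V terms.
rng = np.random.default_rng(0)
D, N, K, V = 5, 30, 3, 40
alpha = np.full(K, 0.5)                    # topic shrinkage parameter
beta = rng.dirichlet(np.ones(V), size=K)   # beta[k] = term proportions for topic k

docs = []
for d in range(D):
    theta_d = rng.dirichlet(alpha)                  # theta_d | alpha ~ Dir(alpha)
    z_d = rng.choice(K, size=N, p=theta_d)          # z_dn | theta_d ~ Cat(theta_d)
    w_d = np.array([rng.choice(V, p=beta[k]) for k in z_d])  # w_dn ~ Cat(beta_{. z_dn})
    docs.append(w_d)

assert all(w.shape == (N,) for w in docs)
```

Note that each topic is drawn anew for every word, while θ_d is shared by all words in document d.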
Latent Dirichlet Allocation
[Plate diagram: α → θ → z → w, with β → w; z and w sit inside the word plate N, and θ, z, w inside the document plate D.]

- w are observed data
- α, β are fixed, global parameters
- θ, z are random, local parameters
Observed Counts (sum of w_dn's)

[Figure: heatmap of observed term counts, words by documents, with counts ranging from 0 to 30.]
Mixing Proportions (θ_d's)

[Figure: heatmap of θ, topics by documents, with values ranging from 0.25 to 0.75.]
Topic Counts (sum of z_dn's)

[Figure: heatmaps of per-topic word counts, words by documents, one panel per topic (1, 2, 3), with z counts ranging from 10 to 30.]
Latent Dirichlet Allocation (β)

[Figure: heatmap of β, words by topics, with values ranging from 0.01 to 0.03.]
Unigram Model
- It can be illustrative to compare with earlier topic modeling approaches
- The unigram model draws all words from the same multinomial, w_dn ∼ Cat(β).

[Plate diagram: β → w; w sits inside the word plate N, nested in the document plate D.]
Mixture of Unigrams

- This is the multinomial analog of Gaussian mixture models
- Each word is drawn from a mixture of K topics,

$$
\begin{aligned}
z_d &\sim p\left(z\right) \\
w_{dn} \mid z_d &\sim \mathrm{Cat}\left(\beta_{\cdot z_d}\right)
\end{aligned}
$$

- Topic assignment is drawn at the document level

[Plate diagram: z → w, with β → w; w sits inside the word plate N, and z, w inside the document plate D.]
Probabilistic Latent Semantic Indexing (pLSI)

- pLSI draws a different topic for each word in the document,

$$
\begin{aligned}
z_{dn} \mid d &\sim p\left(z_{dn} \mid d\right) \\
w_{dn} \mid z_{dn} &\sim \mathrm{Cat}\left(\beta_{\cdot z_{dn}}\right)
\end{aligned}
$$

- The per-document topic mixture proportions are nonrandom and different for each document
- The number of fixed parameters grows linearly with the number of documents

[Plate diagram: d → z → w, with β → w; z and w sit inside the word plate N, nested in the document plate D.]
Back to LDA
- Essential difference: randomness in the topic mixture proportions lets us share information across documents
- The number of fixed parameters does not grow with the number of documents.

[Plate diagram: α → θ → z → w, with β → w; z and w sit inside the word plate N, and θ, z, w inside the document plate D.]
Interpretations
Geometric

- Each topic is a point on the simplex, and the K topics determine a topics simplex
- The mixture of unigrams model gives each document a corner of the topics simplex
- pLSI estimates the empirical distribution of observed mixing proportions
- LDA estimates a smooth density over the topics simplex
Matrix Factorization

- We can think of topics as latent factors and mixing proportions as document scores,

$$
p\left(w_{dn} = v \mid \theta_d, \beta\right) = \sum_{k=1}^{K} \beta_{vk}\, p\left(z_{dn} = k\right) = \beta_{v\cdot}^{T}\, p\left(z_{dn}\right)
$$

- The different models treat the p(z_dn)'s differently
- In LDA, p(z_dn = k) = θ_dk, so this probability is β_{v·}^T θ_d.

[Figure: the D × V matrix of p(w_dn = v) factors as the D × K matrix of θ_dk times the K × V matrix of β_kv.]
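This factorization reading can be checked numerically; in the sketch below, the sizes and the Dirichlet draws are arbitrary, invented only to illustrate the matrix shapes:

```python
import numpy as np

# Matrix factorization view: stacking p(w_dn = v) over documents and terms
# gives the D x V matrix theta @ beta^T when p(z_dn = k) = theta_dk.
rng = np.random.default_rng(0)
D, K, V = 4, 3, 10
theta = rng.dirichlet(np.ones(K), size=D)     # D x K mixing proportions (rows sum to 1)
beta = rng.dirichlet(np.ones(V), size=K).T    # V x K, column k = term proportions of topic k

word_probs = theta @ beta.T                   # (D x K) @ (K x V) = D x V
assert np.allclose(word_probs.sum(axis=1), 1.0)  # each row is a distribution over terms
```

Each row of the product is a convex combination of topic columns of β, so it remains a valid distribution over the vocabulary.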
Inference
Variational Bayes
- As scientists / modelers, our primary interest is in the posterior p(θ, z | w, α, β) after observing the words w
- This is not available in closed form (the normalizing constant is intractable)
- In practice, we also need to estimate α and β (more on this later)
Variational Bayes
- (Blei et al.) propose a variational approach
  - Turns Bayesian inference into an optimization problem
- Specifically, consider the family Γ of q's that factor like

$$
q\left(\theta, z \mid \gamma, \varphi\right) = \prod_{d=1}^{D} \left[ \mathrm{Dir}\left(\theta_d \mid \gamma_d\right) \prod_{n=1}^{N} \mathrm{Cat}\left(z_{dn} \mid \varphi_{dn}\right) \right],
$$

and try to identify

$$
\operatorname*{argmin}_{q \in \Gamma} \; \mathrm{KL}\left( q\left(\theta, z \mid \gamma, \varphi\right) \,\middle\|\, p\left(\theta, z \mid w, \alpha, \beta\right) \right)
$$
KL Minimization
- Note that

$$
\mathrm{KL}\left(q, p\right) = \mathbb{E}_q\left[ \log \frac{q\left(\theta, z \mid \gamma, \varphi\right)}{p\left(\theta, z \mid w, \alpha, \beta\right)} \right]
= -H\left(q\right) + \log p\left(w \mid \alpha, \beta\right) - \mathbb{E}_q\left[ \log p\left(\theta, z, w \mid \alpha, \beta\right) \right],
$$

and that the middle term (the "evidence") is irrelevant to our optimization.
- Hence, find γ*, φ* that maximize

$$
\mathbb{E}_q\left[ \log p\left(\theta, z, w \mid \alpha, \beta\right) \right] + H\left(q\right),
$$

the "evidence lower bound" (ELBO).
KL Minimization
- The ELBO can be written explicitly (though it's not pretty),

$$
\begin{aligned}
&\sum_{d=1}^{D} \sum_{k=1}^{K} \left(\alpha_k - 1\right) \mathbb{E}_q\left[\log \theta_{dk} \mid \gamma_d\right]
+ \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \varphi_{dnk}\, \mathbb{E}_q\left[\log \theta_{dk} \mid \gamma_d\right] \\
&\quad + \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \sum_{v=1}^{V} \mathbb{I}\left(w_{dn} = v\right) \varphi_{dnk} \log \beta_{vk} \\
&\quad - \sum_{d=1}^{D} \left[ \log \Gamma\left( \sum_{k=1}^{K} \gamma_{dk} \right) - \sum_{k=1}^{K} \log \Gamma\left(\gamma_{dk}\right) + \sum_{k=1}^{K} \left(\gamma_{dk} - 1\right) \mathbb{E}_q\left[\log \theta_{dk} \mid \gamma_d\right] \right] \\
&\quad - \sum_{d=1}^{D} \sum_{n=1}^{N} \sum_{k=1}^{K} \varphi_{dnk} \log \varphi_{dnk},
\end{aligned}
$$

where we have omitted terms constant in γ, φ.
KL Minimization
- The point is that we can perform coordinate ascent on the parameters φ and γ to find locally optimal φ* and γ*
- The updates look like

$$
\begin{aligned}
\varphi_{dnk} &\propto \beta_{w_{dn} k} \exp\left( \mathbb{E}_q\left[\log \theta_{dk} \mid \gamma_d\right] \right) \\
\gamma_{dk} &= \alpha_k + \sum_{n=1}^{N} \varphi_{dnk}
\end{aligned}
$$

- Interpretation
  - The first update is like p(z_dn | w_dn) ∝ p(w_dn | z_dn) p(z_dn)
  - The second update is like a Dirichlet posterior update upon observing data φ_dnk
  - The φ_dnk are the same across occurrences of the same term → saves memory
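These coordinate ascent updates can be sketched for a single document, assuming β and α are known; the sizes, data, and variable names below are our own invention, not code from the paper:

```python
import numpy as np
from scipy.special import digamma

# Variational coordinate ascent for one document d, with beta and alpha fixed.
# E_q[log theta_dk | gamma_d] = digamma(gamma_dk) - digamma(sum_j gamma_dj)
# is the standard Dirichlet expectation.
rng = np.random.default_rng(0)
V, K, N = 50, 3, 20
beta = rng.dirichlet(np.ones(V), size=K).T    # V x K; beta[v, k] = p(term v | topic k)
alpha = np.full(K, 0.1)
words = rng.integers(0, V, size=N)            # w_dn for the document

gamma = alpha + N / K                         # a common initialization
for _ in range(100):
    # phi update: phi_dnk proportional to beta[w_dn, k] * exp(E_q[log theta_dk])
    e_log_theta = digamma(gamma) - digamma(gamma.sum())
    phi = beta[words] * np.exp(e_log_theta)   # N x K
    phi /= phi.sum(axis=1, keepdims=True)
    # gamma update: gamma_dk = alpha_k + sum_n phi_dnk
    gamma = alpha + phi.sum(axis=0)

# Since each row of phi sums to 1, gamma's total mass is alpha's mass plus N.
assert np.isclose(gamma.sum(), alpha.sum() + N)
```

Note how the φ rows for repeated terms are identical, which is what lets an implementation store one row per distinct term.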
Estimating α, β

- So far, we have assumed the fixed parameters α, β are known, when in practice they aren't
- (Blei et al.) propose two approaches
  - Variational EM: here, the ELBO takes the place of the usual expected complete log-likelihood, and we alternate between optimizing φ_dnk, γ_d (variational E-step) and α, β (variational M-step)
  - Smoothed variational Bayes: place a Dirichlet prior on β and introduce this into the variational approximation. The variational M-step now only optimizes α.
- The smoothed Bayesian approach is better when ML estimates of β are unreliable (e.g., when data are sparse).
Conclusion
Examples

(Blei et al.) compare approaches on a variety of topic modeling tasks,

- Directly fitting to the Associated Press corpus, evaluated using held-out likelihood
- As preprocessing for classification on the Reuters data
- Collaborative filtering: evaluate likelihood on held-out movies, instead of words
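For reference, fitting LDA today takes a few lines with scikit-learn, whose implementation follows the online variational Bayes of (M. Hoffman et al.) rather than the paper's batch algorithm; the toy corpus below is invented for illustration:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Fit a 2-topic LDA to a tiny made-up corpus and recover per-document
# topic proportions (the theta_d's).
docs = [
    "the cat sat on the mat",
    "dogs and cats are pets",
    "stocks fell as markets closed",
    "investors sold shares in the market",
]
counts = CountVectorizer().fit_transform(docs)   # document-term count matrix
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(counts)

theta = lda.transform(counts)                    # D x K topic proportions; rows sum to 1
print(theta.shape)                               # (4, 2)
```

Held-out likelihood comparisons like those in the paper correspond to `lda.score` / `lda.perplexity` on a test document-term matrix.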
Conclusion
- The basic LDA model can be easily extended by removing various exchangeability assumptions (D. M. Blei and J. D. Lafferty, D. Blei and J. Lafferty, Lacoste-Julien et al.)
- More generally, the three-level hierarchical Bayesian idea opens the door to a variety of "mixed-membership" models (Airoldi et al., Erosheva and Fienberg, Mackey et al., Fox and Jordan)
- Alternative MCMC, variational inference, and method of moments techniques are still an active area of research (M. Hoffman et al., Anandkumar et al., M. D. Hoffman et al., Teh et al.)
References

Airoldi, Edoardo M., et al. "Mixed Membership Stochastic Blockmodels." Journal of Machine Learning Research, vol. 9, 2008, pp. 1981–2014.

Anandkumar, Anima, et al. "A Spectral Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2012, pp. 917–925.

Blei, David M., and John D. Lafferty. "Dynamic Topic Models." Proceedings of the 23rd International Conference on Machine Learning, ACM, 2006, pp. 113–120.

Blei, David M., et al. "Latent Dirichlet Allocation." Journal of Machine Learning Research, vol. 3, 2003, pp. 993–1022.

Blei, David, and John Lafferty. "Correlated Topic Models." Advances in Neural Information Processing Systems, vol. 18, 2006, p. 147.

Erosheva, Elena A., and Stephen E. Fienberg. "Bayesian Mixed Membership Models for Soft Clustering and Classification." Classification—The Ubiquitous Challenge, Springer, 2005, pp. 11–26.

Fox, Emily B., and Michael I. Jordan. "Mixed Membership Models for Time Series." ArXiv preprint arXiv:1309.3533, 2013.

Hoffman, Matthew D., et al. "Stochastic Variational Inference." Journal of Machine Learning Research, vol. 14, no. 1, 2013, pp. 1303–1347.

Hoffman, Matthew, et al. "Online Learning for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2010, pp. 856–864.

Lacoste-Julien, Simon, et al. "DiscLDA: Discriminative Learning for Dimensionality Reduction and Classification." Advances in Neural Information Processing Systems, 2009, pp. 897–904.

Mackey, Lester W., et al. "Mixed Membership Matrix Factorization." Proceedings of the 27th International Conference on Machine Learning (ICML-10), 2010, pp. 711–718.

Teh, Yee W., et al. "A Collapsed Variational Bayesian Inference Algorithm for Latent Dirichlet Allocation." Advances in Neural Information Processing Systems, 2006, pp. 1353–1360.