statistical topic modelingmccallum/courses/gm2011/20-lda.pdftopic modeling three fundamental...
TRANSCRIPT
![Page 1: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/1.jpg)
Statistical Topic Modeling
Hanna M. WallachUniversity of Massachusetts Amherst
![Page 2: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/2.jpg)
Hanna M. Wallach :: UMass Amherst :: 2
Text as Data
● Structured and formal: e.g., publications, patents, press releases
● Messy and unstructured: e.g., chat logs, OCRed documents, transcripts
⇒ Large scale, robust methods for analyzing text
![Page 3: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/3.jpg)
Hanna M. Wallach :: UMass Amherst :: 3
Topic Modeling
● Three fundamental assumptions:
– Documents have latent semantic structure (“topics”)
– We can infer topics from word–document co-occurrences
– Can simulate this inference algorithmically
● Given a data set, the goal is to
– Learn the composition of the topics that best represent it
– Learn which topics are used in each document
![Page 4: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/4.jpg)
Hanna M. Wallach :: UMass Amherst :: 4
Latent Semantic Analysis (LSA)
● Based on ideas from linear algebra
● Form sparse term–document co-occurrence matrix X
– Raw counts or (more likely) TF-IDF weights
● Use SVD to decompose X into 3 matrices:
– U relates terms to “concepts”
– V relates “concepts” to documents
– Σ is a diagonal matrix of singular values
(Deerwester et al., 1990)
![Page 5: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/5.jpg)
Hanna M. Wallach :: UMass Amherst :: 5
Singular Value Decomposition
1. Latent semantic analysis (LSA) is a theory and method for ...2. Probabilistic latent semantic analysis is a probabilistic ...3. Latent Dirichlet allocation, a generative probabilistic model ...
0 0 11 1 00 0 10 0 11 1 11 0 10 2 11 1 0
...
1 2 3
allocationanalysisDirichlet
generativelatent
LSAprobabilistic
semantic...
![Page 6: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/6.jpg)
Hanna M. Wallach :: UMass Amherst :: 6
Generative Statistical Modeling
● Assume data was generated by a probabilistic model:
– Model may have hidden structure (latent variables)
– Model defines a joint distribution over all variables
– Model parameters are unknown
● Infer hidden structure and model parameters from data
● Situate new data in estimated model
![Page 7: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/7.jpg)
Hanna M. Wallach :: UMass Amherst :: 7
Topics and Words
human evolution disease computer
genome evolutionary host models
dna species bacteria information
genetic organisms diseases data
genes life resistance computers
sequence origin bacterial system
gene biology new network
molecular groups strains systems
sequencing phylogenetic control model
map living infectious parallel
... ... ... ...
pro
babili
ty
![Page 8: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/8.jpg)
Hanna M. Wallach :: UMass Amherst :: 8
Mixtures vs. Admixtures
topics
documents
“mixture”
topics
documents
“admixture”
![Page 9: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/9.jpg)
Hanna M. Wallach :: UMass Amherst :: 9
Documents and Topics
![Page 10: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/10.jpg)
Hanna M. Wallach :: UMass Amherst :: 10
Generative Process
probability
...
![Page 11: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/11.jpg)
Hanna M. Wallach :: UMass Amherst :: 11
Choose a Distribution Over Topics
probability
...
![Page 12: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/12.jpg)
Hanna M. Wallach :: UMass Amherst :: 12
Choose a Topic
probability
...
![Page 13: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/13.jpg)
Hanna M. Wallach :: UMass Amherst :: 13
Choose a Word
probability
...
![Page 14: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/14.jpg)
Hanna M. Wallach :: UMass Amherst :: 14
… And So On
probability
...
![Page 15: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/15.jpg)
Hanna M. Wallach :: UMass Amherst :: 15
Directed Graphical Models
● Nodes: random variables (latent or observed)
● Edges: probabilistic dependencies between variables
● Plates: “macros” that allow subgraphs to be replicated
![Page 16: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/16.jpg)
Hanna M. Wallach :: UMass Amherst :: 16
Probabilistic LSA
topics
observedword
document-specifictopic distribution
topicassignment
[Hofmann, '99]
![Page 17: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/17.jpg)
Hanna M. Wallach :: UMass Amherst :: 17
Strengths and Weaknesses
✔ Probabilistic graphical model: can be extended and embedded into other more complicated models
✗ Not a well-defined generative model: no way of generalizing to new, unseen documents
✗ Many free parameters: linear in # training documents
✗ Prone to overfitting: have to be careful when training
![Page 18: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/18.jpg)
Hanna M. Wallach :: UMass Amherst :: 18
Latent Dirichlet Allocation (LDA)
Dirichletdistribution
topics
observedword
document-specifictopic distribution
topicassignment
Dirichletdistribution
[Blei, Ng & Jordan, '03]
![Page 19: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/19.jpg)
Hanna M. Wallach :: UMass Amherst :: 19
Discrete Probability Distributions
● 3-dimensional discrete probability distributions can be visually represented in 2-dimensional space:
A
B C
![Page 20: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/20.jpg)
Hanna M. Wallach :: UMass Amherst :: 20
Dirichlet Distribution
● Distribution over discrete probability distributions:
base measure (mean)
scalar concentrationparameter
A
B C
...
≡
![Page 21: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/21.jpg)
Hanna M. Wallach :: UMass Amherst :: 21
Dirichlet Parameters
![Page 22: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/22.jpg)
Hanna M. Wallach :: UMass Amherst :: 22
Dirichlet Priors for LDA
asymmetric priors:nonuniform base measures
![Page 23: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/23.jpg)
Hanna M. Wallach :: UMass Amherst :: 23
Dirichlet Priors for LDA
symmetric priors:uniform base measures
![Page 24: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/24.jpg)
Hanna M. Wallach :: UMass Amherst :: 24
Real Data: Statistical Inference
probability
...
?
??
??
![Page 25: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/25.jpg)
Hanna M. Wallach :: UMass Amherst :: 25
Posterior Inference
● Infer or integrate out all latent variables, given tokens
latent variables
![Page 26: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/26.jpg)
Hanna M. Wallach :: UMass Amherst :: 26
Posterior Inference
latent variables
![Page 27: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/27.jpg)
Hanna M. Wallach :: UMass Amherst :: 27
Inference Algorithms
● Exact inference in LDA is not tractable
● Approximate inference algorithms:
– Mean field variational inference (Blei et al., 2001; 2003)
– Expectation propagation (Minka & Lafferty, 2002)
– Collapsed Gibbs sampling (Griffiths & Steyvers, 2002)
– Collapsed variational inference (Teh et al., 2006)
● Each method has advantages and disadvantages
![Page 28: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/28.jpg)
Hanna M. Wallach :: UMass Amherst :: 28
The End Result...
probability
...
![Page 29: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/29.jpg)
Hanna M. Wallach :: UMass Amherst :: 29
Evaluating LDA
● Unsupervised nature of LDA makes evaluation hard
● Compute probability of held-out documents:
– Classic way of evaluating generative models
– Often used to evaluate topic models
● Problem: have to approximate an intractable sum
![Page 30: Statistical Topic Modelingmccallum/courses/gm2011/20-lda.pdfTopic Modeling Three fundamental assumptions: – Documents have latent semantic structure (“topics”) – We can infer](https://reader036.vdocument.in/reader036/viewer/2022081402/605388cf9055ad47871ea97e/html5/thumbnails/30.jpg)
Hanna M. Wallach :: UMass Amherst :: 30
Computing Log Probability
● Simple importance sampling methods
● The “harmonic mean” method (Newton & Raftery, 1994)
– Known to overestimate, used anyway
● Annealed importance sampling (Neal, 2001)
– Prohibitively slow for large collections of documents
● Chib-style method (Murray & Salakhutdinov, 2009)
● “Left-to-Right” method (Wallach, 2008)
(Wallach et al., 2009)