Document and Topic Models: pLSA and LDA
Andrew Levandoski and Jonathan Lobo
CS 3750 Advanced Topics in Machine Learning
2 October 2018
Outline
• Topic Models
• pLSA
  • LSA
  • Model
  • Fitting via EM
  • pHITS: link analysis
• LDA
  • Dirichlet distribution
  • Generative process
  • Model
  • Geometric interpretation
  • Inference
Topic Models: Visual Representation
[Figure: topics, documents, and topic proportions and assignments]
Topic Models: Importance
• For a given corpus, we learn two things:
  1. Topic: from the full vocabulary set, we learn important subsets
  2. Topic proportion: we learn what each document is about
• This can be viewed as a form of dimensionality reduction
  • From the large vocabulary set, extract basis vectors (topics)
  • Represent each document in topic space (topic proportions)
  • Dimensionality is reduced from $w_d \in \mathbb{Z}^{V}$ to $\theta \in \mathbb{R}^{K}$
• Topic proportions are useful for several applications, including document classification, discovery of semantic structures, sentiment analysis, object localization in images, etc.
Topic Models: Terminology
• Document model
  • Word: element in a vocabulary set
  • Document: collection of words
  • Corpus: collection of documents
• Topic model
  • Topic: collection of words (subset of vocabulary)
  • Document is represented by a (latent) mixture of topics
    • $P(w|d) = \sum_{z} P(w|z)\,P(z|d)$  (z: topic)
• Note: a document is a collection of words (not a sequence)
  • "Bag of words" assumption
  • In probability, we call this the exchangeability assumption
    • $P(w_1, \ldots, w_N) = P(w_{\pi(1)}, \ldots, w_{\pi(N)})$  (π: permutation)
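To make the bag-of-words view concrete, here is a minimal sketch (the toy documents and helper names are made up for illustration) that turns documents into count vectors, discarding word order:

```python
from collections import Counter

# Toy corpus (hypothetical); the bag-of-words view keeps only counts,
# so any permutation of a document's words yields the same vector.
docs = [["topic", "model", "topic", "word"],
        ["word", "document", "corpus", "word"]]

vocab = sorted({w for doc in docs for w in doc})   # vocabulary indexed {1, ..., V}

def bow_vector(doc):
    """Count vector of length V; word order is discarded."""
    counts = Counter(doc)
    return [counts.get(w, 0) for w in vocab]

for doc in docs:
    print(bow_vector(doc))
```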
Topic Models: Terminology (cont'd)
• Represent each document as a vector in a vector space
• A word is an item from a vocabulary indexed by $\{1, \ldots, V\}$. We represent words using unit-basis vectors: the $v$-th word is represented by a $V$-vector $w$ such that $w_v = 1$ and $w_u = 0$ for $u \neq v$.
• A document is a sequence of $N$ words denoted by $\mathbf{w} = (w_1, w_2, \ldots, w_N)$, where $w_n$ is the $n$-th word in the sequence.
• A corpus is a collection of $M$ documents denoted by $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$.
Probabilistic Latent Semantic Analysis (pLSA)
Motivation
• Learning from text and natural language
• Learning the meaning and usage of words without prior linguistic knowledge
• Modeling semantics
  • Account for polysemes and similar words
  • Difference between what is said and what is meant
Vector Space Model
• Want to represent documents and terms as vectors in a lower-dimensional space
• $N \times M$ word–document co-occurrence matrix $A$:
  $D = \{d_1, \ldots, d_N\}$
  $W = \{w_1, \ldots, w_M\}$
  $A = \bigl(n(d_i, w_j)\bigr)_{ij}$
• Limitations: high dimensionality, noisy, sparse
• Solution: map to a lower-dimensional latent semantic space using SVD
Latent Semantic Analysis (LSA)
• Goal
  • Map the high-dimensional vector space representation to a lower-dimensional representation in latent semantic space
  • Reveal semantic relations between documents (count vectors)
• SVD
  • $N = U \Sigma V^{T}$
  • $U$: orthogonal matrix with left singular vectors (eigenvectors of $N N^{T}$)
  • $V$: orthogonal matrix with right singular vectors (eigenvectors of $N^{T} N$)
  • $\Sigma$: diagonal matrix with the singular values of $N$
• Select the $k$ largest singular values from $\Sigma$ to get an approximation $\tilde{N}$ with minimal error
• Can compute similarity values between document vectors and term vectors
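The SVD step can be sketched in a few lines of NumPy. The toy count matrix, the choice k = 2, and the similarity computation below are illustrative assumptions, not part of the original slides:

```python
import numpy as np

# Hypothetical document-word count matrix A (rows: documents, columns: words).
A = np.array([[2., 0., 1., 0., 0.],
              [1., 1., 0., 0., 1.],
              [0., 2., 3., 1., 0.],
              [0., 0., 1., 2., 2.]])

U, s, Vt = np.linalg.svd(A, full_matrices=False)   # A = U Σ V^T

k = 2                                              # number of latent dimensions kept
A_k = U[:, :k] @ np.diag(s[:k]) @ Vt[:k, :]        # rank-k approximation with minimal error

# Document vectors in the k-dimensional latent semantic space.
doc_vecs = U[:, :k] * s[:k]

# Cosine similarity between documents 0 and 2 in the latent space.
a, b = doc_vecs[0], doc_vecs[2]
print(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))
```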
LSA
LSA Strengths
• Outperforms the naïve vector space model
• Unsupervised, simple
• Noise removal and robustness due to dimensionality reduction
• Can capture synonymy
• Language independent
• Can easily perform queries, clustering, and comparisons
LSA Limitations
• No probabilistic model of term occurrences
• Results are difficult to interpret
• Assumes that words and documents form a joint Gaussian model
• Arbitrary selection of the number of dimensions k
• Cannot account for polysemy
• No generative model
Probabilistic Latent Semantic Analysis (pLSA)
• Difference between topics and words?
  • Words are observable
  • Topics are not; they are latent
• Aspect model
  • Associates an unobserved latent class variable $z \in Z = \{z_1, \ldots, z_K\}$ with each observation
  • Defines a joint probability model over documents and words
  • Assumes w is independent of d conditioned on z
  • Cardinality of z should be much smaller than that of d and w
pLSA Model Formulation
• Basic generative model
  • Select a document d with probability P(d)
  • Select a latent class z with probability P(z|d)
  • Generate a word w with probability P(w|z)
• Joint probability model

$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$
pLSA Graphical Model Representation
[Figure: graphical model representations of the two equivalent parameterizations]

$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$

$P(d, w) = \sum_{z \in Z} P(z)\,P(d|z)\,P(w|z)$
pLSA Joint Probability Model
$P(d, w) = P(d)\,P(w|d), \qquad P(w|d) = \sum_{z \in Z} P(w|z)\,P(z|d)$

Maximize:

$\mathcal{L} = \sum_{d \in D} \sum_{w \in W} n(d, w)\,\log P(d, w)$

This corresponds to minimizing the KL divergence (cross-entropy) between the empirical distribution of words and the model distribution P(w|d).
Probabilistic Latent Semantic Space
• P(w|d) for all documents is approximated by a multinomial combination of all factors P(w|z)
• The weights P(z|d) uniquely define a point in the latent semantic space and represent how topics are mixed in a document
Probabilistic Latent Semantic Space
• Topic represented by a probability distribution over words
  $z_i = (w_1, \ldots, w_m)$, e.g. $z_1 = (0.3, 0.1, 0.2, 0.3, 0.1)$
• Document represented by a probability distribution over topics
  $d_i = (z_1, \ldots, z_k)$, e.g. $d_1 = (0.5, 0.3, 0.2)$
Model Fitting via Expectation Maximization
• E-step: compute posterior probabilities for the latent variables z using the current parameters

$P(z|d, w) = \frac{P(z)\,P(d|z)\,P(w|z)}{\sum_{z'} P(z')\,P(d|z')\,P(w|z')}$

• M-step: update the parameters using the computed posterior probabilities

$P(w|z) = \frac{\sum_{d} n(d, w)\,P(z|d, w)}{\sum_{d, w'} n(d, w')\,P(z|d, w')}$

$P(d|z) = \frac{\sum_{w} n(d, w)\,P(z|d, w)}{\sum_{d', w} n(d', w)\,P(z|d', w)}$

$P(z) = \frac{1}{R} \sum_{d, w} n(d, w)\,P(z|d, w), \qquad R \equiv \sum_{d, w} n(d, w)$
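A compact NumPy sketch of these E- and M-step updates is shown below; the toy count matrix n(d, w), the number of topics K, and the iteration count are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_dw = rng.integers(0, 5, size=(6, 10)).astype(float)   # toy counts n(d, w)
D, W = n_dw.shape
K = 3                                                    # number of latent topics

# Random initialization of P(z), P(w|z), P(d|z), each normalized.
p_z  = np.full(K, 1.0 / K)
p_wz = rng.random((K, W)); p_wz /= p_wz.sum(axis=1, keepdims=True)
p_dz = rng.random((K, D)); p_dz /= p_dz.sum(axis=1, keepdims=True)

for _ in range(50):
    # E-step: P(z|d,w) ∝ P(z) P(d|z) P(w|z), normalized over z.
    post = p_z[:, None, None] * p_dz[:, :, None] * p_wz[:, None, :]   # (K, D, W)
    post /= post.sum(axis=0, keepdims=True) + 1e-12

    # M-step: re-estimate P(w|z), P(d|z), P(z) from expected counts n(d,w) P(z|d,w).
    expected = n_dw[None, :, :] * post                                # (K, D, W)
    p_wz = expected.sum(axis=1); p_wz /= p_wz.sum(axis=1, keepdims=True)
    p_dz = expected.sum(axis=2); p_dz /= p_dz.sum(axis=1, keepdims=True)
    p_z  = expected.sum(axis=(1, 2)); p_z /= p_z.sum()

print(np.round(p_wz, 3))
```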
pLSA Strengths
• Models word–document co-occurrences as a mixture of conditionally independent multinomial distributions
• A mixture model, not a clustering model
• Results have a clear probabilistic interpretation
• Allows for model combination
• The problem of polysemy is better addressed
pLSA Limitations
• Potentially higher computational complexity
• EM algorithm gives a local maximum
• Prone to overfitting
  • Solution: tempered EM
• Not a well-defined generative model for new documents
  • Solution: latent Dirichlet allocation
pLSA Model Fitting Revisited
• Tempered EM
  • Goals: maximize performance on unseen data, accelerate the fitting process
  • Define a control parameter β that is continuously modified
  • Modified E-step:

$P_{\beta}(z|d, w) = \frac{P(z)\,\bigl[P(d|z)\,P(w|z)\bigr]^{\beta}}{\sum_{z'} P(z')\,\bigl[P(d|z')\,P(w|z')\bigr]^{\beta}}$
Tempered EM Steps
1) Split data into training and validation sets
2) Set β to 1
3) Perform EM on the training set until performance on the validation set decreases
4) Decrease β by setting it to ηβ, where η < 1, and go back to step 3
5) Stop when decreasing β gives no further improvement
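A sketch of the tempered E-step only, assuming the P(z), P(d|z), P(w|z) arrays from the pLSA sketch above; the β schedule (multiplying β by η whenever validation performance stops improving) follows the steps listed here:

```python
import numpy as np

def tempered_e_step(p_z, p_dz, p_wz, beta):
    """P_beta(z|d,w) ∝ P(z) [P(d|z) P(w|z)]^beta; beta = 1 recovers the standard E-step."""
    post = p_z[:, None, None] * (p_dz[:, :, None] * p_wz[:, None, :]) ** beta
    return post / (post.sum(axis=0, keepdims=True) + 1e-12)
```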
Example: Identifying Authoritative Documents
HITS
• Hubs and authorities
  • Each webpage has an authority score x and a hub score y
  • Authority – value of the content on the page to a community; likelihood of being cited
  • Hub – value of links to other pages; likelihood of citing authorities
• A good hub points to many good authorities
• A good authority is pointed to by many good hubs
• Principal components correspond to different communities
  • Identify the principal eigenvector of the co-citation matrix
HITS Drawbacks
• Uses only the largest eigenvectors, which are not necessarily the only relevant communities
• Authoritative documents in smaller communities may be given no credit
• Solution: probabilistic HITS (pHITS)
pHITS

$P(d, c) = \sum_{z} P(z)\,P(d|z)\,P(c|z)$

[Figure: documents d and citations c linked through latent communities z, with factors P(d|z) and P(c|z)]
Interpreting pHITS Results
• Explain d and c in terms of the latent variable "community"
• Authority score: P(c|z) – probability of a document being cited from within community z
• Hub score: P(d|z) – probability that document d contains a reference to community z
• Community membership: P(z|c) – used to classify documents
Joint Model of pLSA and pHITS
• Joint probabilistic model of document content (pLSA) and connectivity (pHITS)
  • Able to answer questions on both structure and content
• The model can use evidence about link structure to make predictions about document content, and vice versa
• Reference flow – connection between one topic and another
• Maximize the log-likelihood function:

$\mathcal{L} = \sum_{j} \left[ \alpha \sum_{i} \frac{N_{ij}}{\sum_{i'} N_{i'j}} \log \sum_{k} P(w_i|z_k)\,P(z_k|d_j) + (1-\alpha) \sum_{l} \frac{A_{lj}}{\sum_{l'} A_{l'j}} \log \sum_{k} P(c_l|z_k)\,P(z_k|d_j) \right]$
pLSA: Main Deficiencies
• Incomplete in that it provides no probabilistic model at the document level, i.e., no proper priors are defined
• Each document is represented as a list of numbers (the mixing proportions for topics), and there is no generative probabilistic model for these numbers; thus:
  1. The number of parameters in the model grows linearly with the size of the corpus, leading to overfitting
  2. It is unclear how to assign probability to a document outside of the training set
• Latent Dirichlet allocation (LDA) captures the exchangeability of both words and documents using a Dirichlet distribution, allowing a coherent generative process for test data
Latent Dirichlet Allocation (LDA)
LDA: Dirichlet Distribution
• A "distribution of distributions"
• Multivariate distribution whose components all take values on (0, 1) and sum to one
• Parameterized by the vector α, which has the same number of elements (k) as our multinomial parameter θ
• Generalization of the beta distribution to multiple dimensions
• The α hyperparameter controls the mixture of topics for a given document
• The β hyperparameter controls the distribution of words per topic
• Note: ideally we want our composites (documents) to be made up of only a few topics and our parts (words) to belong to only some of the topics. With this in mind, α and β are typically set below one.
LDA: Dirichlet Distribution (cont'd)
• A k-dimensional Dirichlet random variable θ can take values in the (k−1)-simplex (a k-vector θ lies in the (k−1)-simplex if $\theta_i \geq 0$ and $\sum_{i=1}^{k} \theta_i = 1$) and has the following probability density on this simplex:

$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k} \alpha_i\right)}{\prod_{i=1}^{k} \Gamma(\alpha_i)}\, \theta_1^{\alpha_1 - 1} \cdots \theta_k^{\alpha_k - 1},$

where the parameter α is a k-vector with components $\alpha_i > 0$ and where $\Gamma(x)$ is the Gamma function.

• The Dirichlet is a convenient distribution on the simplex:
  • It is in the exponential family
  • It has finite-dimensional sufficient statistics
  • It is conjugate to the multinomial distribution
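A small sketch of how the concentration parameter shapes Dirichlet draws on the simplex (the α values below are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)

# Draws from a 3-dimensional Dirichlet: each sample lies on the 2-simplex
# (non-negative components summing to one).
for alpha in ([0.1, 0.1, 0.1], [1.0, 1.0, 1.0], [10.0, 10.0, 10.0]):
    theta = rng.dirichlet(alpha, size=3)
    print(alpha, np.round(theta, 2))
# alpha < 1: samples concentrate near the corners (sparse topic mixtures);
# alpha > 1: samples concentrate near the center of the simplex (even mixtures).
```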
LDA: Generative Process

LDA assumes the following generative process for each document w in a corpus D:
1. Choose $N \sim \mathrm{Poisson}(\xi)$.
2. Choose $\theta \sim \mathrm{Dir}(\alpha)$.
3. For each of the N words $w_n$:
   a. Choose a topic $z_n \sim \mathrm{Multinomial}(\theta)$.
   b. Choose a word $w_n$ from $p(w_n \mid z_n, \beta)$, a multinomial probability conditioned on the topic $z_n$.

Example: assume a group of articles that can be broken down by three topics described by the following words:
• Animals: dog, cat, chicken, nature, zoo
• Cooking: oven, food, restaurant, plates, taste
• Politics: Republican, Democrat, Congress, ineffective, divisive

To generate a new document that is 80% about animals and 20% about cooking:
• Choose the length of the article (say, 1000 words)
• Choose a topic based on the specified mixture (~800 words will come from the topic "animals")
• Choose a word based on the word distribution for each topic
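This generative process can be sketched directly; the topic-word distributions β, the prior α, and the small vocabulary below are toy values chosen only to mirror the example above:

```python
import numpy as np

rng = np.random.default_rng(1)

vocab = ["dog", "cat", "zoo", "oven", "food", "taste", "congress", "divisive"]
# Hypothetical per-topic word distributions beta (each row sums to one).
beta = np.array([[.30, .30, .30, .02, .02, .02, .02, .02],   # "animals"
                 [.02, .02, .02, .30, .30, .30, .02, .02],   # "cooking"
                 [.02, .02, .02, .02, .02, .02, .44, .44]])  # "politics"
alpha = np.array([0.8, 0.2, 0.2])            # Dirichlet prior on topic proportions

N = rng.poisson(10)                          # 1. choose document length N ~ Poisson(xi)
theta = rng.dirichlet(alpha)                 # 2. choose topic proportions theta ~ Dir(alpha)
words = []
for _ in range(N):                           # 3. for each word position
    z = rng.choice(3, p=theta)               #    a. choose a topic z_n ~ Multinomial(theta)
    words.append(rng.choice(vocab, p=beta[z]))  # b. choose w_n from p(w | z_n, beta)
print(theta.round(2), words)
```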
LDA: Model (Plate Notation)
α is the parameter of the Dirichlet prior on the per-document topic distributions,
β is the parameter of the Dirichlet prior on the per-topic word distributions,
θ_d is the topic distribution for document d,
z_dn is the topic for the n-th word in document d, and
w_dn is the observed word.
LDA: Model
[Figure: plate-notation diagram. α: K-vector parameter of the Dirichlet prior; θ_d: per-document topic distribution; z_dn ∈ {1, …, K}: topic of the n-th word of document d; β_kv = P(w = v | z = k): per-topic word distribution over the V vocabulary words; plates range over the M documents, the N_d words of each document, and the K topics.]
LDA: Model (cont'd)

• α controls the mixture of topics for a given document
• β controls the distribution of words per topic
LDA: Model (cont'd)

Given the parameters α and β, the joint distribution of a topic mixture θ, a set of N topics z, and a set of N words w is given by:

$p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta) = p(\theta \mid \alpha) \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta),$

where $p(z_n \mid \theta)$ is simply $\theta_i$ for the unique $i$ such that $z_n^i = 1$. Integrating over θ and summing over z, we obtain the marginal distribution of a document:

$p(\mathbf{w} \mid \alpha, \beta) = \int p(\theta \mid \alpha) \left( \prod_{n=1}^{N} \sum_{z_n} p(z_n \mid \theta)\, p(w_n \mid z_n, \beta) \right) d\theta.$

Finally, taking the product of the marginal probabilities of single documents, we obtain the probability of a corpus:

$p(D \mid \alpha, \beta) = \prod_{d=1}^{M} \int p(\theta_d \mid \alpha) \left( \prod_{n=1}^{N_d} \sum_{z_{dn}} p(z_{dn} \mid \theta_d)\, p(w_{dn} \mid z_{dn}, \beta) \right) d\theta_d.$
LDA: Exchangeability
• A finite set of random variables $\{x_1, \ldots, x_N\}$ is said to be exchangeable if the joint distribution is invariant to permutation. If π is a permutation of the integers from 1 to N:

$p(x_1, \ldots, x_N) = p(x_{\pi(1)}, \ldots, x_{\pi(N)})$

• An infinite sequence of random variables is infinitely exchangeable if every finite subsequence is exchangeable
• We assume that words are generated by topics and that those topics are infinitely exchangeable within a document
• By de Finetti's theorem:

$p(\mathbf{w}, \mathbf{z}) = \int p(\theta) \left( \prod_{n=1}^{N} p(z_n \mid \theta)\, p(w_n \mid z_n) \right) d\theta$
LDA vs. other latent variable models
• Unigram model: $p(\mathbf{w}) = \prod_{n=1}^{N} p(w_n)$
• Mixture of unigrams: $p(\mathbf{w}) = \sum_{z} p(z) \prod_{n=1}^{N} p(w_n \mid z)$
• pLSI: $p(d, w_n) = p(d) \sum_{z} p(w_n \mid z)\, p(z \mid d)$
LDA: Geometric Interpretation
• Topic simplex for three topics embedded in the word simplex for three words
• Corners of the word simplex correspond to the three distributions where each word has probability one
• Corners of the topic simplex correspond to three different distributions over words
• The mixture of unigrams places each document at one of the corners of the topic simplex
• pLSI induces an empirical distribution on the topic simplex (denoted by diamonds)
• LDA places a smooth distribution on the topic simplex (denoted by the contour lines)
LDA: Goal of Inference
LDA inputs: the set of words per document, for each document in the corpus

LDA outputs:
• Corpus-wide topic vocabulary distributions
• Topic assignments per word
• Topic proportions per document
LDA: Inference
The key inferential problem we need to solve with LDA is computing the posterior distribution of the hidden variables given a document:

$p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta) = \frac{p(\theta, \mathbf{z}, \mathbf{w} \mid \alpha, \beta)}{p(\mathbf{w} \mid \alpha, \beta)}$

This distribution is intractable to compute in general: normalizing it requires marginalizing over the hidden variables, and the resulting integral cannot be solved in closed form:

$p(\mathbf{w} \mid \alpha, \beta) = \frac{\Gamma\!\left(\sum_{i} \alpha_i\right)}{\prod_{i} \Gamma(\alpha_i)} \int \left( \prod_{i=1}^{k} \theta_i^{\alpha_i - 1} \right) \left( \prod_{n=1}^{N} \sum_{i=1}^{k} \prod_{j=1}^{V} (\theta_i \beta_{ij})^{w_n^j} \right) d\theta$
LDA: Variational Inference
• Basic idea: make use of Jensen's inequality to obtain an adjustable lower bound on the log likelihood
• Consider a family of lower bounds indexed by a set of variational parameters, chosen by an optimization procedure that attempts to find the tightest possible lower bound
• The problematic coupling between θ and β arises due to the edges between θ, z, and w. By dropping these edges and the w nodes, we obtain a family of distributions on the latent variables characterized by the following variational distribution:

$q(\theta, \mathbf{z} \mid \gamma, \phi) = q(\theta \mid \gamma) \prod_{n=1}^{N} q(z_n \mid \phi_n),$

where γ and $(\phi_1, \ldots, \phi_N)$ are the free variational parameters.
LDA: Variational Inference (cont'd)

• With this specified family of probability distributions, we set up the following optimization problem to determine γ and φ:

$(\gamma^{*}, \phi^{*}) = \arg\min_{\gamma, \phi} D\bigl(q(\theta, \mathbf{z} \mid \gamma, \phi) \,\|\, p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)\bigr)$

• The optimizing values of these parameters are found by minimizing the KL divergence between the variational distribution and the true posterior $p(\theta, \mathbf{z} \mid \mathbf{w}, \alpha, \beta)$
• By computing the derivatives of the KL divergence and setting them equal to zero, we obtain the following pair of update equations:

$\phi_{ni} \propto \beta_{i w_n} \exp\bigl(E_q[\log \theta_i \mid \gamma]\bigr)$

$\gamma_i = \alpha_i + \sum_{n=1}^{N} \phi_{ni}$

• The expectation in the multinomial update can be computed as follows:

$E_q[\log \theta_i \mid \gamma] = \Psi(\gamma_i) - \Psi\!\left(\sum_{j=1}^{k} \gamma_j\right),$

where Ψ is the first derivative of the log Γ function.
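These per-document coordinate-ascent updates can be sketched as follows, assuming a fixed (randomly generated) β, a symmetric α, and a single toy document; γ is initialized as α_i + N/k and φ uniformly:

```python
import numpy as np
from scipy.special import digamma

rng = np.random.default_rng(0)
K, V = 3, 8                                   # hypothetical sizes
alpha = np.full(K, 0.5)                       # symmetric Dirichlet prior (assumed)
beta = rng.random((K, V)); beta /= beta.sum(axis=1, keepdims=True)  # P(w|z), toy values
doc = np.array([0, 3, 3, 5, 7])               # word indices of one toy document

phi = np.full((len(doc), K), 1.0 / K)         # uniform initialization of phi
gamma = alpha + len(doc) / K                  # gamma_i = alpha_i + N/K

for _ in range(100):
    # phi_ni ∝ beta_{i, w_n} * exp(Psi(gamma_i) - Psi(sum_j gamma_j))
    e_log_theta = digamma(gamma) - digamma(gamma.sum())
    phi = beta[:, doc].T * np.exp(e_log_theta)
    phi /= phi.sum(axis=1, keepdims=True)
    # gamma_i = alpha_i + sum_n phi_ni
    gamma = alpha + phi.sum(axis=0)

print(np.round(gamma, 3))
```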
LDA: Variational Inference (cont'd)
LDA: Parameter Estimation
• Given a corpus of documents $D = \{\mathbf{w}_1, \mathbf{w}_2, \ldots, \mathbf{w}_M\}$, we wish to find α and β that maximize the marginal log likelihood of the data:

$\ell(\alpha, \beta) = \sum_{d=1}^{M} \log p(\mathbf{w}_d \mid \alpha, \beta)$

• Variational EM yields the following iterative algorithm:
  1. (E-step) For each document $d \in D$, find the optimizing values of the variational parameters $\gamma_d^{*}, \phi_d^{*}$
  2. (M-step) Maximize the resulting lower bound on the log likelihood with respect to the model parameters α and β

These two steps are repeated until the lower bound on the log likelihood converges.
LDA: Smoothing
• Introduces Dirichlet smoothing on β to avoid the zero-frequency word problem
• Fully Bayesian approach:

$q(\beta_{1:K}, \mathbf{z}_{1:M}, \theta_{1:M} \mid \lambda, \phi, \gamma) = \prod_{i=1}^{K} \mathrm{Dir}(\beta_i \mid \lambda_i) \prod_{d=1}^{M} q_d(\theta_d, \mathbf{z}_d \mid \phi_d, \gamma_d),$

where $q_d(\theta, \mathbf{z} \mid \phi, \gamma)$ is the variational distribution defined for LDA. We require an additional update for the new variational parameter λ:

$\lambda_{ij} = \eta + \sum_{d=1}^{M} \sum_{n=1}^{N_d} \phi^{*}_{dni}\, w_{dn}^{j}$
Topic Model Applications
• Information retrieval
• Visualization
• Computer vision
  • Document = image, word = "visual word"
• Bioinformatics
  • Genomic features, gene sequencing, diseases
• Modeling networks
  • Cities, social networks
pLSA / LDA Libraries
• gensim (Python)
• MALLET (Java)
• topicmodels (R)
• Stanford Topic Modeling Toolbox
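As one example, a minimal LDA run with gensim might look like the sketch below; the toy corpus and the parameter choices (num_topics, passes, alpha) are illustrative only:

```python
from gensim import corpora, models

# Toy corpus of tokenized documents (illustrative only).
texts = [["dog", "cat", "zoo", "nature"],
         ["oven", "food", "restaurant", "taste"],
         ["congress", "republican", "democrat", "divisive"],
         ["dog", "zoo", "food", "taste"]]

dictionary = corpora.Dictionary(texts)                 # word <-> id mapping
corpus = [dictionary.doc2bow(t) for t in texts]        # bag-of-words counts

lda = models.LdaModel(corpus, num_topics=2, id2word=dictionary,
                      alpha="auto", passes=20, random_state=0)

for topic_id, words in lda.show_topics(num_topics=2, num_words=4):
    print(topic_id, words)                             # per-topic word distributions
print(lda.get_document_topics(corpus[0]))              # topic proportions of document 0
```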
References
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. Latent Dirichlet Allocation. JMLR, 2003.

David Cohn and Huan Chang. Learning to Probabilistically Identify Authoritative Documents. ICML, 2000.

David Cohn and Thomas Hofmann. The Missing Link – A Probabilistic Model of Document Content and Hypertext Connectivity. NIPS, 2000.

Thomas Hofmann. Probabilistic Latent Semantic Analysis. UAI, 1999.

Thomas Hofmann. Probabilistic Latent Semantic Indexing. SIGIR, 1999.