Fast and Accurate Inference for Topic Models
James Foulds, University of California, Santa Cruz
Presented at eBay Research Labs
2
Motivation
• There is an ever-increasing wealth of digital information available
– Wikipedia
– News articles
– Scientific articles
– Literature
– Debates
– Blogs, social media …
• We would like automatic methods to help us understand this content
3
Motivation
• Personalized recommender systems
• Social network analysis
• Exploratory tools for scientists
• The digital humanities
• …
4
The Digital Humanities
5
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
6
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]
7
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]
Foxes Dogs Jumping
[40% 40% 20%]
8
Latent Variable Models
[Graphical model: parameters Φ, latent variables Z, observed data X, with a plate over data points]
• Dimensionality(X) >> dimensionality(Z)
• Z is a bottleneck, which finds a compressed, low-dimensional representation of X
Latent Feature Models for Social Networks
[Network diagram: Alice, Bob and Claire, each annotated with latent features]
Alice: Cycling, Fishing, Running
Bob: Running, Waltz
Claire: Tango, Salsa
Latent Feature Relational Model (Miller, Griffiths, Jordan, 2009)
[Network diagram: Alice, Bob and Claire with their latent features]
Z = binary entity-by-feature matrix:
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice      1        1        1
Bob                          1                     1
Claire                                1      1
14
Latent Representations
• Binary latent feature
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice      1        1        1
Bob                          1                     1
Claire                                1      1
• Latent class (each entity belongs to exactly one class)
Alice 1, Bob 1, Claire 1
• Mixed membership
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice     0.2      0.4      0.4
Bob                         0.5                    0.5
Claire                                0.9    0.1
17
Latent Variable Models as Matrix Factorization
18
Latent Variable Models as Matrix Factorization
Latent Feature Relational Model (Miller, Griffiths, Jordan, 2009)
[Network diagram: Alice, Bob and Claire with their latent features; Z is the binary entity-by-feature matrix with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]
E[Y] = σ(ZWZᵀ)
21
Topics
Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition
Distribution over all words in dictionary
A vector of discrete probabilities (sums to one)
22
Topics
Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition
Top 10 words
24
Latent Dirichlet Allocation (Blei et al., 2003)
• For each document d
– Draw its topic proportion θ(d) ~ Dirichlet(α)
– For each word w_d,n
• Draw a topic assignment z_d,n ~ Discrete(θ(d))
• Draw a word from the chosen topic w_d,n ~ Discrete(φ(z_d,n))
[Plate diagram: α → θ(d) → z_d,n → w_d,n ← φ(k) ← β]
25
Latent Dirichlet Allocation (Blei et al., 2003)
• For each topic k
– Draw its distribution over words φ(k) ~ Dirichlet(β)
• For each word w_d,n
– Draw a topic assignment z_d,n ~ Discrete(θ(d))
– Draw a word from the chosen topic w_d,n ~ Discrete(φ(z_d,n))
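To make the generative story concrete, here is a minimal sketch in Python/NumPy (the topic count, vocabulary size, document count, lengths, and hyper-parameter values below are made-up illustration values, not from the talk):

```python
import numpy as np

# Minimal sketch of LDA's generative process; all sizes are illustrative.
K, V, D, N_d = 3, 1000, 5, 50            # topics, vocabulary, documents, words/doc
alpha, beta = 0.1, 0.01                   # Dirichlet hyper-parameters

rng = np.random.default_rng(0)
phi = rng.dirichlet(beta * np.ones(V), size=K)   # one word distribution per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
    z = rng.choice(K, size=N_d, p=theta)         # topic assignment for each token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # word drawn from its topic
    docs.append(w)
```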
31
LDA as Matrix Factorization
[Matrix diagram: the document–word matrix factorizes as θ × φᵀ]
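Written out, the factorization says each document's word distribution is a mixture of the topics; with Θ the document–topic proportions and Φ the topic–word distributions:

\[ p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kw}, \qquad \text{i.e. the matrix of per-document word probabilities is } \Theta\,\Phi^{\top}. \]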
32
Let’s say we want to build an LDA topic model on Wikipedia
33
LDA on Wikipedia
[Plot: average log likelihood (y-axis, about −780 to −600) vs. wall-clock time in seconds (x-axis, log scale from 10² to 10⁵ s; markers at 10 mins, 1 hour, 6 hours, 12 hours). Curve shown: VB (10,000 documents).]
34
LDA on Wikipedia
[Same plot, now with two curves: VB (10,000 documents) and VB (100,000 documents).]
35
LDA on Wikipedia
[Same plot: VB (10,000 documents) and VB (100,000 documents), with the annotation: 1 full iteration = 3.5 days!]
36
LDA on Wikipedia
Stochastic variational inference
[Same plot, adding Stochastic VB (all documents) alongside VB (10,000 documents) and VB (100,000 documents).]
37
LDA on Wikipedia
Stochastic collapsed variational inference
[Same plot, adding SCVB0 (all documents) alongside Stochastic VB (all documents), VB (10,000 documents) and VB (100,000 documents).]
38
Available tools
             VB                            Collapsed Gibbs sampling                                    Collapsed VB
Batch        Blei et al. (2003)            Griffiths and Steyvers (2004)                               Teh et al. (2007), Asuncion et al. (2009)
Stochastic   Hoffman et al. (2010, 2013)   Mimno et al. (2012) (partially collapsed VB/Gibbs hybrid)   ???
40
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Marginalize out the parameters, and perform inference on the latent variables only
[Graphical model: Φ and θ are marginalized out, leaving only Z]
41
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Marginalize out the parameters, and perform inference on the latent variables only
– Simpler, faster and fewer update equations
– Better mixing for Gibbs sampling
42
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Collapsed Gibbs sampler
[Equation build, highlighting in turn the word–topic counts, the document–topic counts, and the topic counts; see the conditional below]
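For reference, the standard collapsed Gibbs conditional from Griffiths and Steyvers (2004), whose three count statistics are the ones highlighted above (all counts exclude the current token; V is the vocabulary size):

\[ p(z_{ij}=k \mid \mathbf{z}^{\neg ij}, \mathbf{w}) \;\propto\; \frac{N^{\Phi,\neg ij}_{w_{ij},k} + \beta}{N^{Z,\neg ij}_{k} + V\beta}\,\bigl(N^{\Theta,\neg ij}_{jk} + \alpha\bigr) \]

Here N^Φ are the word–topic counts, N^Θ the document–topic counts, and N^Z the per-topic totals.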
46
Stochastic Optimization for ML
• Stochastic algorithms (see the sketch below)
– While (not converged)
• Process a subset of the dataset, to estimate the update
• Update parameters
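As a generic illustration of this loop (a sketch only; the data, loss, minibatch size and step-size schedule are placeholders), minibatch stochastic gradient descent for least squares:

```python
import numpy as np

# Generic stochastic optimization loop: each iteration processes a random
# subset of the data to estimate the update, then updates the parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                       # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=10_000)
w = np.zeros(5)                                        # parameters

for t in range(1, 1001):
    idx = rng.choice(len(X), size=64, replace=False)   # random minibatch
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # gradient estimate
    w -= (0.01 / np.sqrt(t)) * grad                    # update parameters
```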
47
Stochastic Optimization for ML
• Stochastic gradient descent
– Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
– Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
– Estimate E-step sufficient statistics
48
Goal: Build a Fast, Accurate, Scalable Algorithm for LDA
• Collapsed LDA
– Easy to implement
– Fast
– Accurate
– Mixes well / propagates information quickly
• Stochastic algorithms
– Scalable
– Quickly forgets random initialization
– Memory requirements, update time independent of size of data set
– Can estimate topics before a single pass of the data is complete
• Our contribution: an algorithm which gets the best of both worlds
49
Variational Bayesian Inference
• An optimization strategy for performing posterior inference, i.e. estimating Pr(Z|X)
[Diagram: variational distribution Q approximating the true posterior P by minimizing KL(Q || P)]
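The picture can be written as an equation: the log evidence splits into a tractable lower bound plus the KL gap, so maximizing the bound over Q is the same as minimizing KL(Q || P):

\[ \log p(X) \;=\; \mathbb{E}_{Q}\!\bigl[\log p(X,Z) - \log Q(Z)\bigr] \;+\; \mathrm{KL}\bigl(Q(Z)\,\big\|\,p(Z \mid X)\bigr) \]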
52
Collapsed Variational Bayes (Teh et al., 2007)
• K-dimensional discrete variational distributions for each token
• Mean field assumption
• Improved variational bound
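In the collapsed setting the variational distribution is over the topic assignments only, and the mean field assumption factorizes it across tokens; writing γ_ijk for the variational probability that word i of document j is assigned topic k:

\[ q(\mathbf{z}) \;=\; \prod_{j}\prod_{i} q(z_{ij}), \qquad q(z_{ij}=k) = \gamma_{ijk}, \qquad \sum_{k=1}^{K}\gamma_{ijk} = 1 \]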
55
Collapsed VB: Mean field assumption
Variational parameters (rows: topics, columns: words):
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.5    1    0       0.2
Dogs       0.33   0.3    0.5    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
56
Collapsed Variational Bayes (Teh et al., 2007)
• Collapsed Gibbs sampler
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0      1      1      1    0       0
Dogs       1      0      0      0    0       0
Jumping    0      0      0      0    1       1
57
Collapsed Variational Bayes (Teh et al., 2007)
• Collapsed Gibbs sampler
• CVB0 (Asuncion et al., 2009)
58
Collapsed Variational Bayes (Teh et al., 2007)
• CVB0 (Asuncion et al., 2009)
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.5    1    0       0.2
Dogs       0.33   0.3    0.5    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
60
Collapsed Variational Bayes (Teh et al., 2007)
• CVB0 (Asuncion et al., 2009)
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.9    1    0       0.2
Dogs       0.33   0.3    0.1    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
(the entries for “Brown” have been updated)
61
CVB0 Statistics
• Simple sums over the variational parameters
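Concretely, the statistics are soft versions of the Gibbs sampler's counts, and the CVB0 update of Asuncion et al. (2009) plugs them into the same form as the collapsed Gibbs conditional:

\[ N^{\Theta}_{jk} = \sum_{i}\gamma_{ijk}, \qquad N^{\Phi}_{wk} = \sum_{ij:\,w_{ij}=w}\gamma_{ijk}, \qquad N^{Z}_{k} = \sum_{ij}\gamma_{ijk} \]

\[ \gamma_{ijk} \;\propto\; \frac{N^{\Phi,\neg ij}_{w_{ij},k} + \beta}{N^{Z,\neg ij}_{k} + V\beta}\,\bigl(N^{\Theta,\neg ij}_{jk} + \alpha\bigr) \]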
62
Stochastic Optimization for ML
• Stochastic gradient descent
– Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
– Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
– Estimate E-step sufficient statistics
• Stochastic CVB0
– Estimate the CVB0 statistics
66
Estimating CVB0 Statistics
• Pick a random word i from a random document j
• An unbiased estimator is:
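A sketch of one such family of estimators (C is the total number of tokens in the corpus, C_j the number of tokens in document j): scale the sampled token's γ by the number of tokens it stands in for,

\[ \hat{N}^{\Theta}_{jk} = C_j\,\gamma_{ijk}, \qquad \hat{N}^{\Phi}_{wk} = C\,\gamma_{ijk}\,\mathbb{1}[w_{ij}=w], \qquad \hat{N}^{Z}_{k} = C\,\gamma_{ijk} \]

(the first for a token drawn uniformly within document j, the latter two for a token drawn uniformly from the whole corpus).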
67
Stochastic CVB0
• In an online algorithm, we cannot store the variational parameters
• But we can update them!
68
Stochastic CVB0
• Keep an online average of the CVB0 statistics
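The update is just an exponentially weighted online average; a toy, self-contained sketch of the idea (scalar statistic, placeholder step-size schedule):

```python
import numpy as np

# Toy online average: blend the running statistic with a noisy unbiased
# estimate at each step, using a decreasing step size.
rng = np.random.default_rng(0)
true_value, running = 3.0, 0.0
for t in range(1, 10_001):
    estimate = true_value + rng.normal()     # noisy, unbiased measurement
    rho = 1.0 / (t + 10.0)                   # step size
    running = (1.0 - rho) * running + rho * estimate
# `running` is now a close estimate of `true_value`.
```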
69
Extra Refinements
• Optional burn-in passes per document
• Minibatches
• Operating on sparse counts
70
Stochastic CVB0: Putting it all Together
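A condensed sketch of a single-token update under the pieces above (illustrative only, not the authors' reference implementation; minibatching, per-document burn-in passes, and the sparse-count tricks from the previous slide are omitted):

```python
import numpy as np

def scvb0_token_update(w, j, N_theta, N_phi, N_z, C, C_j, alpha, beta,
                       rho_theta, rho_phi):
    """One stochastic CVB0 step for a token with word id w in document j (sketch).

    N_theta: (D, K) document-topic statistics, N_phi: (V, K) word-topic statistics,
    N_z: (K,) topic totals, C: tokens in the corpus, C_j: tokens in document j.
    """
    V = N_phi.shape[0]
    # CVB0-style responsibility for this token, from the current statistics.
    gamma = (N_phi[w] + beta) / (N_z + V * beta) * (N_theta[j] + alpha)
    gamma /= gamma.sum()
    # Blend each running statistic toward its single-token unbiased estimate.
    N_theta[j] = (1.0 - rho_theta) * N_theta[j] + rho_theta * C_j * gamma
    N_phi *= (1.0 - rho_phi)              # (dense scaling; minibatching/sparse
    N_phi[w] += rho_phi * C * gamma       #  updates make this cheap in practice)
    N_z = (1.0 - rho_phi) * N_z + rho_phi * C * gamma
    return N_theta, N_phi, N_z
```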
71
Experimental Results – Large Scale
72
Experimental Results – Large Scale
73
Experimental Results – Small Scale
• Real-time or near real-time results are important for exploratory data analysis (EDA) applications
• Human participants were shown the top ten words from each topic
74
Experimental Results – Small Scale
[Bar chart: mean number of errors (y-axis, 0 to 4.5), comparing SCVB0 vs SVB on NIPS (5 seconds of training) and New York Times (60 seconds of training). Standard deviations: 1.1, 1.2, 1.0, 2.4.]
75
Convergence Analysis
• Theorem: with an appropriate sequence of step sizes, SCVB0 converges to a stationary point of the MAP objective, with adjusted hyper-parameters
76
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
E-step responsibilities
77
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
E-step:
Equivalent to the SCVB0 update, but with hyper-parameters adjusted by one
78
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
M-step:
E-step:
Synchronize parameters (estimated EM statistics) with the EM statistics
79
Convergence Analysis
• Step 2) Stochastic CVB0 is a Robbins–Monro stochastic approximation algorithm for finding the fixed points of this EM algorithm
– Goal: find the roots of a function
– Observe a noisy measurement of the function
– Move in the direction of the noisy measurement
– Here, the noisy measurement is of the step that the EM algorithm takes
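In symbols, a Robbins–Monro scheme for finding a root of g(x) from noisy measurements takes steps

\[ x_{t+1} \;=\; x_t + \rho_t\,\bigl(g(x_t) + \varepsilon_t\bigr), \qquad \sum_{t}\rho_t = \infty, \quad \sum_{t}\rho_t^2 < \infty \]

where the conditions on ρ_t are the "appropriate sequence of step sizes" in the theorem; here the quantity measured with noise is the step that the EM algorithm would take.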
84
Convergence Analysis
• Step 3) Show that the stochastic approximation algorithm converges
• A Lyapunov function is an “objective function” for an SA algorithm.
• The existence of such a function, with certain properties holding, is sufficient for convergence with an appropriate sequence of step sizes
• We show that (the negative of the Lagrangian of) the EM lower bound is such a Lyapunov function
87
Future work
• Exploit sparsity
• Parallelization
• Nonparametric extensions
• Generalizations to other models?
88
Probabilistic Soft Logic (Lise Getoor’s research group, see psl.cs.umd.edu)
• User-specified logical rules
• Probabilistic model
• Fast inference
• Applications: structured prediction, entity resolution, collective classification, link prediction, …
92
Publications from my Thesis Work
Algorithm papers
• J. R. Foulds, L. Boyles, C. DuBois, P. Smyth and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. KDD 2013.
• J. R. Foulds and P. Smyth. Annealing paths for the evaluation of topic models. UAI 2014.
Modeling papers
• J. R. Foulds and P. Smyth. Modeling scientific impact with topical influence regression. EMNLP 2013.
• J. R. Foulds, A. Asuncion, C. DuBois, C. T. Butts and P. Smyth. A dynamic relational infinite feature model for longitudinal social networks. AISTATS 2011.
93
Other publications
• C. DuBois, J. R. Foulds and P. Smyth. Latent set models for two-mode network data. ICWSM 2011.
• J. R. Foulds, N. Navaroli, P. Smyth and A. Ihler. Revisiting MAP estimation, message passing and perfect graphs. AISTATS 2011.
• J. R. Foulds and P. Smyth. Multi-instance mixture models and semi-supervised learning. SIAM SDM 2011.
• J. R. Foulds and E. Frank. Speeding up and boosting diverse density learning. Discovery Science, 2010.
• J. R. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1), 2010.
• J. R. Foulds and E. Frank. Revisiting multiple-instance learning via embedded instance selection. Australasian Joint Conference on Artificial Intelligence, 2008.
• J. R. Foulds and L. R. Foulds. A probabilistic dynamic programming model of rape seed harvesting. International Journal of Operational Research, 1(4), 2006.
• J. R. Foulds and L. R. Foulds. Bridge lane direction specification for sustainable traffic management. Asia-Pacific Journal of Operational Research, 23(2), 2006.
94
Thanks to my Collaborators
• My PhD advisor, Padhraic Smyth
• SCVB0 is also joint work with:
– Levi Boyles
– Chris DuBois
– Max Welling
95
Thank You!
Questions?