Fast and Accurate Inference for Topic Models
James Foulds, University of California, Santa Cruz
Presented at eBay Research Labs
2
Motivation
• There is an ever-increasing wealth of digital information available
– Wikipedia
– News articles
– Scientific articles
– Literature
– Debates
– Blogs, social media …
• We would like automatic methods to help us understand this content
3
Motivation
• Personalized recommender systems
• Social network analysis
• Exploratory tools for scientists
• The digital humanities
• …
4
The Digital Humanities
5
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
6
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]
7
Dimensionality reduction
The quick brown fox jumps over the sly lazy dog
[5 6 37 1 4 30 9 22 570 12]
Foxes Dogs Jumping
[40% 40% 20%]
8
Latent Variable Models
[Graphical model: parameters Φ, latent variables Z, observed data X, with a plate over data points]
• Dimensionality(X) >> dimensionality(Z)
• Z is a bottleneck, which finds a compressed, low-dimensional representation of X
Latent Feature Models for Social Networks
[Network diagram: Alice, Bob and Claire, each annotated with latent features]
Alice: Cycling, Fishing, Running
Bob: Running, Waltz
Claire: Tango, Salsa
Latent Feature Relational Model (Miller, Griffiths, Jordan, 2009)
[Network diagram: Alice, Bob and Claire with their latent features]
Z = binary entity-by-feature matrix:
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice      1        1        1
Bob                          1                     1
Claire                                1      1
14
Latent Representations
• Binary latent feature
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice      1        1        1
Bob                          1                     1
Claire                                1      1
• Latent class (each entity belongs to exactly one class)
Alice 1, Bob 1, Claire 1
• Mixed membership
        Cycling  Fishing  Running  Tango  Salsa  Waltz
Alice     0.2      0.4      0.4
Bob                         0.5                    0.5
Claire                                0.9    0.1
17
Latent Variable Models as Matrix Factorization
18
Latent Variable Models as Matrix Factorization
Latent Feature Relational Model (Miller, Griffiths, Jordan, 2009)
[Network diagram: Alice, Bob and Claire with their latent features; Z is the binary entity-by-feature matrix with rows Alice, Bob, Claire and columns Cycling, Fishing, Running, Tango, Salsa, Waltz]
E[Y] = σ(ZWZᵀ)
21
Topics
Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition
Distribution over all words in dictionary
A vector of discrete probabilities (sums to one)
22
Topics
Topic 1: Reinforcement learning
Topic 2: Learning algorithms
Topic 3: Character recognition
Top 10 words
24
Latent Dirichlet Allocation (Blei et al., 2003)
• For each document d
– Draw its topic proportion θ(d) ~ Dirichlet(α)
– For each word w_d,n
• Draw a topic assignment z_d,n ~ Discrete(θ(d))
• Draw a word from the chosen topic w_d,n ~ Discrete(φ(z_d,n))
[Plate diagram: α → θ(d) → z_d,n → w_d,n ← φ(k) ← β]
25
Latent Dirichlet Allocation (Blei et al., 2003)
• For each topic k
– Draw its distribution over words φ(k) ~ Dirichlet(β)
• For each word w_d,n
– Draw a topic assignment z_d,n ~ Discrete(θ(d))
– Draw a word from the chosen topic w_d,n ~ Discrete(φ(z_d,n))
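To make the generative story concrete, here is a minimal sketch in Python/NumPy (the topic count, vocabulary size, document count, lengths, and hyper-parameter values below are made-up illustration values, not from the talk):

```python
import numpy as np

# Minimal sketch of LDA's generative process; all sizes are illustrative.
K, V, D, N_d = 3, 1000, 5, 50            # topics, vocabulary, documents, words/doc
alpha, beta = 0.1, 0.01                   # Dirichlet hyper-parameters

rng = np.random.default_rng(0)
phi = rng.dirichlet(beta * np.ones(V), size=K)   # one word distribution per topic
docs = []
for d in range(D):
    theta = rng.dirichlet(alpha * np.ones(K))    # topic proportions for document d
    z = rng.choice(K, size=N_d, p=theta)         # topic assignment for each token
    w = np.array([rng.choice(V, p=phi[k]) for k in z])  # word drawn from its topic
    docs.append(w)
```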
31
LDA as Matrix Factorization
[Matrix diagram: the document–word matrix factorizes as θ × φᵀ]
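Written out, the factorization says each document's word distribution is a mixture of the topics; with Θ the document–topic proportions and Φ the topic–word distributions:

\[ p(w \mid d) \;=\; \sum_{k=1}^{K} \theta_{dk}\,\phi_{kw}, \qquad \text{i.e. the matrix of per-document word probabilities is } \Theta\,\Phi^{\top}. \]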
32
Let’s say we want to build an LDA topic model on Wikipedia
33
LDA on Wikipedia
[Plot: average log likelihood (y-axis, about −780 to −600) vs. wall-clock time in seconds (x-axis, log scale from 10² to 10⁵ s; markers at 10 mins, 1 hour, 6 hours, 12 hours). Curve shown: VB (10,000 documents).]
34
LDA on Wikipedia
[Same plot, now with two curves: VB (10,000 documents) and VB (100,000 documents).]
35
LDA on Wikipedia
[Same plot: VB (10,000 documents) and VB (100,000 documents), with the annotation: 1 full iteration = 3.5 days!]
36
LDA on Wikipedia
Stochastic variational inference
[Same plot, adding Stochastic VB (all documents) alongside VB (10,000 documents) and VB (100,000 documents).]
37
LDA on Wikipedia
Stochastic collapsed variational inference
[Same plot, adding SCVB0 (all documents) alongside Stochastic VB (all documents), VB (10,000 documents) and VB (100,000 documents).]
38
Available tools
             VB                            Collapsed Gibbs sampling                                    Collapsed VB
Batch        Blei et al. (2003)            Griffiths and Steyvers (2004)                               Teh et al. (2007), Asuncion et al. (2009)
Stochastic   Hoffman et al. (2010, 2013)   Mimno et al. (2012) (partially collapsed VB/Gibbs hybrid)   ???
40
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Marginalize out the parameters, and perform inference on the latent variables only
[Graphical model: Φ and θ are marginalized out, leaving only Z]
41
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Marginalize out the parameters, and perform inference on the latent variables only
– Simpler, faster and fewer update equations
– Better mixing for Gibbs sampling
42
Collapsed Inference for LDA (Griffiths and Steyvers, 2004)
• Collapsed Gibbs sampler
[Equation build, highlighting in turn the word–topic counts, the document–topic counts, and the topic counts; see the conditional below]
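For reference, the standard collapsed Gibbs conditional from Griffiths and Steyvers (2004), whose three count statistics are the ones highlighted above (all counts exclude the current token; V is the vocabulary size):

\[ p(z_{ij}=k \mid \mathbf{z}^{\neg ij}, \mathbf{w}) \;\propto\; \frac{N^{\Phi,\neg ij}_{w_{ij},k} + \beta}{N^{Z,\neg ij}_{k} + V\beta}\,\bigl(N^{\Theta,\neg ij}_{jk} + \alpha\bigr) \]

Here N^Φ are the word–topic counts, N^Θ the document–topic counts, and N^Z the per-topic totals.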
46
Stochastic Optimization for ML
• Stochastic algorithms (see the sketch below)
– While (not converged)
• Process a subset of the dataset, to estimate the update
• Update parameters
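As a generic illustration of this loop (a sketch only; the data, loss, minibatch size and step-size schedule are placeholders), minibatch stochastic gradient descent for least squares:

```python
import numpy as np

# Generic stochastic optimization loop: each iteration processes a random
# subset of the data to estimate the update, then updates the parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(10_000, 5))                       # toy inputs
y = X @ np.array([1.0, -2.0, 0.5, 0.0, 3.0]) + rng.normal(size=10_000)
w = np.zeros(5)                                        # parameters

for t in range(1, 1001):
    idx = rng.choice(len(X), size=64, replace=False)   # random minibatch
    grad = 2.0 * X[idx].T @ (X[idx] @ w - y[idx]) / len(idx)  # gradient estimate
    w -= (0.01 / np.sqrt(t)) * grad                    # update parameters
```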
47
Stochastic Optimization for ML
• Stochastic gradient descent
– Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
– Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
– Estimate E-step sufficient statistics
48
Goal: Build a Fast, Accurate, Scalable Algorithm for LDA
• Collapsed LDA
– Easy to implement
– Fast
– Accurate
– Mixes well / propagates information quickly
• Stochastic algorithms
– Scalable
– Quickly forgets random initialization
– Memory requirements, update time independent of size of data set
– Can estimate topics before a single pass of the data is complete
• Our contribution: an algorithm which gets the best of both worlds
49
Variational Bayesian Inference
• An optimization strategy for performing posterior inference, i.e. estimating Pr(Z|X)
[Diagram: variational distribution Q approximating the true posterior P by minimizing KL(Q || P)]
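The picture can be written as an equation: the log evidence splits into a tractable lower bound plus the KL gap, so maximizing the bound over Q is the same as minimizing KL(Q || P):

\[ \log p(X) \;=\; \mathbb{E}_{Q}\!\bigl[\log p(X,Z) - \log Q(Z)\bigr] \;+\; \mathrm{KL}\bigl(Q(Z)\,\big\|\,p(Z \mid X)\bigr) \]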
52
Collapsed Variational Bayes (Teh et al., 2007)
• K-dimensional discrete variational distributions for each token
• Mean field assumption
• Improved variational bound
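In the collapsed setting the variational distribution is over the topic assignments only, and the mean field assumption factorizes it across tokens; writing γ_ijk for the variational probability that word i of document j is assigned topic k:

\[ q(\mathbf{z}) \;=\; \prod_{j}\prod_{i} q(z_{ij}), \qquad q(z_{ij}=k) = \gamma_{ijk}, \qquad \sum_{k=1}^{K}\gamma_{ijk} = 1 \]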
55
Collapsed VB: Mean field assumption
Variational parameters (rows: topics, columns: words):
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.5    1    0       0.2
Dogs       0.33   0.3    0.5    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
56
Collapsed Variational Bayes (Teh et al., 2007)
• Collapsed Gibbs sampler
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0      1      1      1    0       0
Dogs       1      0      0      0    0       0
Jumping    0      0      0      0    1       1
57
Collapsed Variational Bayes (Teh et al., 2007)
• Collapsed Gibbs sampler
• CVB0 (Asuncion et al., 2009)
58
Collapsed Variational Bayes (Teh et al., 2007)
• CVB0 (Asuncion et al., 2009)
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.5    1    0       0.2
Dogs       0.33   0.3    0.5    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
60
Collapsed Variational Bayes (Teh et al., 2007)
• CVB0 (Asuncion et al., 2009)
           The    Quick  Brown  Fox  Jumped  Over
Foxes      0.33   0.5    0.9    1    0       0.2
Dogs       0.33   0.3    0.1    0    0       0.2
Jumping    0.33   0.2    0      0    1       0.6
(the entries for “Brown” have been updated)
61
CVB0 Statistics
• Simple sums over the variational parameters
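Concretely, the statistics are soft versions of the Gibbs sampler's counts, and the CVB0 update of Asuncion et al. (2009) plugs them into the same form as the collapsed Gibbs conditional:

\[ N^{\Theta}_{jk} = \sum_{i}\gamma_{ijk}, \qquad N^{\Phi}_{wk} = \sum_{ij:\,w_{ij}=w}\gamma_{ijk}, \qquad N^{Z}_{k} = \sum_{ij}\gamma_{ijk} \]

\[ \gamma_{ijk} \;\propto\; \frac{N^{\Phi,\neg ij}_{w_{ij},k} + \beta}{N^{Z,\neg ij}_{k} + V\beta}\,\bigl(N^{\Theta,\neg ij}_{jk} + \alpha\bigr) \]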
62
Stochastic Optimization for ML
• Stochastic gradient descent
– Estimate the gradient
• Stochastic variational inference (Hoffman et al., 2010, 2013)
– Estimate the natural gradient of the variational parameters
• Online EM (Cappé and Moulines, 2009)
– Estimate E-step sufficient statistics
• Stochastic CVB0
– Estimate the CVB0 statistics
66
Estimating CVB0 Statistics
• Pick a random word i from a random document j
• An unbiased estimator is:
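A sketch of one such family of estimators (C is the total number of tokens in the corpus, C_j the number of tokens in document j): scale the sampled token's γ by the number of tokens it stands in for,

\[ \hat{N}^{\Theta}_{jk} = C_j\,\gamma_{ijk}, \qquad \hat{N}^{\Phi}_{wk} = C\,\gamma_{ijk}\,\mathbb{1}[w_{ij}=w], \qquad \hat{N}^{Z}_{k} = C\,\gamma_{ijk} \]

(the first for a token drawn uniformly within document j, the latter two for a token drawn uniformly from the whole corpus).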
67
Stochastic CVB0
• In an online algorithm, we cannot store the variational parameters
• But we can update them!
68
Stochastic CVB0
• Keep an online average of the CVB0 statistics
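The update is just an exponentially weighted online average; a toy, self-contained sketch of the idea (scalar statistic, placeholder step-size schedule):

```python
import numpy as np

# Toy online average: blend the running statistic with a noisy unbiased
# estimate at each step, using a decreasing step size.
rng = np.random.default_rng(0)
true_value, running = 3.0, 0.0
for t in range(1, 10_001):
    estimate = true_value + rng.normal()     # noisy, unbiased measurement
    rho = 1.0 / (t + 10.0)                   # step size
    running = (1.0 - rho) * running + rho * estimate
# `running` is now a close estimate of `true_value`.
```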
69
Extra Refinements
• Optional burn-in passes per document
• Minibatches
• Operating on sparse counts
70
Stochastic CVB0: Putting it all Together
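A condensed sketch of a single-token update under the pieces above (illustrative only, not the authors' reference implementation; minibatching, per-document burn-in passes, and the sparse-count tricks from the previous slide are omitted):

```python
import numpy as np

def scvb0_token_update(w, j, N_theta, N_phi, N_z, C, C_j, alpha, beta,
                       rho_theta, rho_phi):
    """One stochastic CVB0 step for a token with word id w in document j (sketch).

    N_theta: (D, K) document-topic statistics, N_phi: (V, K) word-topic statistics,
    N_z: (K,) topic totals, C: tokens in the corpus, C_j: tokens in document j.
    """
    V = N_phi.shape[0]
    # CVB0-style responsibility for this token, from the current statistics.
    gamma = (N_phi[w] + beta) / (N_z + V * beta) * (N_theta[j] + alpha)
    gamma /= gamma.sum()
    # Blend each running statistic toward its single-token unbiased estimate.
    N_theta[j] = (1.0 - rho_theta) * N_theta[j] + rho_theta * C_j * gamma
    N_phi *= (1.0 - rho_phi)              # (dense scaling; minibatching/sparse
    N_phi[w] += rho_phi * C * gamma       #  updates make this cheap in practice)
    N_z = (1.0 - rho_phi) * N_z + rho_phi * C * gamma
    return N_theta, N_phi, N_z
```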
71
Experimental Results – Large Scale
72
Experimental Results – Large Scale
73
Experimental Results – Small Scale
• Real-time or near real-time results are important for exploratory data analysis (EDA) applications
• Human participants were shown the top ten words from each topic
74
Experimental Results – Small Scale
[Bar chart: mean number of errors (y-axis, 0 to 4.5), comparing SCVB0 vs SVB on NIPS (5 seconds of training) and New York Times (60 seconds of training). Standard deviations: 1.1, 1.2, 1.0, 2.4.]
75
Convergence Analysis
• Theorem: with an appropriate sequence of step sizes, SCVB0 converges to a stationary point of the MAP objective, with adjusted hyper-parameters
76
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
E-step responsibilities
77
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
E-step:
Equivalent to the SCVB0 update, but with hyper-parameters adjusted by one
78
Convergence Analysis
• Step 1) An alternative derivation of “batch SCVB0” as an EM algorithm for MAP
EM statistics:
M-step:
E-step:
Synchronize parameters (estimated EM statistics) with the EM statistics
79
Convergence Analysis
• Step 2) Stochastic CVB0 is a Robbins–Monro stochastic approximation algorithm for finding the fixed points of this EM algorithm
– Goal: find the roots of a function
– Observe a noisy measurement of the function
– Move in the direction of the noisy measurement
– Here, the noisy measurement is of the step that the EM algorithm takes
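In symbols, a Robbins–Monro scheme for finding a root of g(x) from noisy measurements takes steps

\[ x_{t+1} \;=\; x_t + \rho_t\,\bigl(g(x_t) + \varepsilon_t\bigr), \qquad \sum_{t}\rho_t = \infty, \quad \sum_{t}\rho_t^2 < \infty \]

where the conditions on ρ_t are the "appropriate sequence of step sizes" in the theorem; here the quantity measured with noise is the step that the EM algorithm would take.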
84
Convergence Analysis
• Step 3) Show that the stochastic approximation algorithm converges
• A Lyapunov function is an “objective function” for an SA algorithm.
• The existence of such a function, with certain properties holding, is sufficient for convergence with an appropriate sequence of step sizes
• We show that (the negative of the Lagrangian of) the EM lower bound is such a Lyapunov function
87
Future work
• Exploit sparsity
• Parallelization
• Nonparametric extensions
• Generalizations to other models?
88
Probabilistic Soft Logic (Lise Getoor’s research group, see psl.cs.umd.edu)
• User-specified logical rules
• Probabilistic model
• Fast inference
• Applications: structured prediction, entity resolution, collective classification, link prediction, …
92
Publications from my Thesis Work
Algorithm papers
• J. R. Foulds, L. Boyles, C. DuBois, P. Smyth and M. Welling. Stochastic collapsed variational Bayesian inference for latent Dirichlet allocation. KDD 2013.
• J. R. Foulds and P. Smyth. Annealing paths for the evaluation of topic models. UAI 2014.
Modeling papers
• J. R. Foulds and P. Smyth. Modeling scientific impact with topical influence regression. EMNLP 2013.
• J. R. Foulds, A. Asuncion, C. DuBois, C. T. Butts and P. Smyth. A dynamic relational infinite feature model for longitudinal social networks. AISTATS 2011.
93
Other publications
• C. DuBois, J. R. Foulds and P. Smyth. Latent set models for two-mode network data. ICWSM 2011.
• J. R. Foulds, N. Navaroli, P. Smyth and A. Ihler. Revisiting MAP estimation, message passing and perfect graphs. AISTATS 2011.
• J. R. Foulds and P. Smyth. Multi-instance mixture models and semi-supervised learning. SIAM SDM 2011.
• J. R. Foulds and E. Frank. Speeding up and boosting diverse density learning. Discovery Science, 2010.
• J. R. Foulds and E. Frank. A review of multi-instance learning assumptions. Knowledge Engineering Review, 25(1), 2010.
• J. R. Foulds and E. Frank. Revisiting multiple-instance learning via embedded instance selection. Australasian Joint Conference on Artificial Intelligence, 2008.
• J. R. Foulds and L. R. Foulds. A probabilistic dynamic programming model of rape seed harvesting. International Journal of Operational Research, 1(4), 2006.
• J. R. Foulds and L. R. Foulds. Bridge lane direction specification for sustainable traffic management. Asia-Pacific Journal of Operational Research, 23(2), 2006.
94
Thanks to my Collaborators
• My PhD advisor, Padhraic Smyth
• SCVB0 is also joint work with:
– Levi Boyles
– Chris DuBois
– Max Welling
95
Thank You!
Questions?