TRANSCRIPT
Integrating Topics and Syntax
Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Han Liu
Department of Computer Science
University of Illinois at [email protected]
April 12th, 2005
2005-4-12 Han Liu 2
Outline
• Motivations – Syntactic vs. semantic modeling
• Formalization – Notations and terminology
• Generative Models – pLSI; Latent Dirichlet Allocation
• Composite Models – HMMs + LDA
• Inference – MCMC (Metropolis; Gibbs Sampling)
• Experiments – Performance and evaluations
• Summary – Bayesian hierarchical models
• Discussions!
Motivations
• Statistical language modeling
  - syntactic dependencies: short-range
  - semantic dependencies: long-range
• Current models only consider one aspect
  - Hidden Markov Models (HMMs): syntactic modeling
  - Latent Dirichlet Allocation (LDA): semantic modeling
  - Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
• A model that captures both kinds of dependencies may be more useful!
Problem Formalization
• Word
  - A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector: the vth word is a V-vector w in which only the vth element is 1 and all others are 0.
• Document
  - A document is a sequence of N words, denoted by w = {w1, w2, …, wN}, where wi is the ith word in the sequence.
• Corpus
  - A corpus is a collection of M documents, denoted by D = {w1, w2, …, wM}.
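These definitions can be sketched in a few lines of Python (the vocabulary size and indices below are illustrative):

```python
def one_hot(v, V):
    """Unit-basis V-vector for the v-th vocabulary item (0-indexed)."""
    w = [0] * V
    w[v] = 1
    return w

# A document is a sequence of N such word vectors,
# and a corpus is a collection of M documents.
document = [one_hot(i, 5) for i in (2, 0, 4)]
corpus = [document]
```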
Latent Semantic Structure

(Diagram: a latent structure $\ell$ generates the observed words $\mathbf{w}$.)

• Distribution over words: $P(\mathbf{w} \mid \ell)$
• Inferring latent structure: $P(\ell \mid \mathbf{w}) = \dfrac{P(\mathbf{w} \mid \ell)\,P(\ell)}{P(\mathbf{w})}$
• Prediction: $P(w_{n+1} \mid \mathbf{w})$
Probabilistic Generative Models
• Probabilistic Latent Semantic Indexing (pLSI)
  - Hofmann (1999), ACM SIGIR
  - probabilistic semantic model
• Latent Dirichlet Allocation (LDA)
  - Blei, Ng, & Jordan (2003), J. of Machine Learning Research
  - probabilistic semantic model
• Hidden Markov Models (HMMs)
  - Baum & Petrie (1966), Ann. Math. Stat.
  - probabilistic syntactic model
Dirichlet vs. Multinomial Distributions
• Dirichlet Distribution (conjugate prior)

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\theta_i^{\alpha_i-1}, \qquad \sum_{i=1}^{k}\theta_i = 1$$

• Multinomial Distribution

$$p(X = x \mid \theta) = \frac{\left(\sum_{i=1}^{k} x_i\right)!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k}\theta_i^{x_i}, \qquad \sum_{i=1}^{k}\theta_i = 1$$
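A minimal sketch of how the two distributions pair up in sampling (function names are ours; only the standard library is used, with the Dirichlet draw built from normalized Gamma draws):

```python
import random

def sample_dirichlet(alpha):
    """theta ~ Dir(alpha), via normalized independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_multinomial(n, theta):
    """Counts x ~ Multinomial(n, theta)."""
    counts = [0] * len(theta)
    for _ in range(n):
        counts[random.choices(range(len(theta)), weights=theta)[0]] += 1
    return counts

theta = sample_dirichlet([2.0, 2.0, 2.0])   # a point on the simplex
counts = sample_multinomial(100, theta)      # word counts given theta
```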
Probabilistic LSI: Graphical Model

(Graphical model: a plate over the D documents, with the topic z as a latent variable between d and w; the document models a distribution over topics, and each of its N words is generated from the chosen topic.)

$$P(d, w_n) = P(d)\sum_{z} P(w_n \mid z)\,P(z \mid d)$$
Probabilistic LSI: Parameter Estimation
• The log-likelihood of Probabilistic LSI:

$$\mathcal{L} = \sum_{d}\sum_{w} n(d, w)\,\log P(d, w)$$

• EM algorithm
  - E-step: compute the posterior $P(z \mid d, w) \propto P(w \mid z)\,P(z \mid d)$
  - M-step: re-estimate $P(w \mid z)$ and $P(z \mid d)$ from the expected counts
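The EM updates can be sketched directly (a simplified illustration, not Hofmann's tempered version; `counts[d][w]` is an assumed term-document count matrix):

```python
import random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def plsi_em(counts, K, iters=50):
    """EM for pLSI: returns the P(w|z) and P(z|d) tables."""
    D, W = len(counts), len(counts[0])
    p_w_z = [normalize([random.random() + 0.1 for _ in range(W)]) for _ in range(K)]
    p_z_d = [normalize([random.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    for _ in range(iters):
        acc_wz = [[1e-12] * W for _ in range(K)]   # tiny floor avoids 0/0
        acc_zd = [[1e-12] * K for _ in range(D)]
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) ∝ P(w|z) P(z|d)
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
                # M-step accumulators, weighted by the count n(d, w)
                for z in range(K):
                    acc_wz[z][w] += counts[d][w] * post[z]
                    acc_zd[d][z] += counts[d][w] * post[z]
        p_w_z = [normalize(row) for row in acc_wz]
        p_z_d = [normalize(row) for row in acc_zd]
    return p_w_z, p_z_d
```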
LDA: Graphical Model

(Graphical model: for each of the D documents, sample a distribution over topics; for each of its N words, sample a topic z from that distribution, then sample the word w from that topic; the T topic-word distributions are shared across documents.)
Latent Dirichlet Allocation
• A variant of LDA developed by Griffiths (2003)
  - choose N ~ Poisson(ξ)
  - sample θ ~ Dir(α)
  - sample φ ~ Dir(β)
  - sample z | θ ~ Multinomial(θ)
  - sample w | z, φ(z) ~ Multinomial(φ(z))
• Model Inference
  - all the Dirichlet priors are assumed to be symmetric
  - instead of variational inference and empirical Bayes parameter estimation, Gibbs Sampling is adopted
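The generative process above can be sketched as follows (the topic-word tables and parameter values are illustrative, and θ is drawn via normalized Gamma draws):

```python
import random

def lda_generate(n_words, alpha, phi):
    """Generate one document under the LDA generative process.

    alpha: Dirichlet parameters over the T topics
    phi:   phi[z][w] = P(word w | topic z)
    """
    # theta ~ Dir(alpha)
    g = [random.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    words, topics = [], []
    for _ in range(n_words):
        z = random.choices(range(len(theta)), weights=theta)[0]    # z ~ Mult(theta)
        w = random.choices(range(len(phi[z])), weights=phi[z])[0]  # w ~ Mult(phi(z))
        topics.append(z)
        words.append(w)
    return words, topics

phi = [[0.9, 0.1], [0.1, 0.9]]   # two topics over a 2-word vocabulary
words, topics = lda_generate(20, [1.0, 1.0], phi)
```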
The Composite Model
• An intuitive representation

(Diagram: three aligned rows — topic assignments z1 … zn, words w1 … wn, and syntactic states s1 … sn; the syntactic states form a Markov chain, and each word is emitted either by its topic or by its state.)

  - Semantic state: generate words from LDA
  - Syntactic states: generate words from HMMs
Composite Model: Graphical Model

(Graphical model: a document-level distribution over the T topics with topic-word distributions φ(z); in parallel, a Markov chain of classes c over the C classes with class-word distributions φ(c); plates over the N words and D documents.)
Composite Model
• All the Dirichlet priors are assumed to be symmetric
  - choose N ~ Poisson(ξ)
  - sample θ(d) ~ Dir(α)
  - sample φ(zi) ~ Dir(β)
  - sample φ(ci) ~ Dir(δ)
  - sample π(ci−1) ~ Dir(γ)
  - sample zi | θ(d) ~ Multinomial(θ(d))
  - sample ci | ci−1 ~ Multinomial(π(ci−1))
  - sample wi | zi, φ(zi) ~ Multinomial(φ(zi)) if ci = 1
  - sample wi | ci, φ(ci) ~ Multinomial(φ(ci)) otherwise
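A simplified sketch of this generative process, with the parameters passed in explicitly (class 1 is the designated semantic class, as above; all the tables below are illustrative):

```python
import random

def pick(weights):
    """Sample an index in proportion to the given weights."""
    return random.choices(range(len(weights)), weights=weights)[0]

def composite_generate(n_words, theta, phi_topic, phi_class, trans, c0=0):
    """One document from the HMM+LDA composite model.

    theta:     the document's distribution over topics
    phi_topic: phi_topic[z][w] = P(w | topic z)
    phi_class: phi_class[c][w] = P(w | class c)
    trans:     trans[c][c_next] = class transition probabilities
    """
    words, c = [], c0
    for _ in range(n_words):
        c = pick(trans[c])                 # c_i ~ Mult(pi(c_{i-1}))
        if c == 1:                         # semantic class: emit from the LDA side
            z = pick(theta)
            words.append(pick(phi_topic[z]))
        else:                              # syntactic class: emit from the HMM side
            words.append(pick(phi_class[c]))
    return words

trans = [[0.5, 0.5], [0.5, 0.5]]
doc = composite_generate(30, [1.0], [[0.2, 0.8]], [[0.7, 0.3], [0.5, 0.5]], trans)
```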
The Composite Model: Generative Process
Bayesian Inference
• The EM algorithm can be applied to the composite model
  - treating θ(d), φ(z), φ(c), and π(c) as parameters
  - with log P(w | θ(d), φ(z), φ(c), π(c)) as the likelihood
  - but there are too many parameters and convergence is too slow
  - the Dirichlet priors are necessary assumptions!
• Markov Chain Monte Carlo (MCMC)
  - instead of explicitly representing θ(d), φ(z), φ(c), and π(c), we consider the posterior distributions over the assignments of words to topics or classes, P(z | w) and P(c | w)
Markov Chain Monte Carlo
• Sampling the posterior distribution according to a Markov chain
  - an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
  - try to sample the parameters according to a Markov chain whose equilibrium distribution π(x) is exactly the posterior distribution p(x)
• The key task is to construct a suitable transition kernel T(x, x')
Metropolis-Hastings Algorithm
• Sampling by constructing a reversible Markov chain
  - a reversible Markov chain guarantees the condition for the equilibrium distribution π(x)
  - the simultaneous Metropolis-Hastings algorithm holds a similar idea to rejection sampling
Metropolis-Hastings Algorithm (cont.)
• Algorithm

  loop
      sample x' from Q(x(t), x');
      a = min{1, [π(x') Q(x', x(t))] / [π(x(t)) Q(x(t), x')]};
      r = U(0, 1);
      if a < r: reject, x(t+1) = x(t);
      else: accept, x(t+1) = x';
  end

• Metropolis intuition (symmetric Q): a proposal x* with higher density is always accepted (r = 1.0); otherwise it is accepted with probability r = p(x*)/p(x(t)).
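The loop translates almost line for line into code. A minimal random-walk Metropolis sketch (the Gaussian proposal is symmetric, so the Q terms cancel; the standard-normal target below is chosen only for illustration):

```python
import math
import random

def metropolis(log_p, x0, n_steps, step=1.0):
    """Random-walk Metropolis sampler for an unnormalized log-density log_p."""
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + random.gauss(0.0, step)        # sample x' from Q(x, .)
        log_a = min(0.0, log_p(x_new) - log_p(x))  # log of a = min{1, p(x')/p(x)}
        if math.log(random.random()) < log_a:
            x = x_new                              # accept
        # else reject: keep the current state
        samples.append(x)
    return samples

# Target: standard normal, log p(x) = -x^2/2 up to a constant
samples = metropolis(lambda x: -0.5 * x * x, 0.0, 5000)
```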
Metropolis-Hastings Algorithm
• Why it works: the acceptance rule enforces detailed balance, π(x) T(x, x') = π(x') T(x', x), which makes π(x) the equilibrium distribution
• Single-site updating algorithm: update one variable at a time, holding all the others fixed
Gibbs Sampling
• A special case of the single-site updating Metropolis-Hastings algorithm: each proposal draws from the variable's full conditional distribution, so the acceptance probability is always 1
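As a concrete illustration (not the composite-model sampler itself), a Gibbs sampler for a standard bivariate normal with correlation ρ, where each full conditional is itself a normal:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    """
    x = y = 0.0
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)   # draw x from its full conditional
        y = random.gauss(rho * x, sd)   # draw y from its full conditional
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.8, 5000)
```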
Gibbs Sampling for the Composite Model
• θ(d), φ(z), φ(c), and π(c) are all integrated out of the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm
Experiments
• Corpora
  - Brown corpus: 500 documents, 1,137,466 word tokens
  - TASA corpus: 37,651 documents, 12,190,931 word tokens
  - NIPS corpus: 1,713 documents, 4,312,614 word tokens
  - W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
• Experimental Design
  - one class for sentence start/end markers {., ?, !}
  - T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMMs)
  - 4,000 iterations, with 2,000 burn-in and a lag of 100
  - 1st-, 2nd-, and 3rd-order Markov chains are considered
Identifying Function and Content Words
Comparative Study on the NIPS Corpus (T = 100 & C = 50)
Identifying Function and Content Words (NIPS)
Marginal Probabilities
• Bayesian model comparison
  - P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 post-burn-in iterations
  - used to evaluate the Bayes factors
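The harmonic-mean estimator can be sketched as below, working in log space with the log-sum-exp trick for numerical stability (it is a notoriously high-variance estimator, but it is what the slides describe; the sample values are illustrative):

```python
import math

def harmonic_mean_log_evidence(log_likelihoods):
    """Estimate log P(w | M) as the harmonic mean of per-sample likelihoods.

    1 / P(w|M) ≈ (1/S) * sum_s exp(-log L_s), computed in log space.
    """
    S = len(log_likelihoods)
    m = max(-ll for ll in log_likelihoods)
    log_sum = m + math.log(sum(math.exp(-ll - m) for ll in log_likelihoods))
    return -(log_sum - math.log(S))

# With identical samples the estimate reduces to the common value
log_evidence = harmonic_mean_log_evidence([-1050.0, -1048.5, -1052.3])
```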
Part-of-Speech Tagging
• Assessed performance on the Brown corpus
  - one tag set consisted of all Brown tags (297)
  - the other set collapsed the Brown tags into 10 designations
  - the 20th sample was used, evaluated with the Adjusted Rand Index
  - compared with DC on the 1,000 most frequent words with 19 clusters
Document Classification
• Evaluated with a Naïve Bayes classifier
  - the 500 Brown documents are classified into 15 groups
  - the topic vectors produced by LDA and by the composite model are used to train the Naïve Bayes classifier
  - 10-fold cross validation is used to evaluate the 20th sample
• Results (baseline accuracy: 0.09)
  - trained on Brown: LDA (0.51); 1st-order composite model (0.45)
  - Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
  - explanation: only about 20% of words are allocated to the semantic component, too few to find correlations!
Summary
• Bayesian hierarchical models are natural for text modeling
• Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules
• Discovering syntactic and semantic building blocks forms the basis of more sophisticated representations
• Similar ideas could be generalized to other areas
Discussions
• Gibbs Sampling vs. the EM algorithm?
• Hierarchical models reduce the number of parameters; what about model complexity?
• Equal priors for Bayesian model comparison?
• Is there really any effect of the 4 hyperparameters?
• Probabilistic LSI makes no normal-distribution assumption, while probabilistic PCA assumes normality!
• EM is sensitive to local maxima; why does the Bayesian approach get past them?
• Is the document-classification experiment a good evaluation?
• Majority vote for tagging?