TRANSCRIPT
Integrating Topics and Syntax
Thomas L. Griffiths, Mark Steyvers, David M. Blei, Joshua B. Tenenbaum
Han Liu
Department of Computer Science
University of Illinois at [email protected]
April 12th, 2005
2005-4-12 Han Liu 2
Outline
• Motivations – Syntactic vs. semantic modeling
• Formalization – Notations and terminology
• Generative Models – pLSI; Latent Dirichlet Allocation
• Composite Models – HMMs + LDA
• Inference – MCMC (Metropolis; Gibbs Sampling)
• Experiments – Performance and evaluations
• Summary – Bayesian hierarchical models
• Discussions!
Motivations
• Statistical language modeling
  - syntactic dependencies: short-range
  - semantic dependencies: long-range
• Current models only consider one aspect
  - Hidden Markov Models (HMMs): syntactic modeling
  - Latent Dirichlet Allocation (LDA): semantic modeling
  - Probabilistic Latent Semantic Indexing (pLSI): semantic modeling
• A model that captures both kinds of dependencies may be more useful!
Problem Formalization
• Word
  - A word is an item from a vocabulary indexed by {1, …, V}, represented as a unit-basis vector: the vth word is a V-vector w in which only the vth element is 1 and all others are 0.
• Document
  - A document is a sequence of N words, denoted by w = {w1, w2, …, wN}, where wi is the ith word in the sequence.
• Corpus
  - A corpus is a collection of M documents, denoted by D = {w1, w2, …, wM}.
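These definitions can be sketched in a few lines of Python (the vocabulary size and indices below are illustrative):

```python
def one_hot(v, V):
    """Unit-basis V-vector for the v-th vocabulary item (0-indexed)."""
    w = [0] * V
    w[v] = 1
    return w

# A document is a sequence of N such word vectors,
# and a corpus is a collection of M documents.
document = [one_hot(i, 5) for i in (2, 0, 4)]
corpus = [document]
```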
Latent Semantic Structure

(Diagram: a latent structure $\ell$ generates the observed words $\mathbf{w}$.)

• Distribution over words: $P(\mathbf{w} \mid \ell)$
• Inferring latent structure: $P(\ell \mid \mathbf{w}) = \dfrac{P(\mathbf{w} \mid \ell)\,P(\ell)}{P(\mathbf{w})}$
• Prediction: $P(w_{n+1} \mid \mathbf{w})$
Probabilistic Generative Models
• Probabilistic Latent Semantic Indexing (pLSI)
  - Hofmann (1999), ACM SIGIR
  - probabilistic semantic model
• Latent Dirichlet Allocation (LDA)
  - Blei, Ng, & Jordan (2003), J. of Machine Learning Research
  - probabilistic semantic model
• Hidden Markov Models (HMMs)
  - Baum & Petrie (1966), Ann. Math. Stat.
  - probabilistic syntactic model
Dirichlet vs. Multinomial Distributions
• Dirichlet Distribution (conjugate prior)

$$p(\theta \mid \alpha) = \frac{\Gamma\!\left(\sum_{i=1}^{k}\alpha_i\right)}{\prod_{i=1}^{k}\Gamma(\alpha_i)} \prod_{i=1}^{k}\theta_i^{\alpha_i-1}, \qquad \sum_{i=1}^{k}\theta_i = 1$$

• Multinomial Distribution

$$p(X = x \mid \theta) = \frac{\left(\sum_{i=1}^{k} x_i\right)!}{\prod_{i=1}^{k} x_i!} \prod_{i=1}^{k}\theta_i^{x_i}, \qquad \sum_{i=1}^{k}\theta_i = 1$$
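A minimal sketch of how the two distributions pair up in sampling (function names are ours; only the standard library is used, with the Dirichlet draw built from normalized Gamma draws):

```python
import random

def sample_dirichlet(alpha):
    """theta ~ Dir(alpha), via normalized independent Gamma draws."""
    g = [random.gammavariate(a, 1.0) for a in alpha]
    total = sum(g)
    return [x / total for x in g]

def sample_multinomial(n, theta):
    """Counts x ~ Multinomial(n, theta)."""
    counts = [0] * len(theta)
    for _ in range(n):
        counts[random.choices(range(len(theta)), weights=theta)[0]] += 1
    return counts

theta = sample_dirichlet([2.0, 2.0, 2.0])   # a point on the simplex
counts = sample_multinomial(100, theta)      # word counts given theta
```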
Probabilistic LSI: Graphical Model

(Graphical model: a plate over the D documents, with the topic z as a latent variable between d and w; the document models a distribution over topics, and each of its N words is generated from the chosen topic.)

$$P(d, w_n) = P(d)\sum_{z} P(w_n \mid z)\,P(z \mid d)$$
Probabilistic LSI: Parameter Estimation
• The log-likelihood of Probabilistic LSI:

$$\mathcal{L} = \sum_{d}\sum_{w} n(d, w)\,\log P(d, w)$$

• EM algorithm
  - E-step: compute the posterior $P(z \mid d, w) \propto P(w \mid z)\,P(z \mid d)$
  - M-step: re-estimate $P(w \mid z)$ and $P(z \mid d)$ from the expected counts
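The EM updates can be sketched directly (a simplified illustration, not Hofmann's tempered version; `counts[d][w]` is an assumed term-document count matrix):

```python
import random

def normalize(v):
    s = sum(v)
    return [x / s for x in v]

def plsi_em(counts, K, iters=50):
    """EM for pLSI: returns the P(w|z) and P(z|d) tables."""
    D, W = len(counts), len(counts[0])
    p_w_z = [normalize([random.random() + 0.1 for _ in range(W)]) for _ in range(K)]
    p_z_d = [normalize([random.random() + 0.1 for _ in range(K)]) for _ in range(D)]
    for _ in range(iters):
        acc_wz = [[1e-12] * W for _ in range(K)]   # tiny floor avoids 0/0
        acc_zd = [[1e-12] * K for _ in range(D)]
        for d in range(D):
            for w in range(W):
                if counts[d][w] == 0:
                    continue
                # E-step: posterior P(z | d, w) ∝ P(w|z) P(z|d)
                post = normalize([p_w_z[z][w] * p_z_d[d][z] for z in range(K)])
                # M-step accumulators, weighted by the count n(d, w)
                for z in range(K):
                    acc_wz[z][w] += counts[d][w] * post[z]
                    acc_zd[d][z] += counts[d][w] * post[z]
        p_w_z = [normalize(row) for row in acc_wz]
        p_z_d = [normalize(row) for row in acc_zd]
    return p_w_z, p_z_d
```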
LDA: Graphical Model

(Graphical model: for each of the D documents, sample a distribution over topics; for each of its N words, sample a topic z from that distribution, then sample the word w from that topic; the T topic-word distributions are shared across documents.)
Latent Dirichlet Allocation
• A variant of LDA developed by Griffiths (2003)
  - choose N ~ Poisson(ξ)
  - sample θ ~ Dir(α)
  - sample φ ~ Dir(β)
  - sample z | θ ~ Multinomial(θ)
  - sample w | z, φ(z) ~ Multinomial(φ(z))
• Model Inference
  - all the Dirichlet priors are assumed to be symmetric
  - instead of variational inference and empirical Bayes parameter estimation, Gibbs Sampling is adopted
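The generative process above can be sketched as follows (the topic-word tables and parameter values are illustrative, and θ is drawn via normalized Gamma draws):

```python
import random

def lda_generate(n_words, alpha, phi):
    """Generate one document under the LDA generative process.

    alpha: Dirichlet parameters over the T topics
    phi:   phi[z][w] = P(word w | topic z)
    """
    # theta ~ Dir(alpha)
    g = [random.gammavariate(a, 1.0) for a in alpha]
    theta = [x / sum(g) for x in g]
    words, topics = [], []
    for _ in range(n_words):
        z = random.choices(range(len(theta)), weights=theta)[0]    # z ~ Mult(theta)
        w = random.choices(range(len(phi[z])), weights=phi[z])[0]  # w ~ Mult(phi(z))
        topics.append(z)
        words.append(w)
    return words, topics

phi = [[0.9, 0.1], [0.1, 0.9]]   # two topics over a 2-word vocabulary
words, topics = lda_generate(20, [1.0, 1.0], phi)
```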
The Composite Model
• An intuitive representation

(Diagram: three aligned rows — topic assignments z1 … zn, words w1 … wn, and syntactic states s1 … sn; the syntactic states form a Markov chain, and each word is emitted either by its topic or by its state.)

  - Semantic state: generate words from LDA
  - Syntactic states: generate words from HMMs
Composite Model: Graphical Model

(Graphical model: a document-level distribution over the T topics with topic-word distributions φ(z); in parallel, a Markov chain of classes c over the C classes with class-word distributions φ(c); plates over the N words and D documents.)
Composite Model
• All the Dirichlet priors are assumed to be symmetric
  - choose N ~ Poisson(ξ)
  - sample θ(d) ~ Dir(α)
  - sample φ(zi) ~ Dir(β)
  - sample φ(ci) ~ Dir(δ)
  - sample π(ci−1) ~ Dir(γ)
  - sample zi | θ(d) ~ Multinomial(θ(d))
  - sample ci | ci−1 ~ Multinomial(π(ci−1))
  - sample wi | zi, φ(zi) ~ Multinomial(φ(zi)) if ci = 1
  - sample wi | ci, φ(ci) ~ Multinomial(φ(ci)) otherwise
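A simplified sketch of this generative process, with the parameters passed in explicitly (class 1 is the designated semantic class, as above; all the tables below are illustrative):

```python
import random

def pick(weights):
    """Sample an index in proportion to the given weights."""
    return random.choices(range(len(weights)), weights=weights)[0]

def composite_generate(n_words, theta, phi_topic, phi_class, trans, c0=0):
    """One document from the HMM+LDA composite model.

    theta:     the document's distribution over topics
    phi_topic: phi_topic[z][w] = P(w | topic z)
    phi_class: phi_class[c][w] = P(w | class c)
    trans:     trans[c][c_next] = class transition probabilities
    """
    words, c = [], c0
    for _ in range(n_words):
        c = pick(trans[c])                 # c_i ~ Mult(pi(c_{i-1}))
        if c == 1:                         # semantic class: emit from the LDA side
            z = pick(theta)
            words.append(pick(phi_topic[z]))
        else:                              # syntactic class: emit from the HMM side
            words.append(pick(phi_class[c]))
    return words

trans = [[0.5, 0.5], [0.5, 0.5]]
doc = composite_generate(30, [1.0], [[0.2, 0.8]], [[0.7, 0.3], [0.5, 0.5]], trans)
```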
The Composite Model: Generative Process
Bayesian Inference
• The EM algorithm can be applied to the composite model
  - treating θ(d), φ(z), φ(c), and π(c) as parameters
  - with log P(w | θ(d), φ(z), φ(c), π(c)) as the likelihood
  - but there are too many parameters and convergence is too slow
  - the Dirichlet priors are necessary assumptions!
• Markov Chain Monte Carlo (MCMC)
  - instead of explicitly representing θ(d), φ(z), φ(c), and π(c), we consider the posterior distributions over the assignments of words to topics or classes, P(z | w) and P(c | w)
Markov Chain Monte Carlo
• Sampling the posterior distribution according to a Markov chain
  - an ergodic (irreducible & aperiodic) Markov chain converges to a unique equilibrium distribution π(x)
  - try to sample the parameters according to a Markov chain whose equilibrium distribution π(x) is exactly the posterior distribution p(x)
• The key task is to construct a suitable transition kernel T(x, x')
Metropolis-Hastings Algorithm
• Sampling by constructing a reversible Markov chain
  - a reversible Markov chain guarantees the condition for the equilibrium distribution π(x)
  - the simultaneous Metropolis-Hastings algorithm holds a similar idea to rejection sampling
Metropolis-Hastings Algorithm (cont.)
• Algorithm

  loop
      sample x' from Q(x(t), x');
      a = min{1, [π(x') Q(x', x(t))] / [π(x(t)) Q(x(t), x')]};
      r = U(0, 1);
      if a < r: reject, x(t+1) = x(t);
      else: accept, x(t+1) = x';
  end

• Metropolis intuition (symmetric Q): a proposal x* with higher density is always accepted (r = 1.0); otherwise it is accepted with probability r = p(x*)/p(x(t)).
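The loop translates almost line for line into code. A minimal random-walk Metropolis sketch (the Gaussian proposal is symmetric, so the Q terms cancel; the standard-normal target below is chosen only for illustration):

```python
import math
import random

def metropolis(log_p, x0, n_steps, step=1.0):
    """Random-walk Metropolis sampler for an unnormalized log-density log_p."""
    x, samples = x0, []
    for _ in range(n_steps):
        x_new = x + random.gauss(0.0, step)        # sample x' from Q(x, .)
        log_a = min(0.0, log_p(x_new) - log_p(x))  # log of a = min{1, p(x')/p(x)}
        if math.log(random.random()) < log_a:
            x = x_new                              # accept
        # else reject: keep the current state
        samples.append(x)
    return samples

# Target: standard normal, log p(x) = -x^2/2 up to a constant
samples = metropolis(lambda x: -0.5 * x * x, 0.0, 5000)
```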
Metropolis-Hastings Algorithm
• Why it works: the acceptance rule enforces detailed balance, π(x) T(x, x') = π(x') T(x', x), which makes π(x) the equilibrium distribution
• Single-site updating algorithm: update one variable at a time, holding all the others fixed
Gibbs Sampling
• A special case of the single-site updating Metropolis-Hastings algorithm: each proposal draws from the variable's full conditional distribution, so the acceptance probability is always 1
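As a concrete illustration (not the composite-model sampler itself), a Gibbs sampler for a standard bivariate normal with correlation ρ, where each full conditional is itself a normal:

```python
import math
import random

def gibbs_bivariate_normal(rho, n_steps):
    """Gibbs sampler for a standard bivariate normal with correlation rho.

    Full conditionals: x | y ~ N(rho * y, 1 - rho^2), and symmetrically for y | x.
    """
    x = y = 0.0
    sd = math.sqrt(1.0 - rho * rho)
    samples = []
    for _ in range(n_steps):
        x = random.gauss(rho * y, sd)   # draw x from its full conditional
        y = random.gauss(rho * x, sd)   # draw y from its full conditional
        samples.append((x, y))
    return samples

samples = gibbs_bivariate_normal(0.8, 5000)
```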
Gibbs Sampling for the Composite Model
• θ(d), φ(z), φ(c), and π(c) are all integrated out of the corresponding terms; the hyperparameters are sampled with a single-site Metropolis-Hastings algorithm
Experiments
• Corpora
  - Brown corpus: 500 documents, 1,137,466 word tokens
  - TASA corpus: 37,651 documents, 12,190,931 word tokens
  - NIPS corpus: 1,713 documents, 4,312,614 word tokens
  - W = 37,202 (Brown + TASA); W = 17,268 (NIPS)
• Experimental Design
  - one class for sentence start/end markers {., ?, !}
  - T = 200 & C = 20 (composite); C = 2 (LDA); T = 1 (HMMs)
  - 4,000 iterations, with 2,000 burn-in and a lag of 100
  - 1st-, 2nd-, and 3rd-order Markov chains are considered
Identifying Function and Content Words
Comparative Study on the NIPS Corpus (T = 100 & C = 50)
Identifying Function and Content Words (NIPS)
Marginal Probabilities
• Bayesian model comparison
  - P(w | M) is calculated using the harmonic mean of the likelihoods over the 2,000 post-burn-in iterations
  - used to evaluate the Bayes factors
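The harmonic-mean estimator can be sketched as below, working in log space with the log-sum-exp trick for numerical stability (it is a notoriously high-variance estimator, but it is what the slides describe; the sample values are illustrative):

```python
import math

def harmonic_mean_log_evidence(log_likelihoods):
    """Estimate log P(w | M) as the harmonic mean of per-sample likelihoods.

    1 / P(w|M) ≈ (1/S) * sum_s exp(-log L_s), computed in log space.
    """
    S = len(log_likelihoods)
    m = max(-ll for ll in log_likelihoods)
    log_sum = m + math.log(sum(math.exp(-ll - m) for ll in log_likelihoods))
    return -(log_sum - math.log(S))

# With identical samples the estimate reduces to the common value
log_evidence = harmonic_mean_log_evidence([-1050.0, -1048.5, -1052.3])
```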
Part-of-Speech Tagging
• Assessed performance on the Brown corpus
  - one tag set consisted of all Brown tags (297)
  - the other set collapsed the Brown tags into 10 designations
  - the 20th sample was used, evaluated with the Adjusted Rand Index
  - compared with DC on the 1,000 most frequent words with 19 clusters
Document Classification
• Evaluated with a Naïve Bayes classifier
  - the 500 Brown documents are classified into 15 groups
  - the topic vectors produced by LDA and by the composite model are used to train the Naïve Bayes classifier
  - 10-fold cross validation is used to evaluate the 20th sample
• Results (baseline accuracy: 0.09)
  - trained on Brown: LDA (0.51); 1st-order composite model (0.45)
  - Brown + TASA: LDA (0.54); 1st-order composite model (0.45)
  - explanation: only about 20% of words are allocated to the semantic component, too few to find correlations!
Summary
• Bayesian hierarchical models are natural for text modeling
• Simultaneously learning syntactic classes and semantic topics is possible through the combination of basic modules
• Discovering syntactic and semantic building blocks forms the basis of more sophisticated representations
• Similar ideas could be generalized to other areas
Discussions
• Gibbs Sampling vs. the EM algorithm?
• Hierarchical models reduce the number of parameters; what about model complexity?
• Equal priors for Bayesian model comparison?
• Is there really any effect of the 4 hyperparameters?
• Probabilistic LSI makes no normal-distribution assumption, while probabilistic PCA assumes normality!
• EM is sensitive to local maxima; why does the Bayesian approach get past them?
• Is the document-classification experiment a good evaluation?
• Majority vote for tagging?