
MICHAEL PAUL AND ROXANA GIRJU, UNIVERSITY OF ILLINOIS AT URBANA-CHAMPAIGN

A Two-Dimensional Topic-Aspect Model for Discovering Multi-Faceted Topics

Probabilistic Topic Models

Each word token is associated with a hidden “topic” variable

Probabilistic approach to dimensionality reduction

Useful for uncovering latent structures in text

Basic formulation: P(w|d) = Σ_z P(w|z) P(z|d)
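As a hedged, toy illustration (the numbers and topic names below are made up, not from the paper), the mixture can be computed directly:

```python
# Toy illustration of P(w|d) = sum_z P(w|z) P(z|d); all numbers are invented.
p_w_given_z = {              # P(w = "speech" | z) for two hypothetical topics
    "speech_topic": 0.08,
    "parsing_topic": 0.001,
}
p_z_given_d = {              # P(z | d) for one hypothetical document
    "speech_topic": 0.7,
    "parsing_topic": 0.3,
}
p_w_given_d = sum(p_w_given_z[z] * p_z_given_d[z] for z in p_z_given_d)
print(p_w_given_d)           # 0.08*0.7 + 0.001*0.3 = 0.0563
```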

Probabilistic Topic Models

“Topics” are latent distributions over words

A topic can be interpreted as a cluster of words

Topic models often cluster words by what people would consider topicality

There are often other dimensions in which words could be clustered: sentiment / perspective / theme

What if we want to model both?

Previous Work

Topic-Sentiment Mixture Model (Mei et al., 2007): words come from either a topic distribution or a sentiment distribution

Topic+Perspective Model (Lin et al., 2008): words are weighted as topical vs. ideological

Previous Work

Cross-Collection LDA (Paul and Girju, 2009): each document belongs to a collection; each topic has a word distribution shared among collections, plus distributions unique to each collection

What if the “collection” was a hidden variable? -> Topic-Aspect Model (TAM)

Topic-Aspect Model

Each document has a multinomial topic mixture and a multinomial aspect mixture

Words may depend on both!

Topic-Aspect Model

Topic and aspect mixtures are drawn independently of one another

This differs from hierarchical topic models, where one depends on the other

Can be thought of as two separate clustering dimensions

Topic-Aspect Model

Each word token also has 2 binary variables:

the “level” (background or topical) denotes if the word depends on the topic or not

the “route” (neutral or aspectual) denotes if the word depends on the aspect or not

A word may depend on a topic, an aspect, both, or neither

Topic-Aspect Model

A word may depend on a topic, an aspect, both, or neither

“Computational” Aspect

Route / Level   Background            Topical
Neutral         paper, new, present   speech, recognition
Aspectual       algorithm, model      markov, hmm, error

Topic-Aspect Model

A word may depend on a topic, an aspect, both, or neither

Route / Level   Background            Topical
Neutral         paper, new, present   speech, recognition
Aspectual       language, linguistic  prosody, intonation, tone

“Linguistic” Aspect

Topic-Aspect Model

A word may depend on a topic, an aspect, both, or neither

Route / Level   Background            Topical
Neutral         paper, new, present   communication, interaction
Aspectual       language, linguistic  conversation, social

“Linguistic” Aspect

Topic-Aspect Model

A word may depend on a topic, an aspect, both, or neither

Route / Level   Background            Topical
Neutral         paper, new, present   communication, interaction
Aspectual       algorithm, model      dialogue, system, user

“Computational” Aspect

Topic-Aspect Model

Generative process for each word in a document d:

Sample a topic z from P(z|d)
Sample an aspect y from P(y|d)
Sample a level l from P(l|d)
Sample a route x from P(x|l,z)
Sample a word w from one of: P(w|l=0,x=0), P(w|z,l=1,x=0), P(w|y,l=0,x=1), P(w|z,y,l=1,x=1)
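A minimal sketch of this generative story, assuming symmetric Dirichlet/Beta priors and toy sizes (the vocabulary size, numbers of topics and aspects, and hyperparameter values below are illustrative, not the authors' settings):

```python
import numpy as np

rng = np.random.default_rng(0)

V, K, A = 1000, 12, 2      # vocabulary size, topics, aspects (toy values)
alpha, beta = 0.1, 0.01    # symmetric Dirichlet hyperparameters (illustrative)

# Word distributions for the four cases: neither, topic-only, aspect-only, topic+aspect
phi_bg     = rng.dirichlet(beta * np.ones(V))               # P(w | l=0, x=0)
phi_topic  = rng.dirichlet(beta * np.ones(V), size=K)       # P(w | z, l=1, x=0)
phi_aspect = rng.dirichlet(beta * np.ones(V), size=A)       # P(w | y, l=0, x=1)
phi_ta     = rng.dirichlet(beta * np.ones(V), size=(K, A))  # P(w | z, y, l=1, x=1)

# Document-level mixtures and switch probabilities
theta   = rng.dirichlet(alpha * np.ones(K))   # P(z | d)
pi      = rng.dirichlet(alpha * np.ones(A))   # P(y | d)
p_level = rng.beta(1.0, 1.0)                  # P(l=1 | d)
p_route = rng.beta(1.0, 1.0, size=(2, K))     # P(x=1 | l, z)

def generate_word():
    z = rng.choice(K, p=theta)               # topic
    y = rng.choice(A, p=pi)                  # aspect
    l = rng.random() < p_level               # level: topical?
    x = rng.random() < p_route[int(l), z]    # route: aspectual?
    if not l and not x:
        return rng.choice(V, p=phi_bg), z, y, l, x
    if l and not x:
        return rng.choice(V, p=phi_topic[z]), z, y, l, x
    if not l and x:
        return rng.choice(V, p=phi_aspect[y]), z, y, l, x
    return rng.choice(V, p=phi_ta[z, y]), z, y, l, x

print(generate_word())
```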

Topic-Aspect Model

Distributions have Dirichlet/Beta priors (Latent Dirichlet Allocation framework)

The numbers of aspects and topics are user-supplied parameters

Straightforward inference with Gibbs sampling
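The paper's inference is collapsed Gibbs sampling over count tables; purely to illustrate the structure of one update, the sketch below resamples a single token's hidden variables (z, y, l, x) from their joint conditional given fixed parameter estimates (the arrays from the generative sketch above). A collapsed sampler would replace these fixed parameters with smoothed counts.

```python
import numpy as np

def resample_token(w, theta, pi, p_level, p_route, phi_bg, phi_topic, phi_aspect, phi_ta):
    """Sample (z, y, l, x) for word index w proportional to prior times word likelihood."""
    K, A = len(theta), len(pi)
    combos, weights = [], []
    for z in range(K):
        for y in range(A):
            for l in (0, 1):
                for x in (0, 1):
                    if not l and not x:
                        p_w = phi_bg[w]
                    elif l and not x:
                        p_w = phi_topic[z, w]
                    elif not l and x:
                        p_w = phi_aspect[y, w]
                    else:
                        p_w = phi_ta[z, y, w]
                    p_l = p_level if l else (1 - p_level)
                    p_x = p_route[l, z] if x else (1 - p_route[l, z])
                    combos.append((z, y, l, x))
                    weights.append(theta[z] * pi[y] * p_l * p_x * p_w)
    weights = np.asarray(weights)
    idx = np.random.default_rng().choice(len(combos), p=weights / weights.sum())
    return combos[idx]
```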

Topic-Aspect Model

Semi-supervised TAM when the aspect label is known

Two options:

Fix P(y|d)=1 for the correct aspect label and 0 otherwise (behaves like ccLDA (Paul and Girju, 2009))

Define a prior for P(y|d) to bias it toward the true label
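A brief sketch of both options, assuming aspects are indexed 0..A-1 and using illustrative hyperparameter values (not taken from the paper):

```python
import numpy as np

A = 2                       # number of aspects (toy value)
true_aspect = 1             # known label for this document (illustrative)

# Option 1: clamp the document's aspect mixture to the known label.
p_y_given_d = np.zeros(A)
p_y_given_d[true_aspect] = 1.0        # P(y|d) = 1 for the true label, 0 otherwise

# Option 2: bias P(y|d) toward the true label with an asymmetric Dirichlet prior.
base_alpha = 0.1                      # illustrative concentration for the other aspects
bias = 5.0                            # extra mass on the labeled aspect (illustrative)
alpha_y = np.full(A, base_alpha)
alpha_y[true_aspect] += bias
# During inference, P(y|d) would be estimated under this prior rather than fixed.
print(p_y_given_d, alpha_y)
```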

Experiments

Three Datasets:

4,247 abstracts from the ACL Anthology

2,173 abstracts from linguistics journals

594 articles from the Bitterlemons corpus (Lin et al., 2006), a collection of editorials on the Israeli/Palestinian conflict

Experiments

Example: Computational Linguistics

Experiments

Example: Israeli/Palestinian Conflict (results shown unsupervised and with a prior on P(aspect|d) for the true label)

Evaluation

Cluster coherence: “word intrusion” method (Chang et al., 2009)

5 human annotators

Compared against ccLDA and LDA

TAM clusters are as coherent as those of other established models
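A rough sketch of how a word-intrusion item can be assembled, in the spirit of Chang et al. (2009); the word lists and the intruder below are illustrative, not the actual evaluation items:

```python
import random

random.seed(0)

# Top words of the topic being evaluated (illustrative, not from the paper).
topic_top_words = ["speech", "recognition", "acoustic", "phoneme", "error"]
# Intruder: a word unlikely under this topic but prominent in another topic.
intruder = "parsing"

item = topic_top_words + [intruder]
random.shuffle(item)
print("Which word does not belong?", item)
# Annotators who reliably spot the intruder indicate a coherent topic.
```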

Evaluation

Document classification: classify Bitterlemons perspectives (Israeli vs. Palestinian)

Use TAM (2 aspects + 12 topics) output as input to an SVM

Use aspect mixtures and topic mixtures as features (see the sketch below)

Compare against LDA
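A minimal sketch of this classification setup, assuming the per-document aspect and topic mixtures have already been estimated by TAM; the arrays below are random placeholders standing in for real model output and labels:

```python
import numpy as np
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)

n_docs, n_aspects, n_topics = 594, 2, 12   # sizes from the Bitterlemons setup

# Placeholder mixtures standing in for TAM's estimated P(aspect|d) and P(topic|d).
aspect_mix = rng.dirichlet(np.ones(n_aspects), size=n_docs)
topic_mix  = rng.dirichlet(np.ones(n_topics), size=n_docs)
labels = rng.integers(0, 2, size=n_docs)   # 0 = Israeli, 1 = Palestinian (placeholder)

# Concatenate the 2-dim aspect mixture and 12-dim topic mixture per document.
X = np.hstack([aspect_mix, topic_mix])
scores = cross_val_score(LinearSVC(), X, labels, cv=5)
print("accuracy:", scores.mean())
```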

Evaluation

Document classification: the 2 aspects from TAM are much more strongly associated with the true perspectives than 2 topics from LDA

Suggests that TAM is clustering along a different dimension than LDA, by separating out another “topical” dimension (with 12 components)

Summary: Topic-Aspect Model

Can cluster along two independent dimensions

Words may be generated by both dimensions, thus clusters can be inter-related

Cluster definitions are arbitrary; their structure will depend on the data and the model parameterization (especially the number of aspects/topics)

Modeling with 2 aspects and many topics is shown to produce aspect clusters corresponding to document perspectives on certain corpora

References

Chang, J.; Boyd-Graber, J.; Gerrish, S.; Wang, C.; and Blei, D. 2009. Reading tea leaves: How humans interpret topic models. In Neural Information Processing Systems.

Lin, W.; Wilson, T.; Wiebe, J.; and Hauptmann, A. 2006. Which side are you on? Identifying perspectives at the document and sentence levels. In Proceedings of the Tenth Conference on Computational Natural Language Learning (CoNLL).

Lin, W.; Xing, E.; and Hauptmann, A. 2008. A joint topic and perspective model for ideological discourse. In ECML PKDD ’08: Proceedings of the European Conference on Machine Learning and Knowledge Discovery in Databases, Part II, 17–32. Berlin, Heidelberg: Springer-Verlag.

Mei, Q.; Ling, X.; Wondra, M.; Su, H.; and Zhai, C. 2007. Topic sentiment mixture: modeling facets and opinions in weblogs. In WWW ’07: Proceedings of the 16th international conference on World Wide Web, 171–180.

Paul, M., and Girju, R. 2009. Cross-cultural analysis of blogs and forums with mixed-collection topic models. In Proceedings of the 2009 Conference on Empirical Methods in Natural Language Processing, 1408–1417.