Text, Topics, and Turkers. Hypertext 2015 1
Text, Topics, and Turkers: A Consensus Measure for Statistical Topics
Fred Morstatter†, Jürgen Pfeffer‡, Katja Mayer*, Huan Liu†
†Arizona State University, Tempe, Arizona, USA
‡Carnegie Mellon University, Pittsburgh, Pennsylvania, USA
*University of Vienna, Vienna, Austria
Text
• Text is everywhere in research.
• Text is huge:
• Too much data to read.
• How can we understand what is going on in big text data?
Source            Size
Wikipedia         36 million pages
World Wide Web    100+ billion static web pages
Social Media      500 million new tweets/day
Topics
• Topic Modeling
• Latent Dirichlet Allocation (LDA)
– Most commonly used topic modeling algorithm
– Discovers “topics” within a corpus
Corpus → LDA (K topics) produces two outputs:
Topic ID    Words
Topic 1     cat, dog, horse, ...
Topic 2     ball, field, player, ...
...         ...
Topic K     red, green, blue, ...

              Topic 1    Topic 2    ...    Topic K
Document 1    0.2        0.1        ...    0.01
Document 2    0.7        0.02       ...    0.1
...
Document n    0.1        0.3        ...    0.01
Topics
Corpus → LDA (K = 10) produces:
Topic ID    Words
Topic 1     river, lake, island, mountain, area, park, antarctic, south, mountains, dam
Topic 2     relay, athletics, metres, freestyle, hurdles, ret, divisão, athletes, bundesliga, medals
...         ...
Topic 10    courcelles, centimeters, mattythewhite, wine, stamps, oko, perennial, stubs, ovate, greyish

              Topic 1    Topic 2    ...    Topic 10
Document 1    0.2        0.1        ...    0.01
Document 2    0.7        0.02       ...    0.1
...
Document n    0.1        0.3        ...    0.01
Topics
• How can we measure the quality of statistical topics?
• We don’t know how well humans can interpret topics.
• Problem: Does their understanding match what is going on in the corpus?
Turkers
• One Solution: Crowdsourcing
• Example: Amazon’s Mechanical Turk
– Show LDA results to Turkers
– Gauge their understanding
– How to effectively measure understanding?
Turkers
• Previous Work: Chang et al. 2009
– “Word Intrusion”
– “Topic Intrusion”
[Diagram: Corpus → LDA (K topics) → topic-word table, labeled “Word Intrusion”, and document-topic table, labeled “Topic Intrusion”]
Word Intrusion
• Show the Turker 6 words in random order
– Top 5 words from topic
– 1 “Intruded” word
– Ask Turker to choose “Intruded” word
Topic i: cat dog bird truck horse snake
[Chang et al. 2009]
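The word intrusion task described above can be sketched in a few lines of Python. This is a minimal illustration, not the procedure from Chang et al.: the function name `build_word_intrusion_task`, the toy topics, and the rule for picking the intruder (a top word of another topic that does not appear in this topic's list) are all assumptions made for the example.

```python
import random


def build_word_intrusion_task(topics, topic_id, rng, n_top=5):
    """Return (shown_words, intruder) for one topic.

    topics: dict mapping a topic id to its words, ranked by probability.
    """
    top_words = topics[topic_id][:n_top]
    own_words = set(topics[topic_id])
    # Assumption: an intruder is a top word of some other topic that is
    # absent from this topic's ranked word list.
    candidates = [
        word
        for other_id, words in topics.items()
        if other_id != topic_id
        for word in words[:n_top]
        if word not in own_words
    ]
    intruder = rng.choice(candidates)
    shown = top_words + [intruder]
    rng.shuffle(shown)  # the Turker sees the 6 words in random order
    return shown, intruder


# Toy topics, made up for illustration.
topics = {
    1: ["cat", "dog", "horse", "bird", "snake"],
    2: ["ball", "field", "player", "team", "goal"],
}
shown, intruder = build_word_intrusion_task(topics, 1, random.Random(0))
print(shown, "intruder:", intruder)
```

A Turker who picks `intruder` out of `shown` has found the word that does not belong, which suggests the topic is coherent to a human reader.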
Topic Intrusion
• Show the Turker a document
• Show the Turker 4 topics
– 3 most probable topics
– 1 “Intruded” topic
– Ask Turker to choose “Intruded” topic
Document i: Topic A, Topic B, Topic C, Topic D
[Chang et al. 2009]
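A topic intrusion task can be assembled from one row of the document-topic table. The sketch below is illustrative only: `build_topic_intrusion_task` is a hypothetical name, the probabilities are made up, and drawing the intruder uniformly from the remaining topics is an assumption rather than the exact rule used by Chang et al.

```python
import random


def build_topic_intrusion_task(doc_topic_probs, rng, n_top=3):
    """Return (shown_topic_ids, intruder_id) for one document.

    doc_topic_probs: list where entry k is the probability of topic k
    in the document.
    """
    ranked = sorted(
        range(len(doc_topic_probs)),
        key=lambda k: doc_topic_probs[k],
        reverse=True,
    )
    top = ranked[:n_top]
    # Assumption: the intruder is drawn uniformly from the remaining,
    # less probable topics.
    intruder = rng.choice(ranked[n_top:])
    shown = top + [intruder]
    rng.shuffle(shown)  # hide the intruder's position
    return shown, intruder


# Made-up topic probabilities for a single document.
probs = [0.40, 0.05, 0.30, 0.02, 0.20, 0.03]
shown, intruder = build_topic_intrusion_task(probs, random.Random(1))
print(shown, "intruder:", intruder)
```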
New Measure: Topic Consensus
• Complements existing framework
• Measures topic quality with respect to the corpus.
[Diagram: Corpus → LDA (K topics) → topic-word and document-topic tables, annotated with “Word Intrusion”, “Topic Intrusion”, and the new “Topic Consensus”]
Topic Consensus: Intuition
• Measures the agreement between topics and the “sections” they come from.
[Figure panels: LDA Distribution, Turker Distribution]
Topic Consensus: Calculation
• We are comparing probability distributions.
• Jensen-Shannon Divergence.
[Figure panels: Turker Distribution, LDA Distribution]
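For two distributions P and Q with mixture M = (P + Q) / 2, the Jensen-Shannon divergence is JSD(P, Q) = ½ KL(P ‖ M) + ½ KL(Q ‖ M). The sketch below computes it with base-2 logarithms, so the value lies between 0 (identical distributions) and 1 (no overlap). The example distributions are made up, and how the talk maps the divergence onto a final consensus score is not stated on the slide, so the code stops at the divergence itself.

```python
import math


def kl_divergence(p, q):
    """Kullback-Leibler divergence KL(p || q) in bits.

    Terms with p_i == 0 contribute nothing, by the usual convention.
    """
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p, q):
    """Jensen-Shannon divergence between two discrete distributions."""
    mixture = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, mixture) + 0.5 * kl_divergence(q, mixture)


# Made-up distributions over three sections.
lda_distribution = [0.70, 0.20, 0.10]
turker_distribution = [0.75, 0.125, 0.125]
print(js_divergence(turker_distribution, lda_distribution))
```

A small divergence means the Turkers' picture of the topic matches the one LDA derived from the corpus; a large divergence signals disagreement.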
Dataset
• Scientific Abstracts
• All available abstracts since 2007.
• Classified into three areas:
– Social Sciences & Humanities (SH)
– Life Sciences (LS)
– Physical Sciences (PE)
• Ran LDA on this dataset:
– K = [10, 25, 50, 100]
– 185 topics; 4 topic sets.
Turkers
• One task:
• Turkers have 3 + 1 options.
• Each task solved 8 times.
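Eight answers per task can be turned into a Turker distribution by counting votes. The sketch below assumes the "3 + 1" options are the three sections (SH, LS, PE) plus one extra choice, written here as "N/A"; that label, the function name `turker_distribution`, and the sample responses are illustrative assumptions, not taken from the slides.

```python
from collections import Counter

# Three sections plus one extra option; the "N/A" label is an assumption.
OPTIONS = ["SH", "LS", "PE", "N/A"]


def turker_distribution(responses, options=OPTIONS):
    """Convert the answers for one task into a probability distribution."""
    counts = Counter(responses)
    total = len(responses)
    return [counts[option] / total for option in options]


# Made-up answers from 8 Turkers for a single topic.
responses = ["LS", "LS", "LS", "PE", "LS", "N/A", "LS", "LS"]
print(turker_distribution(responses))  # [0.0, 0.75, 0.125, 0.125]
```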
Results
[Figure: results by Topic Set (ERC-10, ERC-25, ERC-50, ERC-100). Example topics called out: “new, group, results, plan, class, ...” and “selection, variation, population, genetic, natural, ...”]
Other Topic Sets
• LDA Topics
– Use New York Times dataset from one day.
25 topics, 1 topic set
• Hand-Picked Topics
– Pure “Social Science & Humanities”
• Sampled words that occur only in these documents.
11 topics, 1 topic set
– Random Topics
• Randomly choose topics according to word distribution of corpus.
25 topics, 1 topic set
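One way to read "randomly choose topics according to word distribution of corpus" is to sample each topic's words with probability proportional to their corpus frequency. The sketch below follows that reading; it is an interpretation, and the function name `random_topic`, the toy corpus, and the choice to sample without replacement are assumptions.

```python
import random
from collections import Counter


def random_topic(corpus_tokens, n_words, rng):
    """Sample n_words distinct words, weighted by corpus frequency."""
    counts = Counter(corpus_tokens)
    words = list(counts)
    weights = [counts[word] for word in words]
    topic = []
    # Weighted sampling without replacement: draw one word, remove it,
    # and repeat until the topic is full or the vocabulary runs out.
    while len(topic) < n_words and words:
        index = rng.choices(range(len(words)), weights=weights)[0]
        topic.append(words.pop(index))
        weights.pop(index)
    return topic


# Toy corpus, made up for illustration.
tokens = "the cat sat on the mat while the dog sat on the rug".split()
topic = random_topic(tokens, 5, random.Random(42))
print(topic)
```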
Results
[Figure: results by Topic Set (ERC-10, ERC-25, ERC-50, ERC-100, NYT-25, RAND-25, SH-25)]
Overview of the Process
• Topic Consensus can reveal new information about the topics being studied.
– Can measure topics from a new perspective.
– Can help reveal topic confusion.
• Drawbacks:
– Expensive
– Time Consuming
– Scalability
Automated Measures
1. Topic Size: Number of tokens assigned to the topic.
2. Topic Coherence: Probability that the top words co-occur in documents in the corpus.
3. Topic Coherence Significance: Significance of Topic Coherence compared to other topics.
4. Normalized Pointwise Mutual Information: Measures the association between the top words in the topics.
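As an illustration of measure 4, the sketch below computes NPMI for a pair of words from document co-occurrence, NPMI(a, b) = log(p(a, b) / (p(a) p(b))) / (−log p(a, b)), and averages it over all pairs of a topic's top words. Estimating probabilities from document-level co-occurrence, averaging over pairs, and the handling of the two edge cases are common conventions assumed here; the slides do not specify them. The documents are toy data.

```python
import math
from itertools import combinations


def npmi(word_a, word_b, documents):
    """NPMI of two words; documents is a list of sets of words."""
    n = len(documents)
    p_a = sum(word_a in doc for doc in documents) / n
    p_b = sum(word_b in doc for doc in documents) / n
    p_ab = sum(word_a in doc and word_b in doc for doc in documents) / n
    if p_ab == 0:
        return -1.0  # convention: the words never co-occur
    if p_ab == 1:
        return 0.0  # degenerate case (0/0): both words are in every document
    return math.log(p_ab / (p_a * p_b)) / -math.log(p_ab)


def topic_npmi(top_words, documents):
    """Average NPMI over all pairs of a topic's top words."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, documents) for a, b in pairs) / len(pairs)


# Toy documents, made up for illustration.
documents = [
    {"vaccine", "disease"},
    {"vaccine", "disease"},
    {"ball", "field"},
    {"ball", "player"},
]
print(npmi("vaccine", "disease", documents))
print(topic_npmi(["vaccine", "disease", "ball"], documents))
```

Values near 1 mean the words almost always appear together, values near −1 mean they almost never do.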
Measures
• Herfindahl-Hirschman Index (HHI)
– Measures concentration of a market.
– Used to find monopolies.
– Viewed from two perspectives:
5. Word Probability HHI: concentration over a topic's words ("vaccine", "disease", "cure", "medicine", ...)
6. ERC Section HHI: concentration over the sections (Social Sciences, Physical Sciences, Life Sciences)
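The HHI is the sum of squared shares: 1 when a single participant holds everything, and 1/n when n participants hold equal shares. A minimal sketch follows, applied to made-up word probabilities and section proportions; the input values and the normalization step are illustrative assumptions.

```python
def hhi(shares):
    """Herfindahl-Hirschman Index: the sum of squared, normalized shares."""
    total = sum(shares)
    return sum((share / total) ** 2 for share in shares)


# Word Probability HHI: is the topic's mass concentrated in a few words?
word_probabilities = [0.4, 0.3, 0.2, 0.1]  # made-up values
print(hhi(word_probabilities))

# Section HHI: is the topic concentrated in one section?
section_proportions = [0.8, 0.1, 0.1]  # made-up SH, LS, PE values
print(hhi(section_proportions))
```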
Results - Correlation
Automated Measure                          Correlation
Topic Size                                 -0.532
Topic Coherence                            -0.584
Topic Coherence Significance               -0.788
Normalized Pointwise Mutual Information    -0.774
HHI (Word Probability)                     -0.885
HHI (ERC Section)                          -0.478
Results - Prediction
• Build classifier to predict actual Topic Consensus value.
• Build linear regression model:
– Takes automated measures.
– Predicts Topic Consensus.
• RMSE: 0.12 ± 0.02.
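The prediction step can be illustrated with a deliberately simplified model: ordinary least squares on a single automated measure, scored by RMSE. The model in the talk takes several automated measures, and its training and evaluation setup is not described on the slide, so the single feature and the numbers below are made up for illustration.

```python
import math


def fit_line(xs, ys):
    """Least-squares fit of y = slope * x + intercept."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def rmse(y_true, y_pred):
    """Root mean squared error between true and predicted values."""
    squared = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared) / len(squared))


# Made-up data: one automated measure per topic and its consensus value.
measure = [0.1, 0.2, 0.3, 0.4]
consensus = [0.9, 0.7, 0.5, 0.3]
slope, intercept = fit_line(measure, consensus)
predicted = [slope * x + intercept for x in measure]
print(slope, intercept, rmse(consensus, predicted))
```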
Acknowledgements
• Members of the DMML lab
• Office of Naval Research through grant N000141410095
• LexisNexis and HPCC Systems
Conclusion
• Introduced a new method for evaluating the interpretability of statistical topics.
• Demonstrated this measure on a real-world dataset.
• Automated this measure for scalability.
Future Work
• How sensitive are measures to top words?
– Word Intrusion uses 5
– Topic Intrusion uses 5
– Topic Consensus uses 25
• How do measures fare on different datasets?
• Other measures that can reveal quality topics?
User Demographics
[Charts: Sex, Education, Age, First Language, Country of Origin]