Text, Topics, and Turkers: A Consensus Measure for Statistical Topics

Hypertext 2015

Fred Morstatter†, Jürgen Pfeffer‡, Katja Mayer*, Huan Liu†

†Arizona State University, Tempe, Arizona, USA

‡Carnegie Mellon University, Pittsburgh, Pennsylvania, USA

*University of Vienna, Vienna, Austria


Text

• Text is everywhere in research.
• Text is huge:

Source            Size
Wikipedia         36 million pages
World Wide Web    100+ billion static web pages
Social Media      500 million new tweets/day

• Too much data to read.
• How can we understand what is going on in big text data?


Topics

• Topic Modeling
• Latent Dirichlet Allocation (LDA)
  – Most commonly used topic modeling algorithm
  – Discovers “topics” within a corpus

Corpus → LDA (with K topics) → two outputs:

Topic ID    Words
Topic 1     cat, dog, horse, ...
Topic 2     ball, field, player, ...
...         ...
Topic K     red, green, blue, ...

             Topic 1    Topic 2    ...    Topic K
Document1    0.2        0.1        ...    0.01
Document2    0.7        0.02       ...    0.1
...
Documentn    0.1        0.3        ...    0.01


Topics

LDA with K = 10:

Topic ID    Words
Topic 1     river, lake, island, mountain, area, park, antarctic, south, mountains, dam
Topic 2     relay, athletics, metres, freestyle, hurdles, ret, divisão, athletes, bundesliga, medals
...         ...
Topic 10    courcelles, centimeters, mattythewhite, wine, stamps, oko, perennial, stubs, ovate, greyish

             Topic 1    Topic 2    ...    Topic 10
Document1    0.2        0.1        ...    0.01
Document2    0.7        0.02       ...    0.1
...



Topics

• How can we measure the quality of statistical topics?

• We don’t know how well humans can interpret topics.

• Problem: Does their understanding match what is going on in the corpus?


Turkers

• One Solution: Crowdsourcing
• Example: Amazon’s Mechanical Turk
  – Show LDA results to Turkers
  – Gauge their understanding
  – How to effectively measure understanding?


Turkers

• Previous Work: Chang et al. 2009
  – “Word Intrusion”
  – “Topic Intrusion”

[Figure: the LDA pipeline (Corpus → LDA with K topics → topic-word table and document-topic table), annotated with “Word Intrusion” and “Topic Intrusion”.]


Word Intrusion

• Show the Turker 6 words in random order
  – Top 5 words from topic
  – 1 “Intruded” word
  – Ask Turker to choose “Intruded” word

Topic i: cat dog bird truck horse snake

[Chang et al. 2009]
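A word-intrusion task of this shape could be assembled roughly as follows. This is a minimal sketch: the function name `make_word_intrusion_task`, the fixed seed, and the way the intruder word is supplied by the caller are illustrative assumptions, not details taken from Chang et al.

```python
import random

def make_word_intrusion_task(top_words, intruder, rng):
    """Return (shuffled_words, intruder_index) for one word-intrusion task.

    Hypothetical helper: the caller picks the intruder; how it is chosen
    in the original study is not specified on the slide.
    """
    if intruder in top_words:
        raise ValueError("intruder must not already belong to the topic")
    words = list(top_words[:5]) + [intruder]  # top 5 topic words + 1 intruder
    rng.shuffle(words)                        # show the 6 words in random order
    return words, words.index(intruder)

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
words, answer = make_word_intrusion_task(
    ["cat", "dog", "bird", "horse", "snake"], "truck", rng)
```

A Turker's response is then correct when the word they pick is `words[answer]`.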


Topic Intrusion

• Show the Turker a document
• Show the Turker 4 topics
  – 3 most probable topics
  – 1 “Intruded” topic
  – Ask Turker to choose “Intruded” topic

Document i: Topic A, Topic B, Topic C, Topic D

[Chang et al. 2009]
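The topic-intrusion setup can be sketched in the same way. The helper name `make_topic_intrusion_task` is hypothetical, and drawing the intruder uniformly from the remaining topics is an assumption made for this sketch; the slide only says that one "intruded" topic is added to the 3 most probable ones.

```python
import random

def make_topic_intrusion_task(doc_topic_probs, rng):
    """Return (shuffled_topic_ids, intruder_id) for one document.

    doc_topic_probs maps a topic id to that topic's probability in the document.
    """
    ranked = sorted(doc_topic_probs, key=doc_topic_probs.get, reverse=True)
    if len(ranked) < 4:
        raise ValueError("need at least 4 topics to build a task")
    top3 = ranked[:3]                  # 3 most probable topics for the document
    intruder = rng.choice(ranked[3:])  # assumption: uniform draw from the rest
    options = top3 + [intruder]
    rng.shuffle(options)               # hide which option is the intruder
    return options, intruder

rng = random.Random(0)  # fixed seed only so the sketch is reproducible
probs = {"A": 0.5, "B": 0.3, "C": 0.15, "D": 0.04, "E": 0.01}  # made-up values
options, intruder = make_topic_intrusion_task(probs, rng)
```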


New Measure: Topic Consensus

[Figure: the LDA pipeline (Corpus → LDA with K topics → topic-word table and document-topic table), annotated with “Word Intrusion”, “Topic Intrusion”, and the new “Topic Consensus”.]

• Complements existing framework
• Measures topic quality with respect to the corpus.


Topic Consensus: Intuition

• Measures the agreement between topics and “sections” they come from.

[Figure: LDA Distribution vs. Turker Distribution]


Topic Consensus: Calculation

• We are comparing probability distributions.
• Jensen-Shannon Divergence.

[Figure: Turker Distribution vs. LDA Distribution]
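For reference, the Jensen-Shannon divergence between two discrete distributions can be computed as in the sketch below. It uses the standard definition (the average KL divergence of each distribution from their midpoint). The base-2 logarithm, which bounds the value between 0 and 1, and the example numbers are assumptions; the slide does not say how the divergence is scaled into the final Topic Consensus score.

```python
import math

def kl_divergence(p, q):
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)

def jensen_shannon_divergence(p, q):
    """JSD(p, q) = 0.5 * KL(p || m) + 0.5 * KL(q || m), with m = (p + q) / 2."""
    if len(p) != len(q):
        raise ValueError("distributions must cover the same categories")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * kl_divergence(p, m) + 0.5 * kl_divergence(q, m)

# Made-up distributions over the same three categories, for illustration only.
lda_distribution = [0.7, 0.2, 0.1]
turker_distribution = [0.5, 0.25, 0.25]
divergence = jensen_shannon_divergence(lda_distribution, turker_distribution)
```

Identical distributions give 0, and distributions with no overlap give 1, so a lower value means closer agreement between the Turkers and LDA.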


Dataset

• Scientific Abstracts
• All available abstracts since 2007.
• Classified into three areas:
  – Social Sciences & Humanities (SH)
  – Life Sciences (LS)
  – Physical Sciences (PE)
• Ran LDA on this dataset:
  – K = [10, 25, 50, 100]
  – 185 topics; 4 topic sets.


Turkers

• One task:

• Turkers have 3 + 1 options.
• Each task solved 8 times.
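The 8 answers collected for one task can be turned into a Turker distribution by counting votes per option. The sketch below assumes the three options are the areas SH, LS, and PE, and that the "+ 1" is a "none of these" choice labelled `"N/A"` here; the slide does not say what the extra option is, and the responses are invented.

```python
from collections import Counter

# "N/A" is an assumed label for the "+ 1" option; the slide does not name it.
OPTIONS = ["SH", "LS", "PE", "N/A"]

def turker_distribution(responses, options=OPTIONS):
    """Convert a list of Turker answers into a probability distribution."""
    if not responses:
        raise ValueError("need at least one response")
    counts = Counter(responses)
    unknown = set(counts) - set(options)
    if unknown:
        raise ValueError(f"unexpected answers: {sorted(unknown)}")
    return [counts[option] / len(responses) for option in options]

# Eight invented answers for a single task.
responses = ["LS", "LS", "LS", "PE", "LS", "N/A", "LS", "LS"]
dist = turker_distribution(responses)  # [0.0, 0.75, 0.125, 0.125]
```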


Results

[Figure: results by Topic Set (ERC-10, ERC-25, ERC-50, ERC-100)]

Example topics:
new, group, results, plan, class, ...
selection, variation, population, genetic, natural, ...


Other Topic Sets

• LDA Topics
  – Use New York Times dataset from one day.
  – 25 topics, 1 topic set
• Hand-Picked Topics
  – Pure “Social Science & Humanities”
    • Sampled words that occur only in these documents.
    • 11 topics, 1 topic set
  – Random Topics
    • Randomly choose topics according to word distribution of corpus.
    • 25 topics, 1 topic set


Results

[Figure: results by Topic Set (ERC-10, ERC-25, ERC-50, ERC-100, NYT-25, RAND-25, SH-25)]


Overview of the Process

• Topic Consensus can reveal new information about the topics being studied.
  – Can measure topics from a new perspective.
  – Can help reveal topic confusion.
• Drawbacks:
  – Expensive
  – Time Consuming
  – Scalability


Automated Measures

1. Topic Size: Number of tokens assigned to the topic.

2. Topic Coherence: Probability that the top words co-occur in documents in the corpus.

3. Topic Coherence Significance: Significance of Topic Coherence compared to other topics.

4. Normalized Pointwise Mutual Information: Measures the association between the top words in the topics.
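As an illustration of the fourth measure, NPMI for a pair of words is the pointwise mutual information divided by the negative log of their joint probability, and a topic-level score can be taken as the average over all pairs of top words. The sketch below estimates probabilities from document-level co-occurrence, which is an assumption; the slide does not say which corpus or window was used. The function names and the toy documents are hypothetical.

```python
import math
from itertools import combinations

def npmi(w1, w2, docs):
    """NPMI of two words from document co-occurrence; value lies in [-1, 1].

    docs is a list of sets, each holding the distinct words of one document.
    """
    n = len(docs)
    p1 = sum(w1 in d for d in docs) / n
    p2 = sum(w2 in d for d in docs) / n
    p12 = sum(w1 in d and w2 in d for d in docs) / n
    if p12 == 0:
        return -1.0  # words never co-occur: limiting value of NPMI
    if p12 == 1:
        return 0.0   # both words in every document (convention for this sketch)
    return math.log(p12 / (p1 * p2)) / -math.log(p12)

def topic_npmi(top_words, docs):
    """Average NPMI over all pairs of a topic's top words (needs >= 2 words)."""
    pairs = list(combinations(top_words, 2))
    return sum(npmi(a, b, docs) for a, b in pairs) / len(pairs)

# Toy documents, for illustration only.
docs = [{"cat", "dog"}, {"cat", "dog"}, {"ball"}, {"field"}]
score = topic_npmi(["cat", "dog", "ball"], docs)
```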


Measures

• Herfindahl-Hirschman Index (HHI)
  – Measures concentration of a market.
  – Used to find monopolies.
  – Viewed from two perspectives:

5. Word Probability HHI: concentration over the topic's words ("vaccine", "disease", "cure", "medicine", ...)

6. ERC Section HHI: concentration over the sections (Social Sciences, Physical Sciences, Life Sciences)
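The HHI itself is the sum of squared shares. The sketch below applies that standard definition to a list of non-negative weights, which could be a topic's word probabilities or a topic's weight in each ERC section. Normalizing the inputs to shares first, and the example values, are choices made for this sketch rather than details from the slides.

```python
def hhi(values):
    """Herfindahl-Hirschman Index: the sum of squared shares.

    values are non-negative weights; they are normalized to shares first.
    The result is 1/len(values) for a uniform spread and 1.0 when a single
    entry holds everything.
    """
    total = sum(values)
    if total <= 0:
        raise ValueError("values must have a positive sum")
    return sum((v / total) ** 2 for v in values)

# Made-up weights of one topic across the three sections, for illustration only.
spread_evenly = hhi([1, 1, 1])        # close to 1/3: no section dominates
concentrated = hhi([0.9, 0.05, 0.05])  # close to 1: one section dominates
```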


Results - Correlation

Automated Measure                          Correlation
Topic Size                                 -0.532
Topic Coherence                            -0.584
Topic Coherence Significance               -0.788
Normalized Pointwise Mutual Information    -0.774
HHI (Word Probability)                     -0.885
HHI (ERC Section)                          -0.478


Results - Prediction

• Build a model to predict the actual Topic Consensus value.

• Build linear regression model:
  – Takes automated measures.
  – Predicts Topic Consensus.

• RMSE: 0.12 ± 0.02.
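A linear regression of this kind, together with the RMSE used to score it, can be sketched with ordinary least squares. Everything below is illustrative: the helper names are hypothetical, the solver is a plain normal-equations approach, and the data are synthetic. How the reported RMSE and its spread were obtained (for example, the train/test protocol) is not stated on the slide and is not reproduced here.

```python
import math

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("singular system")
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x

def fit_linear(features, targets):
    """Least-squares weights [intercept, w1, w2, ...] via the normal equations."""
    rows = [[1.0] + list(f) for f in features]
    k = len(rows[0])
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(k)] for i in range(k)]
    xty = [sum(r[i] * t for r, t in zip(rows, targets)) for i in range(k)]
    return solve(xtx, xty)

def predict(weights, f):
    return weights[0] + sum(w * x for w, x in zip(weights[1:], f))

def rmse(predicted, actual):
    """Root mean squared error between predictions and true values."""
    return math.sqrt(sum((p - a) ** 2 for p, a in zip(predicted, actual)) / len(actual))

# Synthetic numbers for illustration only; they are not results from the study.
features = [[0, 0], [1, 0], [0, 1], [1, 1], [2, 1]]
targets = [1 + 2 * a - 3 * b for a, b in features]
weights = fit_linear(features, targets)
error = rmse([predict(weights, f) for f in features], targets)
```

In practice each feature row would hold the automated measures for one topic and each target would be that topic's measured Topic Consensus.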


Acknowledgements

• Members of the DMML lab

• Office of Naval Research through grant N000141410095

• LexisNexis and HPCC Systems


Conclusion

• Introduced a new method for evaluating the interpretability of statistical topics.

• Demonstrated this measure on a real-world dataset.

• Automated this measure for scalability.


Future Work

• How sensitive are measures to top words?
  – Word Intrusion uses 5
  – Topic Intrusion uses 5
  – Topic Consensus uses 25

• How do measures fare on different datasets?

• Other measures that can reveal quality topics?


Auxiliary Slides


User Demographics

[Figure panels: Sex, Education, Age, First Language, Country of Origin]


Results – Confusion Matrix


Dataset Statistics