-
Interactive, Multilingual Topic Models
Jordan Boyd-Graber, 2020
-
The Challenge of Big Data
Every second . . .
• 600 new blog posts appear
• 34,000 tweets are tweeted
• 30 GB of data uploaded to Facebook
Unstructured
No XML, no semantic web, no annotation. Often just raw text.
Common task: what's going on in this dataset?
• Intelligence analysts
• Brand monitoring
• Journalists
• Humanists
Common solution: topic models
-
Topic Models as a Black Box
From an input corpus and number of topics K → words to topics
Forget the Bootleg, Just Download the Movie Legally
Multiplex Heralded As Linchpin To Growth
The Shape of Cinema, Transformed At the Click of a Mouse
A Peaceful Crew Puts Muppets Where Its Mouth Is
Stock Trades: A Better Deal For Investors Isn't Simple
The three big Internet portals begin to distinguish among themselves as shopping malls
Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens
Corpus
-
Topic Models as a Black Box
From an input corpus and number of topics K → words to topics
TOPIC 1: computer, technology, system, service, site, phone, internet, machine
TOPIC 2: play, film, movie, theater, production, star, director, stage
TOPIC 3: sell, sale, store, product, business, advertising, market, consumer
-
Topic Models as a Black Box
From an input corpus and number of topics K → words to topics
Forget the Bootleg, Just Download the Movie Legally
Multiplex Heralded As Linchpin To Growth
The Shape of Cinema, Transformed At the Click of a Mouse
A Peaceful Crew Puts Muppets Where Its Mouth Is
Stock Trades: A Better Deal For Investors Isn't Simple
The three big Internet portals begin to distinguish among themselves as shopping malls
Red Light, Green Light: A 2-Tone L.E.D. to Simplify Screens
[Headlines color-coded by their mix of TOPIC 1, TOPIC 2, and TOPIC 3]
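To make the black box concrete, here is a toy collapsed Gibbs sampler for LDA: a corpus and a topic count K go in, a topic-word table comes out. This is a sketch, not the talk's implementation; the four-document corpus and the hyperparameters are invented for illustration.

```python
import random
from collections import Counter

corpus = [
    "movie film cinema director movie".split(),
    "film theater stage production".split(),
    "market stock sale consumer market".split(),
    "store product advertising consumer".split(),
]
K, alpha, beta, iters = 2, 0.1, 0.01, 200   # illustrative settings
vocab = sorted({w for doc in corpus for w in doc})
V = len(vocab)

random.seed(0)
# z[d][i]: topic assignment of token i in document d, initialized at random
z = [[random.randrange(K) for _ in doc] for doc in corpus]
ndk = [Counter(zs) for zs in z]              # document-topic counts
nkw = [Counter() for _ in range(K)]          # topic-word counts
for doc, zs in zip(corpus, z):
    for w, k in zip(doc, zs):
        nkw[k][w] += 1
nk = [sum(c.values()) for c in nkw]          # tokens per topic

for _ in range(iters):
    for d, doc in enumerate(corpus):
        for i, w in enumerate(doc):
            k = z[d][i]                      # remove this token's counts
            ndk[d][k] -= 1; nkw[k][w] -= 1; nk[k] -= 1
            weights = [(ndk[d][j] + alpha) * (nkw[j][w] + beta) / (nk[j] + V * beta)
                       for j in range(K)]
            k = random.choices(range(K), weights=weights)[0]  # resample topic
            z[d][i] = k
            ndk[d][k] += 1; nkw[k][w] += 1; nk[k] += 1

for k in range(K):                           # top words per topic
    print(k, [w for w, _ in nkw[k].most_common(3)])
```

With enough iterations the two topics tend to separate the movie vocabulary from the shopping vocabulary, mirroring the topics on the slide.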
-
Word Intrusion
1. Take the highest probability words from a topic
Original Topic
dog, cat, horse, pig, cow
2. Take a high-probability word from another topic and add it
Topic with Intruder
dog, cat, apple, horse, pig, cow
3. We ask users to find the word that doesn’t belong
Hypothesis
If the topics are interpretable, users will consistently choose the true intruder
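The three steps above can be sketched in a few lines; the example topics here are hypothetical ranked word lists, not topics from the talk's models.

```python
import random

topics = {
    "animals": ["dog", "cat", "horse", "pig", "cow", "sheep"],
    "fruit":   ["apple", "banana", "pear", "grape", "plum", "peach"],
}

def intrusion_question(topic, other, rng, n_top=5):
    """Top words from `topic` plus one high-probability intruder from `other`."""
    intruder = other[0]                 # a high-probability word from another topic
    words = topic[:n_top] + [intruder]
    rng.shuffle(words)                  # hide the intruder's position
    return words, intruder

rng = random.Random(0)
words, intruder = intrusion_question(topics["animals"], topics["fruit"], rng)
print(words, "intruder:", intruder)
```

A user shown `words` should pick `intruder` if (and only if) the original topic is coherent.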
-
Word Intrusion: Which Topics are Interpretable?
New York Times, 50 Topics
[Figure: distribution of model precision across the 50 topics, annotated with example topics such as "committee, legislation, proposal, republican, taxis"; "fireplace, garage, house, kitchen, list"; "americans, japanese, jewish, states, terrorist"; and "artist, exhibition, gallery, museum, painting"]
Model Precision: percentage of correct intruders found
-
Interpretability and Likelihood
[Figure: model precision and topic log odds (interpretability; better toward the top) plotted against predictive held-out log likelihood (better toward the right) on New York Times and Wikipedia, for CTM, LDA, and pLSI with 50, 100, and 150 topics]
Within a model, higher likelihood ≠ higher interpretability.
-
Interactive Topic Modeling
Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. Interactive Topic Modeling. Machine Learning, 2014.
-
Topic Before
1. election, yeltsin, russian, political, party, democratic, russia, president, democracy, boris, country, south, years, month, government, vote, since, leader, presidential, military
2. new, york, city, state, mayor, budget, giuliani, council, cuomo, gov, plan, year, rudolph, dinkins, lead, need, governor, legislature, pataki, david
3. nuclear, arms, weapon, defense, treaty, missile, world, unite, yet, soviet, lead, secretary, would, control, korea, intelligence, test, nation, country, testing
4. president, bush, administration, clinton, american, force, reagan, war, unite, lead, economic, iraq, congress, america, iraqi, policy, aid, international, military, see
...
20. soviet, lead, gorbachev, union, west, mikhail, reform, change, europe, leaders, poland, communist, know, old, right, human, washington, western, bring, party
Suggestion
boris, communist, gorbachev, mikhail, russia, russian, soviet, union, yeltsin
-
Topic After
1. election, democratic, south, country, president, party, africa, lead, even, democracy, leader, presidential, week, politics, minister, percent, voter, last, month, years
2. new, york, city, state, mayor, budget, council, giuliani, gov, cuomo, year, rudolph, dinkins, legislature, plan, david, governor, pataki, need, cut
3. nuclear, arms, weapon, treaty, defense, war, missile, may, come, test, american, world, would, need, lead, get, join, yet, clinton, nation
4. president, administration, bush, clinton, war, unite, force, reagan, american, america, make, nation, military, iraq, iraqi, troops, international, country, yesterday, plan
...
20. soviet, union, economic, reform, yeltsin, russian, lead, russia, gorbachev, leaders, west, president, boris, moscow, europe, poland, mikhail, communist, power, relations
-
Example: Negative Constraint
Topic 318 (before)
bladder, sci, spinal cord, spinal cord injury, spinal, urinary, urinary tract, urothelial, injury, motor, recovery, reflex, cervical, urothelium, functional recovery
Negative Constraint
spinal cord, bladder
Topic 318 (after)
sci, spinal cord, spinal cord injury, spinal, injury, recovery, motor, reflex, urothelial, injured, functional recovery, plasticity, locomotor, cervical, locomotion
-
Multilingual Anchoring: Interactive Topic Modeling and Alignment Across Languages
Michelle Yuan, Benjamin Van Durme, and Jordan Boyd-Graber. Neural Information Processing Systems, 2018.
-
(Source: National Geographic)
• Large text collections often require quick topic triage in low-resource settings (e.g., natural disasters, political instability).
• Analysts need to examine multilingual text collections, but analysts fluent in one or more of the languages are scarce.
-
Generative Approaches
• Polylingual Topic Model [Mimno et al. 2009]
• JointLDA [Jagarlamudi and Daumé 2010]
• Polylingual tree-based topic model [Hu et al. 2014b]
• MCTA [Shi et al. 2016]
These methods are slow, assume extensive knowledge about the languages, and preclude human refinement.
-
Coral reefs have been damaged by
sources of pollution, such as coastal
development, deforestation, and
agriculture. Destruction of coral reefs
could impact food supply, protection,
and income …
farming, livestock, environment,
-
Anchor words
Definition
An anchor word is a word that appears with high probability in one topic but with low probability in all other topics.
-
From Co-occurrence to Topics
• Normally, we want to find p(word | topic) [Blei et al. 2003b].
• Instead, what if we can easily find p(word | topic) using anchor words and conditional word co-occurrence p(word2 | word1)?
-
From Co-occurrence to Topics
Q̄_{i,j} = p(w2 = j | w1 = i)
[Figure: rows of Q̄ as points in a simplex; "liner" lies inside the triangle with vertices "carburetor", "concealer", and "album", with coefficients C1, C2, C3]
Q̄_liner ≈ C1 Q̄_carburetor + C2 Q̄_concealer + C3 Q̄_album
         = 0.4 * [0.3, ..., 0.1] + 0.2 * [0.1, ..., 0.2] + 0.4 * [0.1, ..., 0.4]
-
Anchoring
• If an anchor word appears in a document, then its corresponding topic is among the set of topics used to generate the document [Arora et al. 2012].
• The anchoring algorithm uses word co-occurrence to find anchors and gradient-based inference to recover the topic-word distribution [Arora et al. 2013].
• Runtime is fast because the algorithm scales with the number of unique word types, rather than the number of documents or tokens.
-
Anchoring
1. Construct the co-occurrence matrix from documents with vocabulary of size V:
   Q̄_{i,j} = p(w2 = j | w1 = i).
2. Given anchor words s_1, ..., s_K, approximate each co-occurrence distribution:
   Q̄_i ≈ Σ_{k=1}^K C_{i,k} Q̄_{s_k}   subject to   Σ_{k=1}^K C_{i,k} = 1 and C_{i,k} ≥ 0.
3. Find the topic-word matrix:
   A_{i,k} = p(w = i | z = k) ∝ p(z = k | w = i) p(w = i) = C_{i,k} Σ_{j=1}^V Q̄_{i,j}.
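The three steps can be sketched compactly with NumPy. This is a simplified version under stated assumptions: a toy word-id corpus, given anchors, and a plain exponentiated-gradient solver rather than the tuned optimizer of Arora et al.

```python
import numpy as np

def cooccurrence(docs, V):
    """Step 1: conditional co-occurrence matrix and the marginal p(w)."""
    Q = np.zeros((V, V))
    for doc in docs:
        for a in doc:
            for b in doc:
                if a != b:
                    Q[a, b] += 1
    Q += 1e-9                                    # smooth so every row is a distribution
    pw = Q.sum(axis=1) / Q.sum()                 # p(w = i), from the joint counts
    return Q / Q.sum(axis=1, keepdims=True), pw  # Q̄_ij = p(w2 = j | w1 = i)

def recover_topics(Qbar, pw, anchors, steps=500, lr=1.0):
    """Steps 2-3: simplex-constrained coefficients C, then A = p(w | z)."""
    V, K = Qbar.shape[0], len(anchors)
    Qs = Qbar[anchors]                           # anchor rows Q̄_{s_k}
    C = np.full((V, K), 1.0 / K)
    for _ in range(steps):                       # exponentiated gradient on
        grad = (C @ Qs - Qbar) @ Qs.T            # ||Q̄_i - Σ_k C_ik Q̄_{s_k}||²
        C *= np.exp(-lr * grad)
        C /= C.sum(axis=1, keepdims=True)        # keeps each C_i on the simplex
    A = C * pw[:, None]                          # p(z = k | w = i) p(w = i)
    return A / A.sum(axis=0, keepdims=True)      # column-normalize to p(w = i | z = k)

docs = [[0, 1, 0, 1], [2, 3, 2, 3], [0, 1, 2, 3]]  # toy corpus of word ids, V = 4
Qbar, pw = cooccurrence(docs, 4)
A = recover_topics(Qbar, pw, anchors=[0, 2])
print(A.round(2))
```

Each column of `A` is a topic: a distribution over the V word types, anchored by one of the chosen words.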
-
Finding Anchor Words
• So far, we have assumed that anchor words are given.
• How do we find anchor words from documents?
-
Finding Anchor Words
[Figure: co-occurrence points for "concealer", "carburetor", "album", "bourgeoisie", and "forest"; the extreme points form the convex hull]
Anchor words are the vertices of the co-occurrence convex hull.
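A greedy way to find those vertices is to repeatedly take the row of Q̄ farthest from the span of the rows chosen so far. The sketch below is a simplified version of the Gram-Schmidt-style procedure used in practice; the random Q̄ is a stand-in for a real co-occurrence matrix.

```python
import numpy as np

def find_anchors(Qbar, K):
    """Greedily pick K rows of Qbar that are extreme points of the cloud."""
    Q = Qbar - Qbar.mean(axis=0)                 # center the point cloud
    anchors = [int(np.argmax(np.linalg.norm(Q, axis=1)))]  # farthest from center
    basis = []
    for _ in range(K - 1):
        v = Q[anchors[-1]].copy()
        for b in basis:                          # Gram-Schmidt: drop spanned parts
            v -= (v @ b) * b
        basis.append(v / np.linalg.norm(v))
        R = Q.copy()
        for b in basis:                          # residual distance from the span
            R -= np.outer(R @ b, b)
        anchors.append(int(np.argmax(np.linalg.norm(R, axis=1))))
    return anchors

rng = np.random.default_rng(0)
Qbar = rng.random((8, 5))
Qbar /= Qbar.sum(axis=1, keepdims=True)          # rows are distributions
print(find_anchors(Qbar, 3))
```

In a real system the rows would first be projected to lower dimension for speed; here they are used as-is.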
-
Issues with Topic Models
Topics
music concert singer voice chorus songs album
singer pop songs music album chorale jazz
cosmetics makeup eyeliner lipstick foundation primer eyeshadow
Duplicate topics.
-
Issues with Topic Models
Topics
music band art history literature books earth
bts taehyung idol kpop jin jungkook jimin
Ambiguous topics. Overly specific topics.
-
Interactive Anchoring
• Incorporating interactivity into topic modeling has been shown to improve model quality [Hu et al. 2014a].
• The anchoring algorithm offers the speed needed for interactive work, but single anchors are unintuitive to users.
• Ankura is an interactive topic modeling system that allows users to choose multiple anchors for each topic [Lund et al. 2017].
• After receiving human feedback, Ankura takes only a few seconds to update the topic model.
These methods only work for monolingual document collections.
-
Linking Words
Definition
A language L is a set of word types w.
Definition
A bilingual dictionary B is a subset of the Cartesian product L(1) × L(2), where L(1) and L(2) are two different languages.
Idea: If dictionary B contains an entry (w, v), create a link between w and v.
-
Finding Multilingual Anchors
[Figure: English and Chinese co-occurrence hulls side by side, with candidate anchors "concealer", "forest", "carburetor", "cosmopolitan", "album", and "bourgeoisie"]
Bourgeoisie does not have a Chinese translation, so it cannot be picked as an anchor word even if it is the word farthest from the convex hull.
Forest and its translation are not the farthest points from their respective convex hulls, but neither is too close, so they are chosen as the next anchor words.
-
Multilingual Anchoring
1. Given a dictionary, create links between words that are translations of each other.
2. Select an anchor word for each language such that the words are linked and the span of the anchor words is maximized.
3. Once anchor words are found, separately find topic-word distributions for each language.
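Step 2 can be sketched as follows, assuming each word already carries a "distance from the current convex hull" score in its language; the words, scores, dictionary, and the max-min selection criterion are all illustrative assumptions, not the paper's exact objective.

```python
# Hypothetical per-language hull-distance scores (larger = expands the hull more).
dist_en = {"concealer": 0.9, "bourgeoisie": 0.8, "forest": 0.6, "album": 0.3}
dist_zh = {"森林": 0.5, "專輯": 0.4, "遮瑕膏": 0.7}
dictionary = {"forest": "森林", "album": "專輯", "concealer": "遮瑕膏"}

def next_anchor_pair(dist_en, dist_zh, dictionary):
    """Pick the linked (en, zh) pair that expands both hulls:
    maximize the smaller of the two distances."""
    linked = [(en, zh) for en, zh in dictionary.items() if zh in dist_zh]
    return max(linked, key=lambda p: min(dist_en[p[0]], dist_zh[p[1]]))

print(next_anchor_pair(dist_en, dist_zh, dictionary))
# "bourgeoisie" is skipped: it has no dictionary link, however far out it sits.
```

Repeating this, with distances recomputed after each choice, yields one linked anchor pair per topic.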
-
• What if dictionary entries are scarce or inaccurate?
• What if topics aren't aligned properly across languages?
Incorporate human-in-the-loop topic modeling tools.
-
MTAnchor
-
Experiments
Datasets:
1. Wikipedia articles (en, zh)
2. Amazon reviews (en, zh)
3. LORELEI documents (en, si)
-
Experiments
Metrics:
1. Classification accuracy
   • Intra-lingual: train the topic model on documents in one language and test on other documents in the same language.
   • Cross-lingual: train the topic model on documents in one language and test on documents in a different language.
2. Topic coherence [Lau et al. 2014]
   • Intrinsic: use the training documents as the reference corpus to measure local interpretability.
   • Extrinsic: use a large dataset (i.e., all of Wikipedia) as the reference corpus to measure global interpretability.
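The coherence metric can be sketched as NPMI averaged over word pairs, following the general recipe of Lau et al.; the reference corpus and topics below are toy examples, and the handling of never-co-occurring pairs is a simplifying assumption.

```python
import math
from itertools import combinations

# Toy reference corpus: each document is its set of word types.
docs = [{"cat", "dog", "pet"}, {"cat", "dog"}, {"stock", "market"},
        {"market", "trade", "stock"}, {"dog", "pet"}]

def npmi_coherence(topic_words, docs):
    """Average NPMI over all pairs of the topic's top words."""
    n = len(docs)
    def p(*ws):  # document co-occurrence probability
        return sum(all(w in d for w in ws) for d in docs) / n
    scores = []
    for w1, w2 in combinations(topic_words, 2):
        joint = p(w1, w2)
        if joint == 0:
            scores.append(-1.0)          # never co-occur: minimum NPMI
        else:
            pmi = math.log(joint / (p(w1) * p(w2)))
            scores.append(pmi / -math.log(joint))
    return sum(scores) / len(scores)

print(npmi_coherence(["cat", "dog", "pet"], docs))    # coherent: higher
print(npmi_coherence(["cat", "stock", "pet"], docs))  # mixed: lower
```

Using the training documents as `docs` gives the intrinsic score; swapping in a large external corpus gives the extrinsic one.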
-
Comparing Models
Classification accuracy (i = intra-lingual, c = cross-lingual; the second language is zh for Wikipedia and Amazon, si for LORELEI)

Dataset    Method                  en-i    zh/si-i  en-c    zh/si-c
Wikipedia  Multilingual anchoring  69.5%   71.2%    50.4%   47.8%
           MTAnchor (maximum)      80.7%   75.3%    57.6%   54.5%
           MTAnchor (median)       69.5%   71.4%    50.3%   47.2%
           MCTA                    51.6%   33.4%    23.2%   39.8%
Amazon     Multilingual anchoring  59.8%   61.1%    51.7%   53.2%
           MCTA                    49.5%   50.6%    50.3%   49.5%
LORELEI    Multilingual anchoring  20.8%   32.7%    24.5%   24.7%
           MCTA                    13.0%   26.5%    4.1%    15.6%
-
Comparing Models
Topic coherence (i = intrinsic, e = extrinsic; the second language is zh for Wikipedia and Amazon, si for LORELEI)

Dataset    Method                  en-i    zh/si-i  en-e    zh/si-e
Wikipedia  Multilingual anchoring  0.14    0.18     0.08    0.13
           MTAnchor (maximum)      0.20    0.20     0.10    0.15
           MTAnchor (median)       0.14    0.18     0.08    0.13
           MCTA                    0.13    0.09     0.00    0.04
Amazon     Multilingual anchoring  0.07    0.06     0.03    0.05
           MCTA                    -0.03   0.02     0.02    0.01
LORELEI    Multilingual anchoring  0.08    0.00     0.03    n/a
           MCTA                    0.13    0.00     0.04    n/a
-
Multilingual Anchoring Is Much Faster
[Figure: F1 score over training time (minutes), training and testing on Chinese and English, for MCTA, MTAnchor (max), and multilingual anchoring; the anchoring methods reach their final F1 far sooner than MCTA]
-
Improving Topics Through Interactivity
[Figure: users' F1 scores improve over roughly 20 minutes of interaction, on both the Chinese and English development sets]
-
Comparing Topics
Dataset    Method                  Topic
Wikipedia  MCTA                    dog san movie mexican fighter novel california
                                   主演 改編 本 小說 拍攝 角色 戰士
           Multilingual anchoring  adventure daughter bob kong hong robert movie
                                   主演 改編 本片 飾演 冒險 講述 編劇
           MTAnchor                kong hong movie office martial box reception
                                   主演 改編 飾演 本片 演員 編劇 講述
Amazon     MCTA                    woman food eat person baby god chapter
                                   來貨 頂頂 水 耳機 貨物 張傑 傑 同樣
           Multilingual anchoring  eat diet food recipe healthy lose weight
                                   健康 幫 吃 身體 全面 同事 中醫
LORELEI    MCTA                    help need flood relief please families needed victim
           Multilingual anchoring  aranayake warning landslide site missing nbro areas
-
Why Not Use Deep Learning?
• Neural networks are data-hungry and unsuitable for low-resource languages.
• Deep learning models take a long time to train.
• Pathologies of neural models make interpretation difficult [Feng et al. 2018].
-
Summary
• The anchoring algorithm can be applied in multilingual settings.
• People can provide helpful linguistic or cultural knowledge to construct better multilingual topic models.
-
Future Work
• Apply human-in-the-loop algorithms to other tasks in NLP.
• Better understand the effect of human feedback on cross-lingual representation learning.
-
ALTO: Active Learning with Topic Overviews for Speeding Label Induction and Document Labeling
Forough Poursabzi-Sangdeh, Jordan Boyd-Graber, Leah Findlater, and Kevin Seppi. Association for Computational Linguistics, 2016.
-
Many Documents
-
Sort into Categories
-
Evaluation
• User study
  • 40 minutes
  • Sort documents into categories
  • What information / interface helps best?
• Train a classifier on human examples (don't tell them how many labels)
• Compare classifier labels to expert judgements (purity):

purity(U, G) = (1/N) Σ_l max_j |U_l ∩ G_j|
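The purity formula above translates directly to code; the two example labelings here are invented.

```python
from collections import Counter

def purity(user_labels, gold_labels):
    """purity(U, G) = (1/N) * sum over user clusters U_l of the size of the
    largest overlap with any gold cluster G_j."""
    clusters = {}
    for u, g in zip(user_labels, gold_labels):
        clusters.setdefault(u, []).append(g)
    n = len(user_labels)
    # For each user cluster, the dominant gold label's count is max_j |U_l ∩ G_j|.
    return sum(Counter(g).most_common(1)[0][1] for g in clusters.values()) / n

U = ["a", "a", "a", "b", "b", "b"]   # a participant's clustering
G = ["x", "x", "y", "y", "y", "z"]   # expert labels
print(purity(U, G))                  # 2 of 3 agree in each cluster -> 4/6
```

A purity of 1.0 means every user cluster is pure with respect to the expert labels; random clusterings score much lower.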
-
Which is more Useful?
Who should drive?
-
Which is more Useful?
Active Learning or Topic Models?
-
Direct users to document
-
[Figure: median purity (over participants) vs. elapsed time (min); condition: No Organization with Active Learning]
Active learning if time is short
-
[Figure: median purity vs. elapsed time; conditions: No Organization with Active Learning vs. No Organization, no Active Learning]
Better than status quo
-
[Figure: median purity vs. elapsed time; conditions: No Organization with Active Learning; No Organization, no Active Learning; Topics with Active Learning]
Active learning can help topic models
-
[Figure: median purity vs. elapsed time; conditions: No Organization with Active Learning; No Organization, no Active Learning; Topics with Active Learning; Topics, no Active Learning]
Topic models help users understand the collection
-
[Figure: median purity vs. elapsed time for all four conditions]
Moral: machines and humans together (if you let them)
-
Ongoing and Future Work
• Embedding interactivity in applications
• Visualizations to measure machine learning explainability
• Using morphology in infinite representations
• Multilingual analysis
-
Thanks
Collaborators
Yuening Hu (UMD), Ke Zhai (UMD), Viet-An Nguyen (UMD), Dave Blei (Princeton), Jonathan Chang (Facebook), Philip Resnik (UMD), Christiane Fellbaum (Princeton), Jerry Zhu (Wisconsin), Sean Gerrish (Sift), Chong Wang (CMU), Dan Osherson (Princeton), Sinead Williamson (CMU)
Funders
-
References
Alvin Grissom II, He He, Jordan Boyd-Graber, and John Morgan. 2014. Don't until the final verb wait: Reinforcement learning for simultaneous machine translation. In Empirical Methods in Natural Language Processing.
Sanjeev Arora, Rong Ge, and Ankur Moitra. 2012. Learning topic models: going beyond SVD. In Foundations of Computer Science (FOCS).
Sanjeev Arora, Rong Ge, Yonatan Halpern, David Mimno, Ankur Moitra, David Sontag, Yichen Wu, and Michael Zhu. 2013. A practical algorithm for topic modeling with provable guarantees. In Proceedings of the International Conference of Machine Learning.
David M. Blei, Andrew Ng, and Michael Jordan. 2003a. Latent Dirichlet allocation. Journal of Machine Learning Research, 3.
David M. Blei, Andrew Y. Ng, and Michael I. Jordan. 2003b. Latent Dirichlet allocation. Journal of Machine Learning Research.
Jordan L. Boyd-Graber, Sonya S. Nikolova, Karyn A. Moffatt, Kenrick C. Kin, Joshua Y. Lee, Lester W. Mackey, Marilyn M. Tremaine, and Maria M. Klawe. 2006. Participatory design with proxies: Developing a desktop-PDA system to support people with aphasia. In International Conference on Human Factors in Computing Systems.
Shi Feng, Eric Wallace, Alvin Grissom II, Pedro Rodriguez, Mohit Iyyer, and Jordan Boyd-Graber. 2018. Pathologies of neural models make interpretation difficult. In Proceedings of Empirical Methods in Natural Language Processing.
He He, Jordan Boyd-Graber, and Hal Daumé III. 2016. Opponent modeling in deep reinforcement learning. In International Conference on Machine Learning.
Thomas Hofmann. 1999. Probabilistic latent semantic analysis. In Proceedings of Uncertainty in Artificial Intelligence.
Yuening Hu, Jordan Boyd-Graber, Hal Daumé III, and Z. Irene Ying. 2013. Binary to bushy: Bayesian hierarchical clustering with the beta coalescent. In Neural Information Processing Systems.
Yuening Hu, Jordan Boyd-Graber, Brianna Satinoff, and Alison Smith. 2014a. Interactive topic modeling.
Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014b. Polylingual tree-based topic models for translation domain adaptation. In Proceedings of the Association for Computational Linguistics.
Yuening Hu, Ke Zhai, Vlad Eidelman, and Jordan Boyd-Graber. 2014c. Polylingual tree-based topic models for translation domain adaptation. In Association for Computational Linguistics.
Mohit Iyyer, Varun Manjunatha, Jordan Boyd-Graber, and Hal Daumé III. 2015. Deep unordered composition rivals syntactic methods for text classification. In Association for Computational Linguistics.
Jagadeesh Jagarlamudi and Hal Daumé. 2010. Extracting multilingual topics from unaligned comparable corpora. In Proceedings of the European Conference on Information Retrieval.
T. Landauer and S. Dumais. 1997. Solutions to Plato's problem: The latent semantic analysis theory of acquisition, induction and representation of knowledge. Psychological Review, (104).
Jey Han Lau, David Newman, and Timothy Baldwin. 2014. Machine reading tea leaves: Automatically evaluating topic coherence and topic model quality. In Proceedings of the European Chapter of the Association for Computational Linguistics.
Jeffrey Lund, Connor Cook, Kevin Seppi, and Jordan Boyd-Graber. 2017. Tandem anchoring: A multiword anchor approach for interactive topic modeling. In Proceedings of the Association for Computational Linguistics.
Xiaojuan Ma and Perry R. Cook. 2009. How well do visual verbs work in daily communication for young and old adults? In International Conference on Human Factors in Computing Systems, pages 361-364.
David Mimno, Hanna M. Wallach, Jason Naradowsky, David A. Smith, and Andrew McCallum. 2009. Polylingual topic models. In Proceedings of Empirical Methods in Natural Language Processing.
Viet-An Nguyen, Jordan Boyd-Graber, and Stephen Altschul. 2013. Dirichlet mixtures, the Dirichlet process, and the structure of protein space. Journal of Computational Biology, 20(1).
Sonya S. Nikolova, Jordan Boyd-Graber, Christiane Fellbaum, and Perry Cook. 2009. Better vocabularies for assistive communication aids: Connecting terms using semantic networks and untrained annotators. In ACM Conference on Computers and Accessibility.
Bei Shi, Wai Lam, Lidong Bing, and Yinqing Xu. 2016. Detecting common discussion topics across culture from news reader comments. In Proceedings of the Association for Computational Linguistics.
-
Latent Dirichlet Allocation: A Generative Model
• Focus in this talk: statistical methods
• Model: a story of how your data came to be
• Latent variables: the missing pieces of your story
• Statistical inference: filling in those missing pieces
• We use latent Dirichlet allocation (LDA) [Blei et al. 2003a], a fully Bayesian version of pLSI [Hofmann 1999], itself a probabilistic version of LSA [Landauer and Dumais 1997]
-
Latent Dirichlet Allocation: A Generative Model
[Plate diagram: λ → β_k (plate over K topics); α → θ_d → z_n → w_n (plates over M documents, N words), with w_n also depending on β_{z_n}]
• For each topic k ∈ {1, . . . , K}, draw a multinomial distribution β_k from a Dirichlet distribution with parameter λ
• For each document d ∈ {1, . . . , M}, draw a multinomial distribution θ_d from a Dirichlet distribution with parameter α
• For each word position n ∈ {1, . . . , N}, select a hidden topic z_n from the multinomial distribution parameterized by θ_d
• Choose the observed word w_n from the distribution β_{z_n}
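The generative story above can be sketched directly in code (a toy sketch; the corpus sizes and hyperparameter values here are illustrative assumptions, not from the talk):

```python
import numpy as np

rng = np.random.default_rng(0)
K, M, N, V = 3, 5, 20, 50          # topics, documents, words per doc, vocabulary (toy sizes)
alpha, lam = 0.1, 0.01             # symmetric Dirichlet hyperparameters (illustrative)

# Draw each topic's word distribution beta_k ~ Dirichlet(lambda)
beta = rng.dirichlet(np.full(V, lam), size=K)

docs = []
for d in range(M):
    theta = rng.dirichlet(np.full(K, alpha))             # document-topic proportions
    z = rng.choice(K, size=N, p=theta)                   # hidden topic per word position
    w = np.array([rng.choice(V, p=beta[k]) for k in z])  # observed words
    docs.append(w)
```

Inference reverses this process: given only `docs`, recover plausible values for `beta`, `theta`, and `z`.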
We use statistical inference to uncover the most likely unobservedvariables given observed data.
-
[Figure: the LDA plate diagram applied to the example corpus of headlines; each inferred topic is a distribution over words]
"TECHNOLOGY": computer, technology, system, service, site, phone, internet, machine
"BUSINESS": sell, sale, store, product, business, advertising, market, consumer
"ENTERTAINMENT": play, film, movie, theater, production, star, director, stage
-
Inference
• We are interested in the posterior distribution
p(Z | X, Θ)  (2)
• Here, the latent variables are the topic assignments z and topics θ. X is the words (divided into documents), and Θ are the hyperparameters of the Dirichlet distributions: α for topic proportions, λ for topics.
p(z, β, θ | w, α, λ)  (3)
\[
p(w, z, \theta, \beta \mid \alpha, \lambda) = \prod_k p(\beta_k \mid \lambda) \prod_d p(\theta_d \mid \alpha) \prod_n p(z_{d,n} \mid \theta_d)\, p(w_{d,n} \mid \beta_{z_{d,n}})
\]
-
Gibbs Sampling
• A form of Markov chain Monte Carlo
• The chain is a sequence of random variable states
• Given certain technical conditions, repeatedly drawing z_k ∼ p(z_k | z_1, . . . , z_{k−1}, z_{k+1}, . . . , z_N, X, Θ) for every k yields a Markov chain whose stationary distribution is the posterior
• For notational convenience, call z with z_{d,n} removed z_{−d,n}
-
Inference
Hollywood studios are preparing to let people download and buy electronic copies of movies over the Internet, much as record labels now sell songs for 99 cents through Apple Computer's iTunes music store and other online services ...

[Figure: each word of the excerpt is assigned to one of three topics]
computer, technology, system, service, site, phone, internet, machine
play, film, movie, theater, production, star, director, stage
sell, sale, store, product, business, advertising, market, consumer
-
Gibbs Sampling
• For LDA, we will sample the topic assignments
• Thus, we want:
\[
p(z_{d,n} = k \mid z_{-d,n}, w, \alpha, \lambda) = \frac{p(z_{d,n} = k, z_{-d,n} \mid w, \alpha, \lambda)}{p(z_{-d,n} \mid w, \alpha, \lambda)}
\]
• The topics and per-document topic proportions are integrated out / marginalized / collapsed
• Let n_{d,i} be the number of words taking topic i in document d. Let v_{k,w} be the number of times word w is used in topic k.
\[
= \frac{\int_{\theta_d} \Bigl(\prod_{i \neq k} \theta_{d,i}^{\alpha_i + n_{d,i} - 1}\Bigr)\, \theta_{d,k}^{\alpha_k + n_{d,k}}\, d\theta_d}{\int_{\theta_d} \prod_i \theta_{d,i}^{\alpha_i + n_{d,i} - 1}\, d\theta_d} \cdot \frac{\int_{\beta_k} \Bigl(\prod_{i \neq w_{d,n}} \beta_{k,i}^{\lambda_i + v_{k,i} - 1}\Bigr)\, \beta_{k,w_{d,n}}^{\lambda_{w_{d,n}} + v_{k,w_{d,n}}}\, d\beta_k}{\int_{\beta_k} \prod_i \beta_{k,i}^{\lambda_i + v_{k,i} - 1}\, d\beta_k}
\]
-
Gibbs Sampling
• For LDA, we will sample the topic assignments
• The topics and per-document topic proportions are integrated out / marginalized / Rao-Blackwellized
• Thus, we want:
\[
p(z_{d,n} = k \mid z_{-d,n}, w, \alpha, \lambda) = \frac{n_{d,k} + \alpha_k}{\sum_i^K n_{d,i} + \alpha_i} \cdot \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i v_{k,i} + \lambda_i}
\]
-
Gibbs Sampling
• Each integral is the normalizer of a Dirichlet distribution:
\[
\int_{\beta_k} \Bigl(\prod_i \beta_{k,i}^{\lambda_i + v_{k,i} - 1}\Bigr)\, d\beta_k = \frac{\prod_i^V \Gamma(\lambda_i + v_{k,i})}{\Gamma\bigl(\sum_i^V \lambda_i + v_{k,i}\bigr)}
\]
• So we can simplify:
\[
\frac{\int_{\theta_d} \Bigl(\prod_{i \neq k} \theta_{d,i}^{\alpha_i + n_{d,i} - 1}\Bigr)\, \theta_{d,k}^{\alpha_k + n_{d,k}}\, d\theta_d}{\int_{\theta_d} \prod_i \theta_{d,i}^{\alpha_i + n_{d,i} - 1}\, d\theta_d} \cdot \frac{\int_{\beta_k} \Bigl(\prod_{i \neq w_{d,n}} \beta_{k,i}^{\lambda_i + v_{k,i} - 1}\Bigr)\, \beta_{k,w_{d,n}}^{\lambda_{w_{d,n}} + v_{k,w_{d,n}}}\, d\beta_k}{\int_{\beta_k} \prod_i \beta_{k,i}^{\lambda_i + v_{k,i} - 1}\, d\beta_k}
= \frac{\frac{\Gamma(\alpha_k + n_{d,k} + 1) \prod_{i \neq k}^K \Gamma(\alpha_i + n_{d,i})}{\Gamma\bigl(\sum_i^K \alpha_i + n_{d,i} + 1\bigr)}}{\frac{\prod_i^K \Gamma(\alpha_i + n_{d,i})}{\Gamma\bigl(\sum_i^K \alpha_i + n_{d,i}\bigr)}}
\cdot \frac{\frac{\Gamma(\lambda_{w_{d,n}} + v_{k,w_{d,n}} + 1) \prod_{i \neq w_{d,n}}^V \Gamma(\lambda_i + v_{k,i})}{\Gamma\bigl(\sum_i^V \lambda_i + v_{k,i} + 1\bigr)}}{\frac{\prod_i^V \Gamma(\lambda_i + v_{k,i})}{\Gamma\bigl(\sum_i^V \lambda_i + v_{k,i}\bigr)}}
\]
-
Gamma Function Identity
\[
z = \frac{\Gamma(z + 1)}{\Gamma(z)} \tag{4}
\]
\[
\frac{\frac{\Gamma(\alpha_k + n_{d,k} + 1) \prod_{i \neq k}^K \Gamma(\alpha_i + n_{d,i})}{\Gamma\bigl(\sum_i^K \alpha_i + n_{d,i} + 1\bigr)}}{\frac{\prod_i^K \Gamma(\alpha_i + n_{d,i})}{\Gamma\bigl(\sum_i^K \alpha_i + n_{d,i}\bigr)}}
\cdot \frac{\frac{\Gamma(\lambda_{w_{d,n}} + v_{k,w_{d,n}} + 1) \prod_{i \neq w_{d,n}}^V \Gamma(\lambda_i + v_{k,i})}{\Gamma\bigl(\sum_i^V \lambda_i + v_{k,i} + 1\bigr)}}{\frac{\prod_i^V \Gamma(\lambda_i + v_{k,i})}{\Gamma\bigl(\sum_i^V \lambda_i + v_{k,i}\bigr)}}
= \frac{n_{d,k} + \alpha_k}{\sum_i^K n_{d,i} + \alpha_i} \cdot \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i v_{k,i} + \lambda_i}
\]
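The identity follows from Γ(z + 1) = z Γ(z); a quick numerical check (illustrative values only):

```python
import math

# Gamma(z + 1) = z * Gamma(z), so the ratio Gamma(z + 1) / Gamma(z) collapses to z.
for z in [0.5, 1.0, 3.7, 10.0]:
    assert math.isclose(math.gamma(z + 1) / math.gamma(z), z, rel_tol=1e-12)
```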
-
Gibbs Sampling Equation
\[
\frac{n_{d,k} + \alpha_k}{\sum_i^K n_{d,i} + \alpha_i} \cdot \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i v_{k,i} + \lambda_i} \tag{5}
\]
• n_{d,k}: number of times document d uses topic k
• v_{k,w_{d,n}}: number of times topic k uses word type w_{d,n}
• α_k: Dirichlet parameter for the document-to-topic distribution
• λ_{w_{d,n}}: Dirichlet parameter for the topic-to-word distribution
• First factor: how much this document likes topic k
• Second factor: how much this topic likes word w_{d,n}
-
Sample Document
[Figure: a toy document used as a running example]
-
Randomly Assign Topics
-
Total Topic Counts
Sampling Equation
\[
\frac{n_{d,k} + \alpha_k}{\sum_i^K n_{d,i} + \alpha_i} \cdot \frac{v_{k,w_{d,n}} + \lambda_{w_{d,n}}}{\sum_i v_{k,i} + \lambda_i}
\]
-
We want to sample this word . . .
-
Decrement its count
-
What is the conditional distribution for this topic?
-
Part 1: How much does this document like each topic?
-
Part 2: How much does each topic like the word?
-
Geometric interpretation
-
Update counts
-
Details: how to sample from a distribution
[Figure: for Topics 1-5, the Gamma-function ratios simplify (via z = Γ(z + 1)/Γ(z), Eq. 6) to the unnormalized weights (n_{d,k} + α_k)/(∑_i^K n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}})/(∑_i v_{k,i} + λ_i); the weights are normalized onto the interval [0.0, 1.0], and a uniform draw (0.112 in the illustration) selects the topic whose segment contains it]
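The normalize-and-draw step in the figure corresponds to inverse-CDF sampling over the unnormalized topic weights (a minimal sketch; the weights below are made-up numbers):

```python
import random

def sample_index(weights, rng=random.random):
    """Draw an index with probability proportional to its unnormalized weight."""
    u = rng() * sum(weights)       # scale the uniform draw instead of normalizing
    cumulative = 0.0
    for k, w in enumerate(weights):
        cumulative += w
        if u < cumulative:
            return k
    return len(weights) - 1        # guard against floating-point round-off

random.seed(0)
counts = [0, 0, 0]
for _ in range(10_000):
    counts[sample_index([1.0, 2.0, 7.0])] += 1
# counts end up roughly proportional to 1 : 2 : 7
```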
-
Naïve Implementation

Algorithm
1. For each iteration i:
   1.1 For each document d and word n currently assigned to z_old:
       1.1.1 Decrement n_{d,z_old} and v_{z_old, w_{d,n}}
       1.1.2 Sample z_new = k with probability proportional to (n_{d,k} + α_k)/(∑_i^K n_{d,i} + α_i) · (v_{k,w_{d,n}} + λ_{w_{d,n}})/(∑_i v_{k,i} + λ_i)
       1.1.3 Increment n_{d,z_new} and v_{z_new, w_{d,n}}
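A direct, unoptimized translation of this loop (a self-contained toy sketch: the three-document corpus, K, V, and the hyperparameters are invented for illustration; the first factor's denominator is dropped because it is constant in k and cancels when normalizing):

```python
import numpy as np

rng = np.random.default_rng(0)
K, V = 2, 6
alpha = np.full(K, 0.5)            # document-topic hyperparameter
lam = np.full(V, 0.1)              # topic-word hyperparameter
docs = [[0, 1, 2, 0], [3, 4, 5, 3], [0, 2, 4, 5]]   # word ids per document

# n[d, k]: words in document d assigned topic k; v[k, w]: word w assigned topic k
z = [[int(rng.integers(K)) for _ in doc] for doc in docs]
n = np.zeros((len(docs), K))
v = np.zeros((K, V))
for d, doc in enumerate(docs):
    for i, w in enumerate(doc):
        n[d, z[d][i]] += 1
        v[z[d][i], w] += 1

for _ in range(50):                                  # sampling iterations
    for d, doc in enumerate(docs):
        for i, w in enumerate(doc):
            old = z[d][i]
            n[d, old] -= 1; v[old, w] -= 1           # 1.1.1 decrement
            p = (n[d] + alpha) * (v[:, w] + lam[w]) / (v.sum(axis=1) + lam.sum())
            new = int(rng.choice(K, p=p / p.sum()))  # 1.1.2 sample z_new
            n[d, new] += 1; v[new, w] += 1           # 1.1.3 increment
            z[d][i] = new
```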
-
Desiderata
• Hyperparameters: sample them too (slice sampling)
• Initialization: random
• Sampling: until the likelihood converges
• Lag / burn-in: difference of opinion on this
• Number of chains: should run more than one
-
Available implementations
• Mallet (http://mallet.cs.umass.edu)
• LDA-C (http://www.cs.princeton.edu/~blei/lda-c)
• Topicmod (http://code.google.com/p/topicmod)
-
SHLDA Model
[Plate diagram of the SHLDA model]
-
Infvoc Classification Accuracy
(S = 155, τ0 = 64, κ = 0.6)

    infvoc       αβ = 3k, T = 40k, U = 10    52.683
    fixvoc       vb-dict                     45.514
    fixvoc       vb-null                     49.390
    fixvoc       hybrid-dict                 46.720
    fixvoc       hybrid-null                 50.474
    fixvoc-hash  vb-dict                     52.525
    fixvoc-hash  vb-full, T = 30k            51.653
    fixvoc-hash  hybrid-dict                 50.948
    fixvoc-hash  hybrid-full, T = 30k        50.948
    dtm-dict     tcv = 0.001                 62.845

Table: Classification accuracy based on 50 topic features extracted from 20 newsgroups data.

Topics learned with hashing are no longer interpretable; they can only be used as features.
-
Unassign(d, n, w_{d,n}, z_{d,n} = k)
1: T: T_{d,k} ← T_{d,k} − 1
2: If w_{d,n} ∉ Ω_old:
     P: P_{k,w_{d,n}} ← P_{k,w_{d,n}} − 1
3: Else, suppose w_{d,n} ∈ Ω_{old,m}:
     P: P_{k,m} ← P_{k,m} − 1
     W: W_{k,m,w_{d,n}} ← W_{k,m,w_{d,n}} − 1
-
SparseLDA
\[
p(z = k \mid Z_-, w) \propto (\alpha_k + n_{k|d}) \frac{\beta + n_{w|k}}{\beta V + n_{\cdot|k}} \tag{6}
\]
\[
\propto \underbrace{\frac{\alpha_k \beta}{\beta V + n_{\cdot|k}}}_{s_{\text{LDA}}} + \underbrace{\frac{n_{k|d}\, \beta}{\beta V + n_{\cdot|k}}}_{r_{\text{LDA}}} + \underbrace{\frac{(\alpha_k + n_{k|d})\, n_{w|k}}{\beta V + n_{\cdot|k}}}_{q_{\text{LDA}}}
\]
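The decomposition just splits the numerator (α_k + n_{k|d})(β + n_{w|k}) into three terms; a numerical check with made-up counts (`n_kd`, `n_wk`, and `n_k` are invented for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)
K, V, beta = 4, 10, 0.1
alpha = np.full(K, 0.3)
n_kd = rng.integers(0, 5, size=K).astype(float)    # topic counts in this document
n_wk = rng.integers(0, 3, size=K).astype(float)    # counts of word w per topic
n_k = n_wk + rng.integers(5, 20, size=K)           # total tokens per topic

denom = beta * V + n_k
dense = (alpha + n_kd) * (beta + n_wk) / denom     # full conditional, per topic

s = alpha * beta / denom                   # "smoothing" bucket: constant everywhere
r = n_kd * beta / denom                    # nonzero only for topics used in this document
q = (alpha + n_kd) * n_wk / denom          # nonzero only for topics that use word w
assert np.allclose(dense, s + r + q)
```

SparseLDA exploits the fact that `r` and `q` are sparse: only topics active in the current document, or topics that use the current word, need to be touched per sample.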
-
Tree-based sampling
\[
p(z_{d,n} = k, l_{d,n} = \lambda \mid Z_-, L_-, w_{d,n}) \propto (\alpha_k + n_{k|d}) \prod_{(i \to j) \in \lambda} \frac{\beta_{i \to j} + n_{i \to j|k}}{\sum_{j'} (\beta_{i \to j'} + n_{i \to j'|k})} \tag{7}
\]
-
Factorizing Tree-Based Prior
\[
p(z = k \mid Z_-, w) \propto (\alpha_k + n_{k|d}) \frac{\beta + n_{w|k}}{\beta V + n_{\cdot|k}} \tag{8}
\]
\[
\propto \underbrace{\frac{\alpha_k \beta}{\beta V + n_{\cdot|k}}}_{s_{\text{LDA}}} + \underbrace{\frac{n_{k|d}\, \beta}{\beta V + n_{\cdot|k}}}_{r_{\text{LDA}}} + \underbrace{\frac{(\alpha_k + n_{k|d})\, n_{w|k}}{\beta V + n_{\cdot|k}}}_{q_{\text{LDA}}}
\]
\[
s = \sum_{k,\lambda} \frac{\alpha_k \prod_{(i \to j) \in \lambda} \beta_{i \to j}}{\prod_{(i \to j) \in \lambda} \sum_{j'} (\beta_{i \to j'} + n_{i \to j'|k})}
\le \sum_{k,\lambda} \frac{\alpha_k \prod_{(i \to j) \in \lambda} \beta_{i \to j}}{\prod_{(i \to j) \in \lambda} \sum_{j'} \beta_{i \to j'}} = s'. \tag{9}
\]
-
1: for word w in this document do
2:   sample = rand() * (s' + r + q)
3:   if sample < s' then
4:     compute s
5:     sample = sample * (s + r + q) / (s' + r + q)
6:     if sample < s then
7:       return topic k and path λ sampled from s
8:     end if
9:     sample −= s
10:  else
11:    sample −= s'
12:  end if
13:  if sample < r then
14:    return topic k and path λ sampled from r
15:  end if
16:  sample −= r
17:  return topic k and path λ sampled from q
18: end for
-
Number of Topics

                T50     T100    T200    T500
    Naïve       5.700   12.655  29.200  71.223
    Fast        4.935   9.222   17.559  40.691
    Fast-RB     2.937   4.037   5.880   8.551
    Fast-RB-sD  2.675   3.795   5.400   8.363
    Fast-RB-sW  2.449   3.363   4.894   7.404
    Fast-RB-sDW 2.225   3.241   4.672   7.424

Number of Correlations

                C50     C100    C200    C500
    Naïve       11.166  12.586  13.000  15.377
    Fast        8.889   9.165   9.177   8.079
    Fast-RB     3.995   4.078   3.858   3.156
    Fast-RB-sD  3.660   3.795   3.593   3.065
    Fast-RB-sW  3.272   3.363   3.308   2.787
    Fast-RB-sDW 3.026   3.241   3.091   2.627