Semantic History Embedding in Online Generative Topic Models
Pu Wang (presenter)
Authors: Loulwah AlSumait ([email protected]), Daniel Barbará ([email protected]), Carlotta Domeniconi ([email protected])
Department of Computer Science, George Mason University
SDM 2009
Outline
Introduction and related work
Online LDA (OLDA)
Parameter generation
Sliding history window
Contribution weights
Experiments
Conclusion and future work
Introduction
When a topic is observed at a certain time, it is more likely to appear in the future.
Previously discovered topics hold important information about the underlying structure of the data.
Incorporating such information in future knowledge discovery can enhance the inferred topics.
Related Work
Q. Sun, R. Li et al., ACL 2008: an LDA-based Fisher kernel to measure the semantic similarity between blocks of documents under an LDA model.
X. Wang et al., ICDM 2007: a Topical N-Gram model that automatically identifies feasible n-grams based on the context that surrounds them.
X. Phan et al., IW3C2 2008: a classifier trained on a small set of labeled documents together with an LDA topic model estimated from Wikipedia.
Tracking Topics
[Diagram: OLDA graphical model over two consecutive time slices t and t+1 (time between t and t+1 = ε). Each slice is an LDA plate (K topics; M^t documents; N_d words per document; latent topic assignments z_i^t; observed words w_i^t; stream S^t). The topics inferred at time t feed priors construction, topic evolution tracking, and emerging topic detection, producing the emerging topic list at time t+1.]
Online LDA (OLDA)
Inference Process
Collapsed Gibbs sampling conditional for the topic assignment of token i (word w_i = v in document d_i) at time t:

P(z_i = j \mid w_i = v, \mathbf{z}_{-i}, \mathbf{w}_{-i}, \alpha^t, \beta^t) \;\propto\;
\frac{C^{VK}_{vj,-i} + \beta^t_{vj}}{\sum_{v'=1}^{V}\bigl(C^{VK}_{v'j,-i} + \beta^t_{v'j}\bigr)} \cdot
\frac{C^{DK}_{d_i j,-i} + \alpha^t_{d_i j}}{\sum_{k=1}^{K}\bigl(C^{DK}_{d_i k,-i} + \alpha^t_{d_i k}\bigr)}

The count matrices C^{VK} (word-topic) and C^{DK} (document-topic) are computed from the current stream; the priors β^t and α^t embed the historic observations.
Parameter Generation
A simple inference problem solved with Gibbs sampling: the counts reflect the current stream, while the priors encode the historic observations.
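A minimal sketch of this collapsed Gibbs update in Python/NumPy, assuming dense count matrices; the function and variable names are illustrative and do not come from the Matlab Topic Modeling Toolbox used in the experiments.

```python
import numpy as np

def gibbs_resample_token(v, d, old_k, C_vk, C_dk, C_k, alpha, beta):
    """Resample the topic of a single token using the conditional above.

    v, d, old_k : word id, document id, and the token's current topic
    C_vk : V x K word-topic counts from the current stream
    C_dk : D x K document-topic counts from the current stream
    C_k  : length-K total token count per topic
    alpha: D x K prior on document-topic proportions (alpha^t)
    beta : V x K prior on topic-word distributions (beta^t, carries the history)
    """
    # remove the token from the counts (the "-i" in the conditional)
    C_vk[v, old_k] -= 1
    C_dk[d, old_k] -= 1
    C_k[old_k] -= 1

    # p(z_i = j) ∝ (C_vj + beta_vj) / sum_v(C_vj + beta_vj) * (C_dj + alpha_dj)
    word_term = (C_vk[v] + beta[v]) / (C_k + beta.sum(axis=0))
    doc_term = C_dk[d] + alpha[d]      # its denominator is constant in j, so it cancels
    p = word_term * doc_term
    new_k = np.random.choice(len(p), p=p / p.sum())

    # add the token back under its new topic
    C_vk[v, new_k] += 1
    C_dk[d, new_k] += 1
    C_k[new_k] += 1
    return new_k
```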
Topic Evolution Tracking
Topic alignment over time; handles changes in the lexicon and topic drift.

Time t:
  Topic 1 (P(topic) = 0.65): Bank (0.44), money (0.35), loan (0.21)
  Topic 2 (P(topic) = 0.35): Factory (0.53), production (0.34), labor (0.13)
Time t+1 (aligned topics over time):
  Topic 1 (P(topic) = 0.43): Bank (0.5), credit (0.32), money (0.18)
  Topic 2 (P(topic) = 0.57): Factory (0.48), cost (0.32), manufacturing (0.2)
(The number after each word is P(word|topic).)
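OLDA aligns topics implicitly, since topic k at time t+1 is seeded with topic k's history from time t. Purely as an illustration of checking such an alignment (for example, against independently trained models), the sketch below matches topics across two slices by cosine similarity of their word distributions; the similarity measure and names are assumptions, not part of the method on the slides.

```python
import numpy as np

def align_topics(phi_t, phi_t1):
    """Greedily match topics between two time slices by cosine similarity.

    phi_t, phi_t1 : K x V topic-word distributions at times t and t+1.
    Returns a list of (topic_at_t, best_match_at_t+1, similarity).
    """
    # normalise rows so the dot product equals cosine similarity
    a = phi_t / np.linalg.norm(phi_t, axis=1, keepdims=True)
    b = phi_t1 / np.linalg.norm(phi_t1, axis=1, keepdims=True)
    sim = a @ b.T                      # K x K similarity matrix
    return [(k, int(sim[k].argmax()), float(sim[k].max()))
            for k in range(sim.shape[0])]
```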
Sliding History Window
Consider all topic-word distributions within a "sliding history window" of size δ.
Alternatives for keeping track of the history at time t (see the sketch below):
  full memory: δ = t
  short memory: δ = 1
  intermediate memory: δ = c
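One convenient way to keep such a window is a fixed-length deque that drops the oldest model automatically; a minimal sketch with illustrative variable names:

```python
from collections import deque

# Window of the last delta topic-word distributions (one K x V matrix per stream).
# delta = 1 is short memory; delta = t (unbounded) would be full memory.
delta = 3                              # "intermediate memory" setting
history = deque(maxlen=delta)          # old slices fall out automatically

# after processing stream t:
# history.append(phi_t)                # phi_t: K x V matrix inferred by OLDA at time t
```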
Evolution Matrix
[Diagram: for each topic, an evolution matrix with the dictionary along one axis and time along the other, storing that topic's word distribution over time.]
Contribution Control: Evolution Tuning Parameters ω
Individual weights of the models within the window:
  decaying history: ω1 < ω2 < … < ωδ
  equal contributions: ω1 = ω2 = … = ωδ
Total weight of the history (vs. the weight of new observations):
  balanced weights (sum = 1)
  biased toward the past (sum > 1)
  biased toward the future (sum < 1)
A sketch of constructing such weight vectors follows.
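A small sketch of building such weight vectors ω, assuming a geometric decay for the "decaying history" case (the slides only fix the ordering ω1 < … < ωδ, so the specific decay rate and function name are assumptions):

```python
import numpy as np

def make_weights(delta, kind="equal", total=1.0, decay=0.5):
    """Build the evolution tuning vector omega for a window of size delta.

    kind  : "equal" -> omega_1 = ... = omega_delta
            "decay" -> older slices get smaller weights (omega_delta is the most recent)
    total : sum(omega); 1 = balanced, > 1 biased toward the past, < 1 toward the future
    """
    if kind == "equal":
        w = np.ones(delta)
    else:
        w = decay ** np.arange(delta - 1, -1, -1)   # ..., decay^2, decay, 1
    return total * w / w.sum()
```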
Parameter Generation
The priors on the topic distributions over words at time t+1 are generated from the history:

\beta_k^{t+1} = B_k^t \, \omega

where B_k^t = \bigl[\beta_k^{t(1)}, \beta_k^{t(2)}, \ldots, \beta_k^{t(\delta)}\bigr] is the evolution matrix whose δ columns are topic k's word distributions over the sliding history window, and ω is the vector of contribution weights. The resulting β^{t+1} is then used to generate the topic distributions for the next stream. A numerical sketch follows.
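A numerical sketch of this construction, assuming the history window is kept as a list of K × V topic-word matrices (oldest first, as in the deque above); names are illustrative:

```python
import numpy as np

def next_beta(history, omega):
    """Compute the semantic prior beta^{t+1} from the history window.

    history : list of the last delta K x V topic-word distributions (oldest first),
              i.e. the columns of the evolution matrices B_k^t for all topics k
    omega   : length-delta weight vector (see make_weights above)
    Returns a K x V matrix whose row k is beta_k^{t+1} = B_k^t @ omega.
    """
    B = np.stack(history, axis=-1)     # K x V x delta
    return B @ omega                   # weighted sum over the window
```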
Experimental Design
Implementation: the "Matlab Topic Modeling Toolbox" by Mark Steyvers and Tom Griffiths.
Datasets:
  NIPS Proceedings, 1988-2000: 1,740 papers; 13,649 unique words; 2,301,375 word tokens; 13 streams of 90 to 250 documents each.
  Reuters-21578: news from 26-FEB-1987 to 19-OCT-1987: 10,337 documents; 12,112 unique words; 793,936 word tokens; 30 streams (29 streams of 340 documents, 1 of 517 documents).
Baselines: OLDAfixed (no memory) and OLDA with ω(1) (short memory).
Performance evaluation measure: perplexity, computed on the documents of the next year or stream (a sketch follows).
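Perplexity here is the usual per-word measure exp(−Σ_d log p(w_d) / Σ_d N_d). A minimal sketch, assuming the held-out documents' topic proportions θ have already been estimated (e.g. by folding them in); the helper name and arguments are illustrative:

```python
import numpy as np

def perplexity(test_docs, theta, phi):
    """Per-word perplexity of held-out documents.

    test_docs : list of arrays of word ids (one array per document)
    theta     : D x K document-topic proportions for the test documents
    phi       : K x V topic-word distributions of the trained model
    """
    log_lik, n_tokens = 0.0, 0
    for d, words in enumerate(test_docs):
        p_w = theta[d] @ phi[:, words]     # p(w|d) for every token in doc d
        log_lik += np.log(p_w).sum()
        n_tokens += len(words)
    return np.exp(-log_lik / n_tokens)
```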
Reuters: OLDA with fixed β vs. OLDA with semantic β
[Perplexity plot; the no-memory baseline is shown for comparison.]
Reuters: OLDA with different window sizes and weights
Increasing the window size enhanced prediction.
Incremental history information (δ > 1, sum > 1) did not improve topic estimation at all.
[Perplexity plot comparing short-memory, equal-contribution, and incremental-history settings across window sizes.]
NIPS: OLDA with different window sizes
Increasing the window size enhanced prediction w.r.t. short memory.
Window sizes greater than 3 enhanced prediction.
[Perplexity plot; no-memory and short-memory baselines shown. A companion plot examines the effect of the total weight.]
NIPS: OLDA with different total weights
Models with a lower total weight resulted in better prediction.
[Perplexity plot; no-memory baseline and sum of weights = 1 shown against settings that decrease the sum of weights.]
NIPS & Reuters: OLDA with different total weights
Variable sum(ω), with δ = 2.
[Perplexity plots comparing decreasing vs. increasing the total sum of weights.]
NIPS: OLDA with equal vs. decaying history contributions
Conclusions
Studied the effect of embedding semantic information in LDA topic modeling of text streams.
Parameters are generated from the topical structures inferred in the past.
Semantic embedding enhances OLDA prediction.
Examined the effect of the total influence of the history, the history window size, and equal vs. decaying contributions.
Future work
Use of prior knowledge.
Effect of embedded historic semantics on detecting emerging and/or periodic topics.