Storylines from Streaming Text
![Page 1: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/1.jpg)
Storylines from Streaming Text
The Infinite Topic Cluster Model
Amr Ahmed, Jake Eisenstein, Qirong Ho, Alex Smola, Choon Hui Teo, Eric Xing
Carnegie Mellon University, Yahoo! Research
![Page 2: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/2.jpg)
Outline
• Visualizing a news stream
• Goals
  Clusters & Content analysis
• The story-topic model
  Recurrent Chinese Restaurant Process
  Latent Dirichlet Allocation
• Examples
![Page 3: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/3.jpg)
News Stream
![Page 4: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/4.jpg)
News Stream
• Realtime news stream
  • Multiple sources (Reuters, AP, CNN, ...)
  • Same story from multiple sources
  • Stories are related
• Goals
  • Aggregate articles into a storyline
  • Analyze the storyline (topics, entities)
![Page 5: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/5.jpg)
Precursors
![Page 6: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/6.jpg)
Evolutionary Clustering / RCRP
• Assume active story distribution at time t
• Draw story indicator
• Draw words from story distribution
• Down-weight story counts for next day
Ahmed & Xing, 2008
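The steps on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: it looks back only one epoch, and the names `rcrp_prior`, `draw_story`, `NEW_STORY`, `alpha` and `decay` are assumptions made for the example.

```python
import random

NEW_STORY = "<new>"  # hypothetical sentinel for "open a brand-new story"

def rcrp_prior(prev_counts, curr_counts, alpha, decay):
    """Unnormalized prior over stories for the next article in the current epoch.

    prev_counts: story -> article count in the previous epoch
    curr_counts: story -> article count so far in the current epoch
    alpha:       mass reserved for a new story (assumed hyperparameter)
    decay:       down-weighting applied to the previous epoch's counts
    """
    weights = {}
    for story in set(prev_counts) | set(curr_counts):
        weights[story] = decay * prev_counts.get(story, 0) + curr_counts.get(story, 0)
    weights[NEW_STORY] = alpha
    return weights

def draw_story(weights, rng):
    """Draw a story indicator in proportion to the prior weights."""
    stories = list(weights)
    return rng.choices(stories, weights=[weights[s] for s in stories], k=1)[0]

if __name__ == "__main__":
    prior = rcrp_prior({"a": 4}, {"a": 1, "b": 2}, alpha=1.0, decay=0.5)
    print(prior, draw_story(prior, random.Random(0)))
```

Stories that were popular yesterday stay likely today, but their influence fades unless new articles keep arriving; `alpha` keeps the number of stories open-ended.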
![Page 7: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/7.jpg)
Clustering / RCRP
• Pro
  • Nonparametric model of story generation (no need to model frequency of stories)
  • No fixed number of stories
  • Efficient inference via collapsed sampler
• Con
  • We learn nothing!
  • No content analysis
![Page 8: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/8.jpg)
Latent Dirichlet Allocation
• Generate topic distribution per article
• Draw topics per word from topic distribution
• Draw words from topic-specific word distribution
Blei, Ng, Jordan, 2003
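The three generative steps above can be written out as a short Python sketch. It assumes the topic-specific word distributions are already given as dictionaries; the function names and the toy vocabulary are invented for illustration.

```python
import random

def sample_dirichlet(alpha, rng):
    """Draw from a Dirichlet by normalizing independent Gamma draws."""
    draws = [rng.gammavariate(a, 1.0) for a in alpha]
    total = sum(draws)
    return [d / total for d in draws]

def generate_article(topic_word, alpha, length, rng):
    """Generate one article; topic_word[k] maps word -> probability under topic k."""
    theta = sample_dirichlet(alpha, rng)       # topic distribution for this article
    topic_ids = list(range(len(topic_word)))
    words = []
    for _ in range(length):
        z = rng.choices(topic_ids, weights=theta, k=1)[0]   # topic for this word
        vocab = list(topic_word[z])
        probs = [topic_word[z][v] for v in vocab]
        w = rng.choices(vocab, weights=probs, k=1)[0]       # word from topic z
        words.append((z, w))
    return theta, words

if __name__ == "__main__":
    toy_topics = [
        {"golf": 0.7, "tournament": 0.3},
        {"election": 0.6, "vote": 0.4},
    ]
    theta, words = generate_article(toy_topics, [0.5, 0.5], 8, random.Random(0))
    print(theta, words)
```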
![Page 9: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/9.jpg)
Latent Dirichlet Allocation
• Pro
  • Topical analysis of stories
  • Topical analysis of words (meaning, saliency)
  • More documents improve estimates
• Con
  • No clustering
![Page 10: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/10.jpg)
More Issues
• Named entities are special, topics less so (e.g. Tiger Woods and his mistresses)
• Some stories are strange (topical mixture is not enough - dirty models)
• Articles deviate from general story (Hierarchical DP)
![Page 11: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/11.jpg)
Storylines
![Page 12: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/12.jpg)
Storylines Model
• Topic model
• Topics per cluster
• RCRP for cluster
• Hierarchical DP for article
• Separate model for named entities
• Story-specific correction
![Page 13: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/13.jpg)
Inference
• We receive articles as a stream
  Want topics & stories now
• Variational inference infeasible (RCRP, sparse to dense, vocabulary size)
• We have a ‘tracking problem’
  • Sequential Monte Carlo
  • Use sampled variables of surviving particle
  • Use ideas from Canini et al. 2009
![Page 14: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/14.jpg)
Estimation
• Proposal distribution: draw stories s, topics z using Gibbs sampling for each particle
• Reweight particle via the predictive likelihood p(new data | past state)
• Resample particles if l2 norm of the weights too large (resample some assignments for diversity, too)
• Compare to multiplicative updates algorithm
  In our case predictive likelihood yields weights
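The reweight and resample steps of this slide can be illustrated with a generic particle-filter sketch in Python. It is not the authors' code: the predictive likelihoods are taken as given numbers, and the function names and the `threshold` parameter are assumptions for the example.

```python
import math
import random

def reweight(weights, predictive_likelihoods):
    """Multiply each particle weight by p(new data | past state), then renormalize."""
    updated = [w * p for w, p in zip(weights, predictive_likelihoods)]
    total = sum(updated)
    return [u / total for u in updated]

def l2_norm(weights):
    """l2 norm of the weight vector; 1/sqrt(n) when uniform, 1 when degenerate."""
    return math.sqrt(sum(w * w for w in weights))

def maybe_resample(particles, weights, threshold, rng):
    """Resample with replacement only when the weights are too concentrated."""
    if l2_norm(weights) <= threshold:
        return particles, weights
    n = len(particles)
    chosen = rng.choices(particles, weights=weights, k=n)
    return chosen, [1.0 / n] * n

if __name__ == "__main__":
    w = reweight([0.25] * 4, [0.9, 0.05, 0.03, 0.02])
    particles, w = maybe_resample(["p0", "p1", "p2", "p3"], w, 0.7, random.Random(0))
    print(particles, w)
```

After resampling, several entries of `chosen` may refer to the same particle state, so a real system has to copy or share that state safely before the particles diverge again.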
![Page 15: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/15.jpg)
Inheritance Tree
not thread safe
![Page 16: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/16.jpg)
Extended Inheritance Tree
write only in the leaves (per thread)
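The slides do not spell out the data structure, so the following Python sketch only illustrates the general idea of an inheritance tree under stated assumptions: each particle is a leaf, reads fall back to ancestors, and writes touch only the leaf's own dictionary. The class name `InheritanceNode` and its methods are hypothetical.

```python
class InheritanceNode:
    """One node of a tree in which children inherit values from their ancestors."""

    def __init__(self, parent=None):
        self.parent = parent
        self.local = {}  # values written at this node only

    def get(self, key, default=None):
        """Return the nearest value for key, walking from this node up to the root."""
        node = self
        while node is not None:
            if key in node.local:
                return node.local[key]
            node = node.parent
        return default

    def set(self, key, value):
        """Write locally; ancestors and sibling leaves are never modified."""
        self.local[key] = value

    def branch(self):
        """Create a child leaf that starts out sharing all of this node's state."""
        return InheritanceNode(parent=self)

if __name__ == "__main__":
    root = InheritanceNode()
    root.set("story_3_count", 10)
    a, b = root.branch(), root.branch()
    a.set("story_3_count", 11)  # only leaf a sees the update
    print(a.get("story_3_count"), b.get("story_3_count"))
```

Because each leaf is written by a single owner and interior nodes are only read, resampled particles can share their common history without copying it.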
![Page 17: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/17.jpg)
Results
![Page 18: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/18.jpg)
Numbers ...
• TDT5 (Topic Detection and Tracking)
  macro-averaged minimum detection cost: 0.714
  This is the best performance on TDT5!
• Yahoo News data
  ... beats all other clustering algorithms
| time | entities | topics | story words |
|------|----------|--------|-------------|
| 0.84 | 0.90     | 0.86   | 0.75        |
![Page 19: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/19.jpg)
Stories
![Page 20: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/20.jpg)
Related Stories
![Page 21: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/21.jpg)
Issues
![Page 22: Storylines from Streaming Text](https://reader035.vdocument.in/reader035/viewer/2022062723/56813ff4550346895dab1549/html5/thumbnails/22.jpg)
To Do
• Hierarchical story representation (H-RCRP)
  Adams, Jordan & Ghahramani 2010
  • Promotion of stories (corporate process)
  • Recurrent formulation (reweight statistics)
  • Possibly via hierarchical ddCRP metaphor
• Summarization of results
  • Submodular objective
  • Time dependence
• Personalization (explore/exploit)