Mining Named Entities with Temporally Correlated Bursts from Multilingual Web News Streams
Alexander Kotov, ChengXiang Zhai, Richard Sproat
University of Illinois at Urbana-Champaign
Roadmap
Problem definition
Previous work
Approach
Experiments
Summary
Motivation
Web data is generated by a large number of textual streams (news, blogs, tweets, etc.)
Bursts of entity mentions (people, locations) correspond to a particular event
Bursts of entity mentions are influenced by bursts of other entities
Intuition: bursts of semantically related entities should be temporally correlated
Problem definition
[Figure: mention counts over time (from t_0 to t_T) for entity 1 and entity 2. The two streams differ in sparsity and magnitude and are offset by a time lag. Question posed: entity 1 = entity 2?]
Temporally correlated bursts
Problem: given a collection of textual streams, discover named entities with correlated bursts
Provide multilingual summaries of real-life events
Estimate the social impact of a particular event in different countries
Differentiate between local and global events
Discover transliterations of named entities
Previous work
Burst detection:
infinite-state automaton (Kleinberg '02)
factorial HMMs (Krause '06)
wavelet transformation (Zhu '03)
Stream correlation:
distance-based measures: Pearson coefficient (Chien '05)
singular spectrum transformation (Ide '05)
topic-based (PLSA, LDA) (Wang '09)
Previous work
Smoothing is efficient for large amounts of data, but not precise
Existing methods do not abstract away from the raw data
Distance-based measures suffer from magnitude and sparsity problems
Temporal lags are not considered
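To make the temporal-lag limitation concrete, here is a minimal sketch in plain Python. The counts are invented for illustration and do not come from the dataset in these slides; `pearson` is a hand-written helper, not code from the authors. It compares a burst with an identical copy of itself shifted by one day.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)

# Hypothetical daily mention counts containing a single burst.
stream_a = [0, 0, 9, 30, 8, 0, 0, 0, 0, 0]
# The same burst, reported one day later (for example, in another language).
stream_b = [0] + stream_a[:-1]

same_day = pearson(stream_a, stream_a)  # identical streams
lagged = pearson(stream_a, stream_b)    # identical shape, one-day lag

print(f"same day: {same_day:.2f}, one-day lag: {lagged:.2f}")
```

With these made-up numbers the coefficient drops from 1.0 to roughly 0.35 even though the two bursts have exactly the same shape, which is the kind of sensitivity a lag-tolerant alignment is meant to avoid.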
Approach
Difference in magnitude: normalization with a Markov-Modulated Poisson Process (MMPP)
Temporal lag: flexible alignment of bursts using dynamic programming
Markov-Modulated Poisson Process
• Ergodic Markov chain over a finite number of states
• Each state is associated with a Poisson distribution
• "Burstiness" of a state is represented by the intensity parameter of its Poisson distribution
• States are labeled by the rank of the intensity parameter
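The bullets above can be sketched as a small Viterbi decoder that maps raw mention counts to state ranks. This is only an illustration under stated assumptions: the slides do not say how the model is fitted or decoded, so the Poisson rates, the uniform initial distribution, and the `stay_prob` self-transition probability below are hypothetical values chosen by hand, and `viterbi_states` is an invented helper name.

```python
from math import lgamma, log

def poisson_logpmf(k, rate):
    """Log-probability of observing count k under a Poisson with the given rate."""
    return k * log(rate) - rate - lgamma(k + 1)

def viterbi_states(counts, rates, stay_prob=0.8):
    """Most likely state sequence for `counts`.

    `rates` must be sorted ascending and contain at least two positive values,
    so the returned 1-based state index is the rank of the intensity parameter.
    Assumption: equal switching probability between distinct states.
    """
    n = len(rates)
    switch = (1.0 - stay_prob) / (n - 1)
    log_trans = [[log(stay_prob if i == j else switch) for j in range(n)]
                 for i in range(n)]
    # Uniform initial distribution over states (an assumption).
    score = [log(1.0 / n) + poisson_logpmf(counts[0], r) for r in rates]
    back = []
    for k in counts[1:]:
        prev = score
        pointers, score = [], []
        for j in range(n):
            best = max(range(n), key=lambda i: prev[i] + log_trans[i][j])
            pointers.append(best)
            score.append(prev[best] + log_trans[best][j] + poisson_logpmf(k, rates[j]))
        back.append(pointers)
    # Trace the best path backwards from the best final state.
    state = max(range(n), key=lambda j: score[j])
    path = [state]
    for pointers in reversed(back):
        state = pointers[state]
        path.append(state)
    path.reverse()
    return [s + 1 for s in path]

# Hypothetical mention counts and hand-picked rates for three states.
counts = [1, 0, 2, 1, 30, 28, 35, 2, 1, 0]
states = viterbi_states(counts, rates=[1.0, 8.0, 30.0])
print(states)
```

Because the output is a sequence of ranks rather than raw counts, a burst of 30 mentions in one stream and a burst of 300 in another could both map to the top state, provided each stream has its own rates. That is the sense in which the state sequence normalizes magnitude.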
Normalization
[Figure: two streams of mention counts over time, each shown with the sequence of MMPP states (1 to 3) assigned to it.]
Normalization
• MMPP consistently outperforms the baseline
• The optimal performance is achieved when the number of states is 3
Burst alignment
Input:
- a pair of normalized mention-count (MC) streams of equal length;
- a threshold for "bursty" states;
- a reward constant;
- a penalty function.
Output: a table
Burst alignment
[Figure: three panels labeled perfect alignment, exponential penalty, and logarithmic penalty.]
Burst alignment
• A quadratic penalty function in combination with a reward constant of 2 is optimal
• The maximum permitted temporal gap is 1 day
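One way such an alignment could be set up is sketched below. The recurrence is not given in these slides, so everything here is an assumption made for illustration: `align_bursts` is a hypothetical function, a state counts as "bursty" when it is at or above `threshold`, and two bursty positions may be matched only if they are at most `max_gap` steps apart, earning the reward minus the penalty of the gap. The defaults mirror the settings reported above (quadratic penalty, reward 2, gap of 1).

```python
def align_bursts(x, y, threshold=2, reward=2.0,
                 penalty=lambda gap: float(gap * gap), max_gap=1):
    """Fill a dynamic-programming table of alignment scores.

    table[i][j] is the best score for aligning the first i states of x
    with the first j states of y. Hypothetical scoring, not the authors'.
    """
    n, m = len(x), len(y)
    table = [[0.0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            # Skipping a position in either stream costs nothing.
            best = max(table[i - 1][j], table[i][j - 1])
            gap = abs(i - j)
            both_bursty = x[i - 1] >= threshold and y[j - 1] >= threshold
            if both_bursty and gap <= max_gap:
                # Match two bursty states, discounted by their temporal gap.
                best = max(best, table[i - 1][j - 1] + reward - penalty(gap))
            table[i][j] = best
    return table

# Hypothetical normalized streams (MMPP state ranks).
a = [1, 1, 3, 3, 1, 1]
b = [1, 1, 1, 3, 3, 1]   # same burst, one step later
c = [3, 1, 1, 1, 1, 1]   # burst too far away to align

score_same = align_bursts(a, a)[-1][-1]
score_lag = align_bursts(a, b)[-1][-1]
score_far = align_bursts(a, c)[-1][-1]
print(score_same, score_lag, score_far)
```

Under this scoring a one-step lag lowers the score without erasing it, while bursts further apart than `max_gap` contribute nothing. With a quadratic penalty and a reward of 2, a gap of 2 would already score negatively, so in this sketch a larger `max_gap` would not change the result.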
Dataset
News data crawled from RSS feeds over 4 months
Basic named entity recognition
Basic stemming
Correlated Bursts
Real-life events:
Pattern 1: World Economic Forum in Davos, Switzerland, and death of actor Heath Ledger
Pattern 2: death of Bobby Fischer
Pattern 3: assassination of Benazir Bhutto
Pattern 4: major trading loss incident at a French bank and death of George Habash
Mining transliterations
Static aligned corpora:
+ identical or semantically related contents
+ temporal topical alignment
- limited coverage
Web:
+ covers almost any domain
- difference in burst magnitude
- temporal lag between bursts
Transliteration
• MMPP+DP outperforms one baseline (CS) in all entropy categories and the other baseline (PC) for low- and medium-entropy (more "bursty") entities
• The combination MMPP+DP performs better than MMPP alone
Summary
Novel multi-stream text mining problem
Our approach can effectively discover correlated bursts corresponding to major and minor real-life events
Effective for unsupervised discovery of transliterations
The method is data-independent and not limited to the textual domain
Contributions
First method to use MMPP for burst detection in textual streams
Algorithm for temporally flexible stream correlation based on bursts
Unsupervised method for language-independent transliteration without any linguistic knowledge
Future work
Applying the proposed method to non-textual data (e.g., sensor streams)
Burst correlations between entities across different types of Web 2.0 data (news and tweets, news and blogs, news and tags, etc.)