comparative text mining
Post on 31-Dec-2015
Comparative Text Mining
Q. Mei, C. Liu, H. Su, A. Velivelli, B. Yu, C. Zhai
DAIS: The Database and Information Systems Laboratory at The University of Illinois at Urbana-Champaign
Large Scale Information Management
Cross-Collection Text Mining (I and II)
Temporal Text Mining (I and II)
Spatiotemporal Text Mining (I and II)
IBM Laptop Reviews
APPLE Laptop Reviews
DELL Laptop Reviews

Common Themes | “IBM” specific    | “APPLE” specific   | “DELL” specific
Speed         | Slow, 100-200 MHz | Very Fast, 3-4 GHz | Moderate, 1-2 GHz
Hard disk     | Large, 80-100 GB  | Small, 5-10 GB     | Medium, 20-50 GB
Battery Life  | Long, 4-3 hrs     | Medium, 3-2 hrs    | Short, 2-1 hrs
Many applications involve a comparative analysis of several text collections
Existing work in text mining has focused on a single collection of text and is therefore inadequate for comparative text analysis
We aim to develop methods for comparing multiple collections of text and performing comparative text mining
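[Figure: the cross-collection mixture model. A background model θ_B is shared by all collections. Each of the k themes has a common model (θ_1, ..., θ_k) and one collection-specific model for each collection C_1, ..., C_m (θ_{1,1}, ..., θ_{k,m}). Edge labels: λ_B and 1 − λ_B (background versus themes); π_{d,1}, ..., π_{d,k} (choice of theme in document d); λ_C and 1 − λ_C (common versus collection-specific model). Output: word w.]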
- A mixture model for cross-collection comparative text mining
\[
p_d(w \mid C_i) = (1-\lambda_B)\, p(w \mid \theta_B) + \lambda_B \sum_{j=1}^{k} \pi_{d,j} \left[ \lambda_C\, p(w \mid \theta_j) + (1-\lambda_C)\, p(w \mid \theta_{j,i}) \right]
\]
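As an illustration of the generation process, the sketch below evaluates the probability of a single word in a document of collection C_i as a mixture of a background model, the common theme models, and the collection-specific theme models. It is a minimal sketch under stated assumptions, not the authors' implementation: the function name `word_prob`, the nested-list layout, and the toy numbers are all hypothetical, and the assignment of 1 − λ_B to the background (with λ_B on the themes) follows the equation as read from the poster and should be checked against the KDD 2004 paper.

```python
def word_prob(w, i, pi_d, theta_common, theta_specific, theta_bg,
              lambda_b, lambda_c):
    """Probability of word index w in a document d of collection i.

    Hypothetical layout (not from the poster):
      pi_d[j]                 weight of theme j in document d
      theta_common[j][w]      p(w | common model of theme j)
      theta_specific[j][i][w] p(w | model of theme j specific to collection i)
      theta_bg[w]             p(w | background model)
    """
    themes = 0.0
    for j, weight in enumerate(pi_d):
        # Within theme j, mix the common and the collection-specific model.
        mixed = (lambda_c * theta_common[j][w]
                 + (1.0 - lambda_c) * theta_specific[j][i][w])
        themes += weight * mixed
    # Mix the background with the document's weighted sum over themes
    # (assumed weighting: 1 - lambda_b on the background).
    return (1.0 - lambda_b) * theta_bg[w] + lambda_b * themes


# Toy parameters (made up): k = 2 themes, m = 2 collections, V = 3 words.
theta_bg = [0.5, 0.3, 0.2]
theta_common = [[0.6, 0.3, 0.1],
                [0.1, 0.2, 0.7]]
theta_specific = [[[0.2, 0.5, 0.3], [0.3, 0.3, 0.4]],
                  [[0.4, 0.4, 0.2], [0.1, 0.1, 0.8]]]
pi_d = [0.7, 0.3]

probs = [word_prob(w, 0, pi_d, theta_common, theta_specific, theta_bg,
                   lambda_b=0.9, lambda_c=0.25) for w in range(3)]
print(probs, sum(probs))
```

Because every component is a proper distribution over the vocabulary and the mixing weights sum to one at each level, the resulting word probabilities also sum to one, which is a quick sanity check on any implementation.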
Goal: Extract common themes and specific themes from comparable collections
Applications: Opinion extraction, business intelligence, news summarization, etc.
“Generating” word w in doc d in collection Ci
Sample results (comparing news articles about the Iraq war and the Afghan war)
Reference: C. Zhai, A. Velivelli, and B. Yu. A Cross-Collection Mixture Model for Comparative Text Mining. KDD 2004.
Goal: Extract evolutionary theme patterns from a time-labeled collection
Applications: News summarization, literature analysis, opinion monitoring, etc.
[Figure: Theme evolution graph and threads of the Tsunami data set. Themes shown: Immediate Reports; Statistics of Death and Loss; Personal Experience of Survivors; Statistics of Further Impact; Aid from Local Areas; Aid from the World; Donations from Countries; Specific Events of Aid…; Lessons from Tsunami; Research Inspired.]
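[Figure: schematic of a theme evolution graph. Documents (Doc1, Doc3, ...) are placed along a time axis; the diagram marks theme spans, evolutionary transitions, and a theme evolution thread.]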
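[Figure: “Generating” word w in doc d in the collection. A word w in document d is drawn either from the background model B or from one of the theme models Theme 1, Theme 2, …, Theme k, with document-specific theme weights (subscripts d,1, d,2, …, d,k) and a background weight (B versus 1 − B). Example theme word distributions: warning 0.3, system 0.2, …; aid 0.1, donation 0.05, support 0.02, …; statistics 0.2, loss 0.1, dead 0.05, …. Background: is 0.05, the 0.04, a 0.03, ….]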
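[Figure: Evolutionary transition. Themes labeled A, B, and C are extracted from time spans between t1 and t2 on the time axis T and linked by theme similarity. Example word distributions: microarray 0.2, gene 0.1, protein 0.05; web 0.3, classification 0.1, topic 0.1; information 0.2, topic 0.1, classification 0.1, text 0.05.]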
[Figure: Theme life cycles of KDD abstracts. x-axis: Time (year), 1999 to 2004; y-axis: Normalized Strength of Theme, 0 to 0.02. Themes plotted: Biology Data, Web Information, Time Series, Classification, Association Rule, Clustering, Business. Example theme word distributions: gene 0.0173, expressions 0.0096, probability 0.0081, microarray 0.0038, …; rules 0.0142, association 0.0064, support 0.0053, ….]
[Figure: Theme life cycles from the CNN news data set.]
[Figure: Decoding the collection. The collection is read as a sequence of words (w w w …) and decoded into the states θ1, θ2, θ3, … and the background B; each state has an output probability P(w|θ).]
Reference: Q. Mei and C. Zhai. Discovering Evolutionary Theme Patterns from Text: An Exploration of Temporal Text Mining. KDD 2005.
Goal: model the spatiotemporal theme patterns from a collection of text.
model the mixture of topics: common themes
spatiotemporal content analysis: theme life cycles, theme coverage snapshots
Applications: Weblog mining, search result summarization, opinion tracking, business intelligence, etc.
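[Figure: the spatiotemporal theme model. A word w in document d, written at time t and location l, is generated either from the background model, P(w|B), or from one of the k themes θ_1, …, θ_i, …, θ_k through P(w|θ_i). The theme is chosen from the spatiotemporal context distribution P(θ_i|t,l) or from the document distribution P(θ_i|d). Edge weights: λ_B and 1 − λ_B; λ_TL and 1 − λ_TL.]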
Spatiotemporal model:
Compute theme life cycles:
Compute theme snapshots:
Reference: Q. Mei, C. Liu, H. Su, and C. Zhai. A Probabilistic Approach to Spatiotemporal Theme Pattern Mining on Weblogs. WWW 2006.
Sample results:
Sample results (Weblog data about “Hurricane Katrina”, 5 weeks, U.S.):
Models:
             | Cluster 1                                                          | Cluster 2                                                 | Cluster 3
Common Theme | united 0.042, nations 0.04, …                                      | killed 0.035, month 0.032, deaths 0.023, …                | …
Iraq Theme   | n 0.03, weapons 0.024, inspections 0.023, …                        | troops 0.016, hoon 0.015, sanches 0.012, …                | …
Afghan Theme | northern 0.04, alliance 0.04, kabul 0.03, taleban 0.025, aid 0.02, … | taleban 0.026, rumsfeld 0.02, hotel 0.012, front 0.011, … | …
The common theme indicates that “United Nations” is involved in both wars
Collection-specific themes indicate different roles of “United Nations” in the two wars
The first 2 weeks are mostly about “aid from the world”
The next 2 weeks are mostly about “personal experience”
[Trend labels in the plot: Dropping, Rising]
Week 1: The theme is the strongest along the Gulf of Mexico
Week 2: The discussion moves towards the northern and western states
Week 3: The theme is distributed more uniformly over the states
Week 4: The theme is again strong along the east coast and the Gulf of Mexico
Week 5: The theme fades out in most states
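Mixture for a word w in document d with time stamp t and location l:
\[
p(w : d, t, l) = \lambda_B\, p(w \mid B) + (1-\lambda_B) \sum_{j=1}^{k} p(w, \theta_j \mid d, t, l)
\]
Life cycle of theme θ_j at a given location \tilde{l}:
\[
p(t \mid \theta_j, \tilde{l}) = \frac{p(\theta_j \mid t, \tilde{l})\, p(t, \tilde{l})}{\sum_{t' \in T} p(\theta_j \mid t', \tilde{l})\, p(t', \tilde{l})}
\]
Snapshot of theme θ_j at a given time \tilde{t}:
\[
p(l \mid \theta_j, \tilde{t}) = \frac{p(\theta_j \mid \tilde{t}, l)\, p(\tilde{t}, l)}{\sum_{l' \in L} p(\theta_j \mid \tilde{t}, l')\, p(\tilde{t}, l')}
\]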
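Theme life cycles and theme snapshots are both applications of Bayes' rule to the estimated theme coverage p(θ_j | t, l) and the joint distribution p(t, l): a life cycle fixes the location and normalizes over time, and a snapshot fixes the time and normalizes over locations. The sketch below shows that computation for a single theme. The function names, the nested-list layout, and the toy numbers are hypothetical and are not taken from the poster.

```python
def theme_life_cycle(p_theme_given_tl, p_tl, location):
    """Return p(t | theme, location) for every time index t.

    Hypothetical layout (not from the poster), for one fixed theme:
      p_theme_given_tl[t][l] = p(theme | t, l)
      p_tl[t][l]             = p(t, l)
    """
    joint = [p_theme_given_tl[t][location] * p_tl[t][location]
             for t in range(len(p_tl))]
    total = sum(joint)  # assumed non-zero: the theme occurs at this location
    return [x / total for x in joint]


def theme_snapshot(p_theme_given_tl, p_tl, time):
    """Return p(l | theme, time) for every location index l."""
    joint = [p_theme_given_tl[time][l] * p_tl[time][l]
             for l in range(len(p_tl[time]))]
    total = sum(joint)  # assumed non-zero: the theme occurs at this time
    return [x / total for x in joint]


# Toy inputs (made up): 2 time stamps, 2 locations, uniform p(t, l).
p_theme = [[0.2, 0.5],
           [0.6, 0.1]]
p_joint = [[0.25, 0.25],
           [0.25, 0.25]]

life_cycle = theme_life_cycle(p_theme, p_joint, location=0)
snapshot = theme_snapshot(p_theme, p_joint, time=0)
print(life_cycle, snapshot)
```

Normalizing over time or over locations, rather than over themes, is what makes the curves of different themes comparable in shape: each life cycle and each snapshot is a proper distribution that sums to one.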