Computational Framework for Generating Visual Summaries of
Topical Clusters in Twitter Streams*
Authors: Miray Kas, Bongwon Suh
Presenter: Sebastian Alfers - HTW Berlin
Semantic Modeling
* http://link.springer.com/chapter/10.1007%2F978-3-319-02993-1_9
Visual Summaries of Twitter Streams
http://flowingdata.com/wp-content/uploads/2010/02/treemap-revised1.gif
http://www.infobarrel.com/media/image/54054.jpg
Step 1: get & pre-process Data
- Stream Tweets
- Preprocessing / Cleaning
Step 2: construct graph & clustering
- Construct Graph
- Clustering
Step 3: extract keywords & summarize
- Select Relevant Clusters
- Extract Topical Keywords
- Visual Cluster Summary
Step 1: Preprocessing
• transform Tweets
- easy-to-analyze / clean format
• Process of cleaning:
1. lowercase
2. remove urls, user mentions and stop words
- like @user, „a“ or „123“
3. remove special characters (#,.)
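The three cleaning steps can be sketched in a few lines of Python. This is only an illustration: the stop-word list, the regular expressions and the helper name `clean_tweet` are assumptions, since the slides do not say how the cleaning is implemented.

```python
import re

# Assumption: a tiny illustrative stop-word list; the slides do not name the list used.
STOP_WORDS = {"a", "an", "the", "and", "or", "of", "in", "on", "for", "to", "is", "with"}

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
SPECIAL_RE = re.compile(r"[^a-z0-9\s-]")  # keep letters, digits, whitespace and hyphens


def clean_tweet(text):
    """Return the list of cleaned tokens for one tweet."""
    text = text.lower()               # 1. lowercase
    text = URL_RE.sub(" ", text)      # 2. remove urls ...
    text = MENTION_RE.sub(" ", text)  #    ... and user mentions
    text = SPECIAL_RE.sub(" ", text)  # 3. remove special characters (#,.)
    tokens = [t.strip("-") for t in text.split()]
    # 2. (continued) drop stop words and purely numeric tokens such as "123"
    return [t for t in tokens if t and t not in STOP_WORDS and not t.isdigit()]


tokens = clean_tweet("Akka-HTTP based Reactive Streams with #Scala @user http://t.co/abc 123.")
# tokens == ['akka-http', 'based', 'reactive', 'streams', 'scala']
```

URLs and mentions are removed before the special characters so that "@" and "://" are still available as markers.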
Step 1: Preprocessing
• Example Keywords:
- SCALA, Scala, scala, #scala → scala
• LingPipe Library*
- remove tense and plurals
*http://alias-i.com/lingpipe/
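The idea of collapsing keyword variants can be illustrated with a very crude stand-in. LingPipe is a Java library and its API is not shown on the slides, so the sketch below does not use it: `normalize_keyword` is a hypothetical helper that only lowercases, drops a leading "#" and strips a simple plural "s". It does not handle tense.

```python
def normalize_keyword(word):
    """Collapse spelling variants such as SCALA / Scala / #scala into 'scala'."""
    word = word.lower().lstrip("#")
    # Crude plural handling (assumption, not LingPipe's algorithm):
    # strip a trailing "s" from longer words, but leave "ss" endings alone.
    if len(word) > 3 and word.endswith("s") and not word.endswith("ss"):
        word = word[:-1]
    return word


variants = {normalize_keyword(w) for w in ["SCALA", "Scala", "scala", "#scala"]}
# variants == {'scala'}
```

A real stemmer or lemmatizer would be needed for irregular plurals and verb forms.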
Step 1: Preprocessing
• Example Tweets
- new york time reactive programming tool scala scale techrepublic
- akka-http based reactive stream scala scaladay
Step 2: Graph
• Word Co-Occurrence Graph
- Word = Node (Unigrams)
- Tweet = Link between Nodes
• Example
- Tweet: akka-http based reactive stream scala scaladay
- Nodes: akka-http, based, reactive, stream, scala, scaladay
- Links: between the words that appear together in the tweet
Step 2: Graph
• Co-Occurrence Graph
- connect nodes (words) within and between tweets
- add strength (weight) and cost (distance)
• More frequent words
- increase the strength
- decrease the cost
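A minimal sketch of such a weighted co-occurrence graph, using the two example tweets from Step 1. The function names are hypothetical, and the formula cost = 1 / strength is an assumption; the slides only state that a higher strength means a lower cost.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence_graph(tweets):
    """Map each unordered word pair to its strength (number of tweets containing both)."""
    strength = Counter()
    for tokens in tweets:
        # Each distinct pair of words in a tweet gets (or reinforces) a link.
        for a, b in combinations(sorted(set(tokens)), 2):
            strength[(a, b)] += 1
    return strength


def link_costs(strength):
    """Assumption: cost (distance) is the inverse of the strength."""
    return {pair: 1.0 / weight for pair, weight in strength.items()}


tweets = [
    "new york time reactive programming tool scala scale techrepublic".split(),
    "akka-http based reactive stream scala scaladay".split(),
]
strength = build_cooccurrence_graph(tweets)
cost = link_costs(strength)
```

In this sketch "reactive" and "scala" appear together in both tweets, so their link has strength 2 and cost 0.5; every other link has strength 1 and cost 1.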
Step 2: Clustering
• Here: „complete link (max) clustering“ algorithm*
- hierarchical clustering algorithm that forms clusters by merging subgroups
• Group Words from Tweets
- words that frequently appear together on a topic
- cluster = topic
* http://nlp.stanford.edu/IR-book/html/htmledition/single-link-and-complete-link-clustering-1.html
Step 2: Clustering
• Here: „complete link (max) clustering“ algorithm
• each node starts as individual cluster
- Clusters = Nodes = Words in tweet
• close clusters are successively merged together
- close = judged by the highest cost (distance) between the words of the two clusters
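The merging procedure can be sketched as a naive agglomerative loop. This is an illustration of complete-link clustering in general, not the presenter's implementation; the `cost` dictionary format, the treatment of unlinked words as infinitely distant and the function name are assumptions.

```python
def complete_link_clustering(words, cost, unlinked=float("inf")):
    """Agglomerative complete-link clustering; returns the list of merges."""

    def distance(a, b):
        return cost.get((a, b), cost.get((b, a), unlinked))

    def cluster_distance(c1, c2):
        # Complete link: the distance of two clusters is the highest cost
        # between any pair of their members.
        return max(distance(a, b) for a in c1 for b in c2)

    clusters = [frozenset([w]) for w in words]  # each node starts as its own cluster
    merges = []
    while len(clusters) > 1:
        # Find the closest pair of clusters (ties go to the earliest pair).
        d, i, j = min(
            (cluster_distance(c1, c2), i, j)
            for i, c1 in enumerate(clusters)
            for j, c2 in enumerate(clusters)
            if i < j
        )
        merges.append((clusters[i], clusters[j], d))
        merged = clusters[i] | clusters[j]
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return merges


words = ["reactive", "scala", "based", "stream"]
cost = {
    ("reactive", "scala"): 0.5,
    ("based", "reactive"): 1.0, ("based", "scala"): 1.0, ("based", "stream"): 1.0,
    ("reactive", "stream"): 1.0, ("scala", "stream"): 1.0,
}
merges = complete_link_clustering(words, cost)
```

With these illustrative costs the first merge joins "reactive" and "scala" at cost 0.5, and the remaining merges happen at cost 1. The loop recomputes all pairwise distances on every pass, which is fine for a sketch but too slow for a large vocabulary.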
Step 2: Clustering
[Figure: Graph Representation and Cluster Representation of the words reactive, scala, based, stream, …; links labelled cost = distance = 0.5 and cost = distance = 1]
Step 2: Clustering
• Final step: Dendrogram
- tree diagram
- represents the arrangement of hierarchical clusters
• why?
- easy to apply threshold metrics
Step 2: Clustering
• Final step: Dendrogram
- closer to the root = lower similarity
[Figure: dendrogram below the root; first cluster: reactive, scala; further leaves: new, york, programming, …, akka-http, based, stream, scaladay; thresholds marked across the tree]
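Applying a threshold to a dendrogram amounts to ignoring every merge whose cost lies above it. The sketch below assumes a hypothetical merge format, a list of `(word_a, word_b, cost)` tuples where each word stands for the subtree it belongs to, and uses a small union-find structure to recover the clusters. It relies on merge costs never decreasing toward the root, which holds for complete-link clustering.

```python
def cut_dendrogram(words, merges, threshold):
    """Return the clusters that remain when merges above `threshold` are ignored."""
    parent = {w: w for w in words}

    def find(w):
        while parent[w] != w:
            parent[w] = parent[parent[w]]  # path halving
            w = parent[w]
        return w

    for left, right, cost in merges:
        if cost <= threshold:
            parent[find(left)] = find(right)

    groups = {}
    for w in words:
        groups.setdefault(find(w), set()).add(w)
    return sorted(groups.values(), key=lambda g: sorted(g))


words = ["reactive", "scala", "based", "stream"]
# Illustrative merge history, ordered from the leaves toward the root.
merges = [("reactive", "scala", 0.5), ("based", "stream", 1.0), ("reactive", "based", 1.0)]

tight = cut_dendrogram(words, merges, threshold=0.5)
loose = cut_dendrogram(words, merges, threshold=1.0)
```

A low threshold keeps only the tightly linked pair "reactive" / "scala" together, while a threshold at the root's cost returns a single cluster containing all words.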
Step 3: Extract topical keywords
Preprocessing/ Cleaning
Construct Graph
Extract Topical Keywords
Step 3: Extract topical keywords
• keywords
- express a topic
- frequently used
- summarize tweets content
• Questions
- „What are the relevant keywords?“
- „In what clusters do they appear?“
Step 3: Extract topical keywords
• How?
- „topical tweets“ vs. „general tweets“
• frequently in topical tweets
- search keywords „reactive scala“
• not frequently in general tweets
- general twitter stream (all tweets)
Step 3: Extract topical keywords
• Strength of a word
- is a word relevant for that topical cluster?
[Figure: 2×2 grid of word frequency (Low Frequency / High Frequency) in Topical Tweets vs. General Tweets; ✔ = relevant for topic / cluster, i.e. High Frequency in Topical Tweets and Low Frequency in General Tweets]
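One way to turn this comparison into a number is a smoothed frequency ratio. The slides do not give the actual formula, so the scoring below is purely an assumption for illustration: the share of topical tweets containing a word, divided by the (add-one smoothed) share of general tweets containing it. The function name and the sample tweets are hypothetical.

```python
from collections import Counter


def topical_strength(topical_tweets, general_tweets):
    """Score each word by how much more frequent it is in topical than in general tweets."""
    topical = Counter(w for tokens in topical_tweets for w in set(tokens))
    general = Counter(w for tokens in general_tweets for w in set(tokens))
    scores = {}
    for word, count in topical.items():
        topical_freq = count / len(topical_tweets)
        # Add-one smoothing so that words unseen in the general stream do not divide by zero.
        general_freq = (general[word] + 1) / (len(general_tweets) + 1)
        scores[word] = topical_freq / general_freq
    return scores


topical_tweets = [["reactive", "scala", "new"], ["reactive", "scala", "stream"]]
general_tweets = [["new", "york"], ["new", "music"], ["scala", "opera"], ["weather"]]
scores = topical_strength(topical_tweets, general_tweets)
```

In this toy data "reactive" scores highest because it is in every topical tweet and absent from the general stream, while "new" scores lowest because it is common in the general stream.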
Step 3: Extract topical keywords
• Result
- topical strength for each keyword
- sort them by relevancy
- select top 20 keywords
• choose clusters that contain these words
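The final selection can be sketched as a sort followed by a filter. The function name, the score values and the sample clusters are hypothetical; only the "top 20" cut-off and the rule "keep clusters that contain these words" come from the slide.

```python
def select_relevant_clusters(scores, clusters, top_k=20):
    """Pick the top_k keywords by topical strength and the clusters containing any of them."""
    ranked = sorted(scores, key=lambda w: (-scores[w], w))  # ties broken alphabetically
    keywords = ranked[:top_k]
    chosen = [c for c in clusters if any(k in c for k in keywords)]
    return keywords, chosen


scores = {"reactive": 5.0, "scala": 2.5, "stream": 2.5, "new": 0.8}
clusters = [{"reactive", "scala"}, {"new", "york"}, {"stream", "based"}]

# top_k=3 only to keep the toy example small; the slide uses the top 20.
keywords, chosen = select_relevant_clusters(scores, clusters, top_k=3)
```

Here the cluster containing "new" and "york" is dropped because neither word is among the selected keywords.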