Topic cluster of Streaming Tweets based on GPU-Accelerated Self Organizing Map
Group 15: Chen Zhutian, Huang Hengguang
Outline
Background
Pipeline and Technique
Conclusion
Background
What is happening in the tweet stream?
• Unsupervised clustering algorithm.
• Organize large document collections according to textual similarities.
• Create visible result for searching and exploring large document collections.
WEBSOM system
• Based on the Self-Organizing Map.
• Generates a topic map for documents.
• Explore large document collections just like exploring Google Maps.
What does WEBSOM look like?
Gap
• WEBSOM – Long document, static, long training time.
• Twitter – Short text, dynamic, streaming data
• How to adapt SOM to streaming Twitter data?
What our system looks like
Pipeline and Technique
Pipeline
Detect Event
Build Dictionary
Vectorize Tweets
Reduce Dimension
SOM Cluster
Show the SOM map
Detect Event
• Only focus on unusual events.
• How to identify abnormal events on Twitter?
[Figure: Tweets Stream → Events]
1. Similar to TCP’s congestion control mechanism.
2. Count the number of tweets in a moving window.
3. Weighted moving average and variance.
4. Threshold to determine whether it’s an event.
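The four steps above can be sketched as follows. This is only an illustration, not the presenters' implementation: the function names, the smoothing factors `alpha` and `beta`, and the multiplier `k` are assumptions borrowed from TCP's retransmission-timeout estimator, and fixed consecutive windows stand in for the moving window.

```python
def count_per_window(timestamps, window):
    """Count tweets in consecutive windows of `window` seconds (hypothetical helper)."""
    if not timestamps:
        return []
    start = min(timestamps)
    n_windows = int((max(timestamps) - start) // window) + 1
    counts = [0] * n_windows
    for t in timestamps:
        counts[int((t - start) // window)] += 1
    return counts


def detect_events(counts, alpha=0.125, beta=0.25, k=4.0):
    """Return indices of windows whose count exceeds mean + k * deviation.

    The running mean and deviation are exponentially weighted, in the same
    way TCP smooths its round-trip-time estimate (assumed parameters).
    """
    events = []
    mean = None
    dev = 0.0
    for i, c in enumerate(counts):
        if mean is None:
            # Initialise from the first window, as TCP does with its first sample.
            mean = float(c)
            dev = c / 2.0
            continue
        if c > mean + k * dev:
            events.append(i)
        # Update the deviation first (using the old mean), then the mean.
        dev = (1 - beta) * dev + beta * abs(c - mean)
        mean = (1 - alpha) * mean + alpha * c
    return events
```

Because the deviation starts large and shrinks as the stream stabilises, the first few windows are effectively a warm-up period in which no event is flagged.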
Detect Event
FIFA2014, Brazil 1:7 Germany
• Track 823 keywords, such as “FIFA”, “Ger”, “Brazil”, “#WordCup”…
• In 110 minutes.
• 100 million tweets.
• Sample 1%
Test Data
[Figure: detected peaks in the tweet stream, annotated “Goal!”, “Goal! x 3”, “Goal!”]
Time of Peak    What happened?
4:11            First goal
4:25            3 goals in 3 minutes
4:30            Goal
5:07            Second half begins
5:25            Goal
5:35            Goal
5:46            Goal
5:50            End
Build Dictionary
1. Remove stop words.
2. Stemming (Snowball stemmer).
3. Remove words whose occurrence is less than 10%.
4. Remove words whose occurrence is greater than 50%.
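A minimal sketch of these dictionary-building steps, under several assumptions: "occurrence" is read as document frequency (the share of tweets containing the word), the stop-word list is a toy one, and the stemmer is passed in as a function (a real Snowball stemmer, for example from NLTK, could be supplied there). All names are hypothetical.

```python
import re
from collections import Counter

# Toy stop-word list for illustration only; a real list would be much longer.
STOP_WORDS = {"the", "a", "an", "is", "in", "of", "to", "and", "for", "on"}


def build_dictionary(tweets, stem=lambda w: w, min_df=0.10, max_df=0.50):
    """Map each kept term to a column index.

    A term is kept when the fraction of tweets containing it lies in
    [min_df, max_df]. `stem` defaults to the identity function.
    """
    n_docs = len(tweets)
    doc_freq = Counter()
    for text in tweets:
        tokens = re.findall(r"[a-z0-9#@_]+", text.lower())
        # Use a set so each tweet counts a term at most once.
        terms = {stem(t) for t in tokens if t not in STOP_WORDS}
        doc_freq.update(terms)
    kept = sorted(t for t, c in doc_freq.items()
                  if min_df <= c / n_docs <= max_df)
    return {term: idx for idx, term in enumerate(kept)}
```

Dropping both very rare and very common terms keeps the dictionary small, which matters because every remaining term becomes one dimension of the tweet vectors.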
Vectorize Tweets
1. Vector space model
2. TF-IDF
3. Normalization
V_i = (0.4123, 0.12312, 0.344, …)
10,000 tweets × 10,000 dimensions
More than 1 hour for convergence
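The vectorization steps can be illustrated with a small sketch. The exact TF-IDF weighting used in the project is not stated, so the variant below (raw term count times `log(N / df) + 1`, followed by L2 normalization) is an assumption, as are the function and parameter names.

```python
import math
from collections import Counter


def tfidf_vectors(docs, vocab):
    """Turn tokenised tweets into unit-length TF-IDF vectors.

    docs  : list of token lists
    vocab : dict mapping term -> column index (terms outside it are ignored)
    """
    n_docs = len(docs)
    doc_freq = Counter()
    for tokens in docs:
        doc_freq.update({t for t in tokens if t in vocab})

    vectors = []
    for tokens in docs:
        term_freq = Counter(t for t in tokens if t in vocab)
        vec = [0.0] * len(vocab)
        for term, count in term_freq.items():
            # Assumed IDF variant; the +1 keeps weights positive for common terms.
            idf = math.log(n_docs / doc_freq[term]) + 1.0
            vec[vocab[term]] = count * idf
        norm = math.sqrt(sum(x * x for x in vec))
        if norm > 0:
            vec = [x / norm for x in vec]  # L2 normalization
        vectors.append(vec)
    return vectors
```

Each tweet becomes one row with as many columns as dictionary terms, which is how a 10,000 × 10,000 matrix arises and why dimension reduction is needed before training.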
Reduce Dimension
Random Projection
1. No training.
2. Matrix operation.
Based on the Johnson-Lindenstrauss lemma.
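To make the "no training, just a matrix operation" point concrete, here is a minimal sketch of Gaussian random projection. The slides do not say which random matrix was used, so the Gaussian entries scaled by 1/sqrt(k), the target dimension `k`, and all names are assumptions.

```python
import math
import random


def random_projection_matrix(d, k, seed=0):
    """Build a d x k matrix with N(0, 1/k) entries; nothing is learned from data."""
    rng = random.Random(seed)
    scale = 1.0 / math.sqrt(k)
    return [[rng.gauss(0.0, 1.0) * scale for _ in range(k)] for _ in range(d)]


def project(vectors, matrix):
    """Multiply each d-dimensional vector by the d x k projection matrix."""
    k = len(matrix[0])
    projected = []
    for v in vectors:
        y = [0.0] * k
        for i, x in enumerate(v):
            if x != 0.0:  # tweet vectors are sparse, so skip zero entries
                row = matrix[i]
                for j in range(k):
                    y[j] += x * row[j]
        projected.append(y)
    return projected
```

With this scaling, pairwise distances are preserved in expectation, which is the property the Johnson-Lindenstrauss lemma bounds; the distortion for any particular pair shrinks as `k` grows.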
SOM Cluster
What is SOM? Self-Organizing Map.
• Artificial neural network
• Unsupervised learning
• Iteration based
• Visible result
SOM Cluster
• Sequential SOM
• Batch-type SOM: faster, effective
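The batch variant can be sketched as below. Unlike the sequential SOM, which updates the map after every single input, the batch SOM first finds the best-matching unit for all inputs and then recomputes every node as a neighborhood-weighted mean. This is a plain-Python illustration, not the presenters' GPU code; the Gaussian neighborhood, the shrinking `sigma` schedule, the evenly spaced initialisation, and all names are assumptions.

```python
import math


def best_matching_unit(weights, x):
    """Index of the node whose weight vector is closest to x."""
    best, best_dist = 0, float("inf")
    for j, w in enumerate(weights):
        dist = sum((a - b) ** 2 for a, b in zip(w, x))
        if dist < best_dist:
            best, best_dist = j, dist
    return best


def train_batch_som(data, rows, cols, epochs=10, sigma_end=0.5):
    """Train a rows x cols map; returns one weight vector per node (row-major)."""
    dim = len(data[0])
    grid = [(r, c) for r in range(rows) for c in range(cols)]
    # Simplistic deterministic init: pick evenly spaced data points.
    weights = [list(data[(j * len(data)) // len(grid)]) for j in range(len(grid))]
    sigma0 = max(rows, cols) / 2.0
    for epoch in range(epochs):
        # Shrink the neighborhood radius geometrically from sigma0 to sigma_end.
        frac = epoch / max(epochs - 1, 1)
        sigma = sigma0 * (sigma_end / sigma0) ** frac
        # Step 1: assign every input to its best-matching unit.
        bmus = [best_matching_unit(weights, x) for x in data]
        # Step 2: recompute every node as a neighborhood-weighted mean.
        new_weights = []
        for j, (rj, cj) in enumerate(grid):
            num = [0.0] * dim
            den = 0.0
            for x, b in zip(data, bmus):
                rb, cb = grid[b]
                h = math.exp(-((rj - rb) ** 2 + (cj - cb) ** 2) / (2.0 * sigma * sigma))
                den += h
                for d in range(dim):
                    num[d] += h * x[d]
            new_weights.append([v / den for v in num] if den > 0 else weights[j])
        weights = new_weights
    return weights
```

Both steps are independent across inputs and across nodes within an epoch, which is what makes the batch form a natural fit for parallel hardware.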
SOM Cluster
Random Projection + Batch SOM + CUDA
1 Hour → 1 Second
20 newsgroups
• 20,000 documents.
• 20 different newsgroups.
• Each document belongs to only 1 group.
Test Data
http://web.ist.utl.pt/acardoso/datasets/.
Train 60% vs Test 40%
Method          Random Projection    Macro Accuracy (%)    Micro Accuracy (%)
Renato’s SOM    No                   68                    67
Our Method      Yes                  60                    61
Conclusion: Random projection loses precision, so accuracy decreases after dimension reduction.
20 Newsgroup Test
Method                                Random Projection    Macro Accuracy (%)    Micro Accuracy (%)
Renato’s SOM                          No                   68                    67
Our Method                            Yes                  60                    61
Renato’s SOM (reproduced in MATLAB)   No                   63                    62
Renato’s SOM (reproduced in MATLAB)   Yes                  61                    60
We used the SOM Toolbox to fully reproduce Renato’s experiment.
20 Newsgroup Test
FIFA Data
Conclusion
• 2 algorithms
• 3 sets of experiments
• 1 prototype system
• 1 case study
Thanks for Watching
Q & A