“Twitsum” : Automatic generation of event summaries using microblog streams
P.K.K. Madhawa (2012MCS044)
Motivation - The problem with Twitter search
● Twitter ranks tweets based on user interaction with them (number of retweets and favorites)
● Top results for the query ‘Ebola’ (25th November 2014)
● How to distinguish newsworthy tweets drowned in a sea of noise
Goal
● Distinguish newsworthy tweets based on syntactic features, without depending on manual annotations
● Group tweets discussing similar content together
Contributions
● A heuristic-based scheme for annotating tweets as subjective/objective
● A classifier capable of detecting objective tweets using only the syntactic information of tweets
● An entity-centric tweet clustering algorithm
Twitter summarization - Earlier approaches
Sub-event detection based methods
● Use of a Hidden Markov Model to detect sub-events during an American football match (D. Chakrabarti and K. Punera, 2011)
● Sub-event detection by identifying outlier peaks in the temporal distribution of tweets on a topic (Zubiaga et al., 2012)
Clustering based approaches
● A support platform for event detection using social intelligence (T. Baldwin, P. Cook and B. Han, 2012)
○ Tweets are filtered using manually selected keywords
Design
● Tweet storage - stores the set of tweets downloaded using streaming API
● Classifier - selection of objective tweets
● Summarizer - removes duplicates and clusters the tweets based on their similarity
Design - Objectivity detection
● Tweets are periodically downloaded by querying the public timeline using Streaming API
● Structure of a tweet object:
tweet text, user name, created time, geo location, language code, favorite count, retweeted_status, retweet count
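The fields listed above could be held in a small record type. This is only an illustrative sketch: the attribute names follow the wording of the slide rather than any particular API response, and the `is_retweet` helper is a hypothetical convenience that is not described in the original.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Tweet:
    """Illustrative container for the tweet fields named on the slide."""
    text: str
    user_name: str
    created_time: str
    language_code: str
    favorite_count: int = 0
    retweet_count: int = 0
    geo_location: Optional[tuple] = None          # assumed (lat, lon) pair
    retweeted_status: Optional["Tweet"] = None    # original tweet, if this is a retweet

    def is_retweet(self) -> bool:
        # Hypothetical helper: a tweet carrying retweeted_status is a retweet
        return self.retweeted_status is not None
```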
Data collection
● Training data annotated using a heuristic measure
● Objective - If the tweet is generated by a verified profile
● Subjective - Tweets containing at least a single emoticon or an emoji character
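The annotation heuristic above can be sketched roughly as follows. This is a minimal illustration, not the author's code: the dictionary keys `text` and `verified`, the small emoticon pattern and the emoji code-point ranges are all assumptions. The slides do not say how a verified profile that uses an emoticon is handled, so the sketch arbitrarily checks the emoticon rule first and leaves tweets matching neither rule unlabeled.

```python
import re

# Assumed, deliberately small emoticon pattern such as :) ;-( =D
EMOTICON_RE = re.compile(r"(?:^|\s)[:;=][\-']?[)(\]\[DPp]")
# Assumed emoji code-point ranges; real emoji coverage is wider
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")


def heuristic_label(tweet):
    """Return 'subjective', 'objective', or None when neither rule applies."""
    text = tweet.get("text", "")
    if EMOTICON_RE.search(text) or EMOJI_RE.search(text):
        return "subjective"
    if tweet.get("verified", False):
        return "objective"
    return None  # not usable as training data under this heuristic
```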
Preprocessing
● All emoticons and emoji characters are removed from the corpus
● User mentions are replaced with the tag ‘MENTION’ (eg: “@john said this” converts to “MENTION said this”)
● Punctuation symbols, including the pound (#) character, are removed
● URLs are replaced with the tag ‘URL’ (eg: http://t.co/12d3 converts to URL)
● Numbers in a tweet are replaced by the tag ‘NUMERIC’
● Stop words are removed
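The preprocessing steps above might be chained as in the sketch below. The regular expressions, the tiny stop-word list and the function name `preprocess` are assumptions made for illustration. The order also differs from the bullet order: URLs and mentions are tagged before punctuation is stripped, since otherwise the `@` and `://` markers would already be gone.

```python
import re

STOP_WORDS = {"a", "an", "the", "is", "to", "of", "and", "in"}  # assumed tiny list

URL_RE = re.compile(r"https?://\S+")
MENTION_RE = re.compile(r"@\w+")
NUMBER_RE = re.compile(r"\b\d+(?:[.,]\d+)*\b")
EMOJI_RE = re.compile("[\U0001F300-\U0001FAFF\u2600-\u27BF]")   # assumed ranges
EMOTICON_RE = re.compile(r"(?:^|\s)[:;=][\-']?[)(\]\[DPp](?=\s|$)")
PUNCT_RE = re.compile(r"[^\w\s]")


def preprocess(text):
    """Return the cleaned token list for one tweet."""
    text = URL_RE.sub("URL", text)          # tag URLs first, while '://' is intact
    text = MENTION_RE.sub("MENTION", text)  # tag @-mentions
    text = EMOJI_RE.sub(" ", text)          # drop emoji characters
    text = EMOTICON_RE.sub(" ", text)       # drop ASCII emoticons
    text = NUMBER_RE.sub("NUMERIC", text)   # tag stand-alone numbers
    text = PUNCT_RE.sub(" ", text)          # strip punctuation, including '#'
    return [t for t in text.split() if t.lower() not in STOP_WORDS]


print(preprocess("@john said this"))
```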
Feature extraction
● Tweets are tokenized using the TweetNLP tokenizer (K. Gimpel, N. Schneider, and B. O’Connor, 2011)
● Words are stemmed using the Porter stemmer
● Stemmed unigrams and bigrams are converted to binary Tf-Idf values (with Laplace smoothing)
● Binary feature - presence of slang words (using an external gazetteer)
● Binary feature - presence of bad words
● Unigrams, bigrams and trigrams of POS tags as binary Tf-Idf values
● Average number of misspelled words
● Average number of all-capital words
● Average number of hashtags
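A reduced version of this feature set can be sketched without external tools. The sketch covers only the n-gram presence, gazetteer and surface-form features. Stemming, POS tagging, spell checking and the Tf-Idf weighting are omitted because they need external resources. The feature names, the per-token reading of "average number of", and the `slang_words` / `bad_words` parameters are assumptions.

```python
def ngrams(tokens, n):
    """Return all contiguous n-grams of a token list as space-joined strings."""
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def extract_features(raw_text, tokens, slang_words=frozenset(), bad_words=frozenset()):
    """Build a sparse feature dict from the raw tweet and its cleaned tokens."""
    lowered = [t.lower() for t in tokens]
    features = {}
    # Binary presence of unigrams and bigrams (Tf-Idf weighting not applied here)
    for gram in ngrams(lowered, 1) + ngrams(lowered, 2):
        features["ngram=" + gram] = 1
    # Binary gazetteer features
    features["has_slang"] = int(any(t in slang_words for t in lowered))
    features["has_bad_word"] = int(any(t in bad_words for t in lowered))
    # Surface features are computed on the raw text, before '#' is stripped
    raw_tokens = raw_text.split()
    n = max(len(raw_tokens), 1)
    features["allcaps_ratio"] = sum(
        t.isalpha() and t.isupper() and len(t) > 1 for t in raw_tokens) / n
    features["hashtag_ratio"] = sum(
        t.startswith("#") and len(t) > 1 for t in raw_tokens) / n
    return features
```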
Classifier selection
● A dataset of 6,000 tweets on Ebola (3,000 tweets from each class) is used to benchmark three classifiers
○ Support Vector Machines
○ Logistic Regression
○ Naive Bayes
● Classifiers are trained on a random sample of 4,800 tweets; the remaining 1,200 are used as the test set
● Classifier parameters are found using 10-fold cross validation
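The evaluation protocol above (a random 4,800 / 1,200 split, then 10-fold cross validation) could be set up as sketched here. The slides do not state whether cross validation was run on the 4,800 training tweets only, nor whether the folds were stratified, so both choices below are assumptions, as are the function names and the fixed seed.

```python
import random


def train_test_split(items, n_train, seed=0):
    """Shuffle a copy of items and cut it into train and test parts."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    return shuffled[:n_train], shuffled[n_train:]


def k_fold_indices(n_items, k=10):
    """Yield (train_indices, validation_indices) for each of k folds."""
    folds = [list(range(i, n_items, k)) for i in range(k)]
    for i in range(k):
        held_out = folds[i]
        rest = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield rest, held_out


train, test = train_test_split(range(6000), n_train=4800)
for fit_idx, val_idx in k_fold_indices(len(train), k=10):
    pass  # fit a candidate parameter setting on fit_idx, score it on val_idx
```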
Classifier performance
● SVM was selected because it had higher recall than Logistic Regression
● A higher recall results in a larger fraction of newsworthy tweets being detected
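For reference, recall and the related scores used to compare the classifiers can be computed as below. This is a generic sketch with the objective class treated as the positive class. The function name and label strings are assumptions, and the example labels in the usage line are made up purely to exercise the function.

```python
def precision_recall_f1(y_true, y_pred, positive="objective"):
    """Compute precision, recall and F1 for the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


# Made-up labels, only to show the call
gold = ["objective", "objective", "objective", "subjective"]
pred = ["objective", "objective", "subjective", "objective"]
print(precision_recall_f1(gold, pred))
```

Recall is the fraction of truly objective tweets that the classifier recovers, which is why it is the deciding criterion here.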
Contribution from features
● Measured using an ablation test
● Features divided into three sets
WRD - unigrams and bigrams
LEX - all other lexical features
Selection of the POS-tagger
● NLTK POS tagger
● Stanford tagger with GATE twitter model (L. Derczynski et al., 2013)
● SENNA tagger (Ronan Collobert, 2011) - “deep” recurrent convolutional neural network based discriminant parser
Eg: "Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
NLTK tagger:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'NNP'), ('Cured', 'NNP'), ('Dr', 'NNP'),
 ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'NNP'), ('Be', 'NNP'), ('Released', 'NNP'), ('\u2026', 'NNP'), ('|', 'NNP'),
 ('news', 'NN')]
Selection of the POS tagger...
"Last US Ebola Patient Is Cured: Dr. Craig Spencer To Be Released… http://t.co/92JfMm2LaN | http://t.co/NoFij4iACl #news"
SENNA tagger:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NNP'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'),
 ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', 'JJ'), ('|', 'NN'),
 ('news', 'NN')]
Stanford tagger with GATE twitter model:
[('Last', 'JJ'), ('US', 'NNP'), ('Ebola', 'NNP'), ('Patient', 'NN'), ('Is', 'VBZ'), ('Cured', 'VBN'), ('Dr', 'NNP'),
 ('Craig', 'NNP'), ('Spencer', 'NNP'), ('To', 'TO'), ('Be', 'VB'), ('Released', 'VBN'), ('\u2026', '.'), ('|', ':'),
 ('news', 'NN')]
Results
Data sets
● 1 million tweets containing the term ‘Ebola’
● 22,250 tweets related to the fifth Sri Lanka vs India ODI cricket match held on 16th November (objective- 465, subjective- 878)
○ Filtered using terms “SLvIND”, “SLvsIND”, “INDvSL” and “INDvsSL”.
● 6,800 tweets related to the fourth Sri Lanka vs England ODI cricket match held on 7th December (objective- 215, subjective- 242)
○ Filtered using terms “SLvENG”, “SLvsENG”, “ENGvSL” and “ENGvsSL”.
Gold standard data set
● A sample of 500 tweets on the topic ‘ebola’ was annotated manually as objective or subjective (objective- 206, subjective- 294)
● Classifier scores on this data
● Errors:
“RT @TheDailyEdge: UPDATE: Obama has reduced the US deficit by 70% and Ebola cases in the US by 100%.”
It is hard to judge the objectivity of such sentences based only on syntactic information.
Comparison with prior research
● Event related tweets detection with user type recognition (L. Silva, E. Riloff, 2013)
○ A set of 6,000 tweets on disease outbreaks manually labeled using Amazon Mechanical Turk
● Twitter Sentiment Classification using Distant Supervision (A. Go, R. Bhayani and L. Huang, 2013)
○ An SVM model trained on syntactic features used for sentiment classification
Classifier                      Precision   Recall   F1-score
User type agnostic classifier   83.15       55.99    66.92
User type specific classifier   80.35       66.07    72.15
Features                        Accuracy
Unigram + Bigram                81.6
Unigram + POS                   81.9
Cross-domain applicability
● The classifier trained on Ebola tweets was applied to cricket related tweets
● The classifier trained on SLvIndia match tweets performed well on SLvEngland tweets
Summarizer
● Duplicate and near-duplicate tweets are abundant due to retweets and tweets generated by ‘Tweet’ buttons on news sites
● Removes duplicates in the objective tweets detected by the classifier
● Tweets discussing the same entities are clustered together
Near-duplicate removal
● Objective tweets are stripped of the following: ‘RT’ markers, @-mentions and punctuation
● Jaccard similarity of tokens is used to detect duplicate tweets
● Two tweets are considered similar if their Jaccard similarity is greater than a threshold d
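The near-duplicate check described above can be sketched as a greedy filter that keeps a tweet only if it is not too similar to any tweet already kept. The value d = 0.7 is an assumed placeholder, since the slides do not give the threshold. The function names and the lowercasing step are also assumptions, and the pairwise comparison is quadratic, which a real system would likely need to avoid.

```python
import re


def normalize(text):
    """Strip 'RT', @-mentions and punctuation, then return the token set."""
    text = re.sub(r"\bRT\b", " ", text)
    text = re.sub(r"@\w+", " ", text)
    text = re.sub(r"[^\w\s]", " ", text)
    return set(text.lower().split())


def jaccard(a, b):
    """Jaccard similarity |a ∩ b| / |a ∪ b| of two token sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


def remove_near_duplicates(tweets, d=0.7):
    """Keep the first tweet of every group whose similarity exceeds d."""
    kept, kept_tokens = [], []
    for tweet in tweets:
        tokens = normalize(tweet)
        if all(jaccard(tokens, seen) <= d for seen in kept_tokens):
            kept.append(tweet)
            kept_tokens.append(tokens)
    return kept
```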
Clustering
● The goal is to cluster tweets mentioning the same entities together
Eg: “#Miami #News NYC Doc Free of Ebola: Sources: Dr. Craig Spencer, the physician being treated for Ebola at Belle... http://t.co/iXSUk4axVV”
“#Ebola so the good doctor Craig Spencer will go home - well - the nurse too free to roam but lest we forget 3 countries still suffer deeply”
● Vectors of NER tags are converted to Tf-Idf scores, and the cosine value is used as the distance measure between two NER tag vectors
● DBSCAN is selected because it does not require the number of clusters to be specified and can identify arbitrarily shaped clusters
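The entity-centric clustering step can be illustrated with a small self-contained sketch: entity lists are turned into Tf-Idf vectors, distance is taken as one minus cosine similarity, and a plain DBSCAN groups the vectors. The smoothed idf formula and the `eps` and `min_pts` values are assumptions, as the slides do not report them. The entity lists in the usage lines are invented for illustration, and the entities are assumed to come from an external named entity recognizer.

```python
import math
from collections import Counter


def tfidf_vectors(entity_lists):
    """Map each list of entities to a sparse Tf-Idf dict (smoothed idf, assumed)."""
    n_docs = len(entity_lists)
    df = Counter(e for ents in entity_lists for e in set(ents))
    return [{e: c * (math.log((1 + n_docs) / (1 + df[e])) + 1)
             for e, c in Counter(ents).items()} for ents in entity_lists]


def cosine_distance(u, v):
    """1 - cosine similarity of two sparse vectors; 1.0 if either is empty."""
    dot = sum(w * v.get(e, 0.0) for e, w in u.items())
    norm = (math.sqrt(sum(w * w for w in u.values()))
            * math.sqrt(sum(w * w for w in v.values())))
    return 1.0 - dot / norm if norm else 1.0


def dbscan(vectors, eps=0.5, min_pts=2):
    """Return one cluster id per vector; -1 marks noise."""
    NOISE, UNSEEN = -1, None
    labels = [UNSEEN] * len(vectors)

    def neighbours(i):
        return [j for j in range(len(vectors))
                if cosine_distance(vectors[i], vectors[j]) <= eps]

    cluster = 0
    for i in range(len(vectors)):
        if labels[i] is not UNSEEN:
            continue
        seeds = neighbours(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE            # may later become a border point
            continue
        labels[i] = cluster
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster      # border point joins the cluster
            if labels[j] is not UNSEEN:
                continue
            labels[j] = cluster
            reachable = neighbours(j)
            if len(reachable) >= min_pts:  # j is a core point: keep expanding
                queue.extend(reachable)
        cluster += 1
    return labels


entities = [["Craig", "Spencer", "Ebola", "US"], ["Craig", "Spencer", "Ebola"],
            ["Sri", "Lanka", "India"], ["India", "Sri", "Lanka", "ODI"], ["Obama"]]
print(dbscan(tfidf_vectors(entities)))
```

A tweet whose entity vector has no sufficiently close neighbour is labeled as noise instead of being forced into a cluster.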
Clustering - results
● The SVM classifier trained on the ebola-3000 data set is applied to a corpus of 24,038 unseen tweets retrieved on a single day (11-11-2014)
● 13,380 tweets were detected as objective, 8,138 of them duplicates. Clustering resulted in 332 clusters, while 2,751 tweets were labeled as noise
● Clusters depend on the quality of the Named Entity Recognizer
Entities: ['Craig', 'Ebola', 'Patient', 'Spencer', 'US']
Clustering - discussion
● In contrast, this tweet was labeled as noise
“#Ebola Ebola Outbreak: US Free of Virus After New York Doctor Craig Spencer Cleared - International Business Times UK”
entities - ['Business', 'Craig', 'Ebola', 'Free', 'International', 'New', 'Outbreak', 'Spencer', 'Times', 'US', 'Virus', 'York']
Future work
● Improve cross-domain applicability
○ Finding better features with less dependence on the domain
● A better methodology to evaluate summaries
● Improve clustering to also consider verbs
● Generate an abstractive summary
○ Generate novel sentences from the information contained in tweets
● Generate summaries in real time