Content Mining of Microblogs
M. Özgür Cingiz
Assoc. Prof. Dr. Banu DİRİ
Yıldız Technical University
Computer Engineering Department
Content
• Aim & Scope
• Social Networks and Microbloggers
• Related Works
• Proposed System
  • Training Phase
  • Testing Phase
• Experimental Results and Discussion
Aim & Scope
• Microbloggers’ contents are evaluated with respect to how well they reflect their categories
• The category information of microbloggers is known in advance
• Users’ contents are also checked for up-to-dateness by using RSS news feeds
• Two types of user content are used as test data
  • Contents of normal users
  • Contents of bots
Social Networks
• After the emergence of the Web 2.0 concept, people can no longer be regarded as simple content readers, since they can also contribute content as writers
• Web 2.0 introduces concepts such as social networks, blogs and microblogs
• Users share their opinions, feelings, images, favorite videos and other contributions as microblog content
Social Networks
• Users keep in touch with one another in social networks
• Their contents and fields of interest connect users to each other
• Microblogs are one of the most popular social network areas (Twitter, Tumblr, identi.ca, Jaiku)
• Microblogs limit the number of characters per post
Microblogs
• In this work we utilize Twitter data
• User content on Twitter is known as a tweet
• A tweet is limited to 140 characters
• According to 2012 statistics:
  • Twitter has 465 million registered users
  • Users generate 175 million tweets each day
• This enormous amount of raw data is very attractive for researchers
Related Works
• Ece extracted word-hashtag, user-hashtag and word-user relations from tweets to discover users’ common interest areas
• Emre used the content of normal users and the content of bots in his work. He found that the content of bots is more categorical than the content of normal users
• Duygu extracted categorical features from the tweets of 150 users to build a social network from these features
Related Works
• Okay examined patterns in 1250 news items and found that 80% of them contain the N-V-N pattern. He looked for this pattern in tweets to discover news in tweets
• Baris used microbloggers’ contents and text classification techniques to measure how well users fit their categories
What does a user want on Twitter?
• Users follow other users according to their fields of interest
• A follower expects other users to post content about their category
• A follower expects other users to post recent content about their category
• This work intends to determine how well users reflect their category
• Contents are also evaluated for up-to-dateness
Proposed System Structure
Why do we choose RSS news feeds as training data?
• News providers such as BBC, CNN and others supply the category of each RSS news feed
• We want to investigate up-to-date tweets, so we look for current tweets by using RSS feeds
• RSS news feeds summarize the news, so we can get the few important terms and eliminate less distinctive terms from the news
• RSS news feeds have more reliable content than tweets. Tweets may not be as informative as RSS news feeds
RSS News Feeds
• RSS (Rich Site Summary) is a web feed format
• An RSS document is an XML file that contains a number of discrete news items
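
Reading the discrete news items out of an RSS document can be sketched as follows. This is a minimal illustration using Python's standard library; the sample feed, its contents and the helper name `read_items` are assumptions for the example and are not taken from the study.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document; a real feed would be downloaded from a provider.
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Sports Feed</title>
    <item>
      <title>Team wins final</title>
      <description>The home team won the cup final on Sunday.</description>
    </item>
    <item>
      <title>Transfer news</title>
      <description>A striker signed a new contract.</description>
    </item>
  </channel>
</rss>"""


def read_items(rss_text):
    """Return a list of (title, description) pairs, one per <item> element."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        items.append((title, description))
    return items


items = read_items(SAMPLE_RSS)
```

Each pair holds the short summary text of one news item, which is the kind of text the training phase would work on.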
Training Phase
• We used 2105 RSS news feeds in the training phase
• Four different categories are used to form the training model
• These categories are Sports, Technology, Economy and Entertainment:
  • 543 RSS feeds for Sports
  • 470 RSS feeds for Technology
  • 548 RSS feeds for Economy
  • 544 RSS feeds for Entertainment
Training Phase: Preprocessing
• First, we remove punctuation from the RSS news feeds
• The second step is tokenization
  • In this study words are used as tokens (terms)
• In previous text classification work, features were evaluated separately according to their linguistic labels. Nouns and verbs were found to be more distinctive, so we used only nouns and verbs to decrease the size of the feature space
• The same term can appear in different forms, so lemmatization is applied to all terms
  • drink, drank, drunk → drink
• Stop words, which have no distinctiveness for classification, are eliminated
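
The preprocessing steps above can be sketched roughly as below. The stop word set and the lemma dictionary are tiny hypothetical stand-ins, and the noun/verb selection step is left out because it needs a part-of-speech tagger; none of this is the tooling used in the study.

```python
import string

# Tiny illustrative resources (assumptions). A real system would use a full
# stop word list, a POS tagger to keep only nouns and verbs, and a lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "he", "she", "it"}
LEMMAS = {"drank": "drink", "drunk": "drink", "drinks": "drink"}


def preprocess(text):
    """Remove punctuation, tokenize, lemmatize and drop stop words."""
    # 1. Remove punctuation.
    cleaned = text.translate(str.maketrans("", "", string.punctuation))
    # 2. Tokenize on whitespace; words are the tokens (terms).
    tokens = cleaned.lower().split()
    # 3. Map each token to its lemma when the toy dictionary knows it.
    lemmas = [LEMMAS.get(tok, tok) for tok in tokens]
    # 4. Eliminate stop words.
    return [tok for tok in lemmas if tok not in STOP_WORDS]


terms = preprocess("He drank the water, and she drinks tea.")
```

With the toy dictionary, both "drank" and "drinks" collapse to the single term "drink", which is the effect the lemmatization step aims for.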
Training Phase
• The TF-IDF weighting method is applied to all terms
• After all preprocessing steps, the training model contains 7212 features, so feature reduction is necessary
• We used 2 different feature selection methods, with the Weka tool, to find the best feature subset
  • These are Information Gain and Chi-Square statistics
• We tried different threshold values in the feature selection phase
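
To make the Chi-Square selection step concrete, here is a small sketch of the textbook term/category statistic with a score threshold. It is an illustration under stated assumptions: it is not Weka's implementation, and the function names, the "best score over categories" rule and the toy documents are hypothetical.

```python
def chi_square(docs, labels, term, category):
    """Chi-square statistic between the presence of `term` and `category`."""
    a = b = c = d = 0
    for tokens, label in zip(docs, labels):
        has_term = term in tokens
        in_cat = label == category
        if has_term and in_cat:
            a += 1          # term present, document in category
        elif has_term:
            b += 1          # term present, document in another category
        elif in_cat:
            c += 1          # term absent, document in category
        else:
            d += 1          # term absent, document in another category
    n = a + b + c + d
    denom = (a + c) * (b + d) * (a + b) * (c + d)
    if denom == 0:
        return 0.0
    return n * (a * d - c * b) ** 2 / denom


def select_features(docs, labels, threshold):
    """Keep terms whose best per-category score reaches `threshold`."""
    vocabulary = {tok for tokens in docs for tok in tokens}
    categories = set(labels)
    return {
        term
        for term in vocabulary
        if max(chi_square(docs, labels, term, cat) for cat in categories) >= threshold
    }


docs = [["goal", "match"], ["goal", "team"], ["phone", "chip"], ["phone", "team"]]
labels = ["sports", "sports", "technology", "technology"]
selected = select_features(docs, labels, threshold=3.0)
```

In the toy data, "team" occurs equally in both categories and scores zero, so it is dropped, while "goal" and "phone" are kept. Raising or lowering the threshold changes the size of the feature subset.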
Training Phase
• The highest F-measure value, 95.2%, is obtained by using Chi-Square statistics as the feature selection method and Multinomial Naive Bayes as the classifier
• The 7212 features are reduced to 1277 features by feature selection
• Only these 1277 features, gathered from the training data, are used as the feature set in the study
• After all these steps, Multinomial Naive Bayes (MNNB) and Support Vector Machines (SVMs) are used as classifiers
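
As a rough illustration of how a Multinomial Naive Bayes classifier assigns a category, the sketch below trains on token lists with add-one (Laplace) smoothing. It uses raw term counts rather than TF-IDF weights and is not the Weka classifier used in the study; the class name, its methods and the toy data are assumptions.

```python
import math
from collections import Counter, defaultdict


class MultinomialNB:
    """Multinomial Naive Bayes over token lists with Laplace smoothing."""

    def fit(self, docs, labels):
        self.vocabulary = {tok for tokens in docs for tok in tokens}
        self.class_counts = Counter(labels)
        self.term_counts = defaultdict(Counter)
        for tokens, label in zip(docs, labels):
            self.term_counts[label].update(tokens)
        return self

    def predict(self, tokens):
        total_docs = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.class_counts.items():
            counts = self.term_counts[label]
            total_terms = sum(counts.values())
            # log P(c) plus the sum of log P(t | c) over terms in the vocabulary
            score = math.log(n_docs / total_docs)
            for tok in tokens:
                if tok in self.vocabulary:
                    score += math.log(
                        (counts[tok] + 1) / (total_terms + len(self.vocabulary))
                    )
            if score > best_score:
                best_label, best_score = label, score
        return best_label


train_docs = [["goal", "match", "team"], ["team", "goal"],
              ["phone", "chip"], ["chip", "software", "phone"]]
train_labels = ["sports", "sports", "technology", "technology"]
model = MultinomialNB().fit(train_docs, train_labels)
```

Terms that never appeared in training are ignored at prediction time, which mirrors the idea of restricting test data to the training feature set.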
Training Phase
Feature Set of Proposed System
Testing Phase
• After forming the training model, tweets of 26 normal users and tweets of 27 bots are used as test data (6671 tweets for the testing phase)
• Category information of Twitter users is obtained from the wefollow.com application (we use the same categories for classification: sports, entertainment, technology and economy)
• How can we know whether a user is a bot or a normal user?
  • After examining a user’s tweets, if the user contacts other users we categorize the user as a normal user; otherwise we categorize the user as a bot
                 # of Normal Users   # of Bots
Sports                   7               11
Entertainment            7                5
Technology               6                5
Economy                  7                5
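
The bot / normal user rule described above could be approximated in code as follows. This is only a sketch: the slides describe an examination of user tweets, and treating an @mention as "contact with another user" is an assumption made for this example, as is the function name.

```python
import re

# Assumption: "contacting another user" is approximated by an @mention.
MENTION = re.compile(r"(?<!\w)@\w+")


def classify_account(tweets):
    """Return "normal" if any tweet mentions another user, else "bot"."""
    for tweet in tweets:
        if MENTION.search(tweet):
            return "normal"
    return "bot"


kind_a = classify_account(["@alice great game last night!", "Match report"])
kind_b = classify_account(["Stocks rise http://example.com", "mail me at a@b"])
```

The look-behind in the pattern keeps text such as an e-mail address from counting as a mention.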
Testing Phase
• Removal of punctuation, tokenization, selection of features by their linguistic information, stemming and elimination of stop words are the preprocessing steps that we also applied to tweets
• Hyperlinks to images and videos are also removed from tweets
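
Removing hyperlinks from a tweet can be sketched with a regular expression. The pattern below is a simple assumption that catches `http://`, `https://` and `www.` links; it is not the rule used in the study.

```python
import re

# Simple, assumed URL pattern: scheme-prefixed links and bare "www." links.
URL = re.compile(r"https?://\S+|www\.\S+")


def strip_links(tweet):
    """Remove hyperlinks and collapse the whitespace left behind."""
    without_links = URL.sub(" ", tweet)
    return " ".join(without_links.split())


clean = strip_links("Great goal http://t.co/abc123 tonight")
```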
Testing Phase
• We want to check the up-to-dateness of tweets with respect to their category, so:
  • After all preprocessing is applied to tweets, not every word in a tweet is considered a feature for checking up-to-dateness
  • If a word in a tweet is not in the training feature set, the word is eliminated
  • We look in tweets only for features obtained from the training feature set
• As a result:
  • We can eliminate abbreviations and meaningless words in tweets
  • We can check the up-to-dateness of tweets (according to current news)
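
Restricting a tweet to the training feature set amounts to a membership filter, sketched below. The feature set here is a hypothetical four-term stand-in for the 1277 features of the study, and the function name is an assumption.

```python
def keep_known_terms(tokens, feature_set):
    """Drop every token that is not in the training feature set."""
    return [tok for tok in tokens if tok in feature_set]


# Hypothetical feature set; in the study it holds terms taken from RSS feeds.
feature_set = {"goal", "match", "team", "phone"}
tweet_tokens = ["omg", "goal", "lol", "team", "2nite"]
filtered = keep_known_terms(tweet_tokens, feature_set)
```

Abbreviations and noise words such as "omg" or "2nite" disappear simply because they never occur in the news-derived feature set.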
Testing Phase
• After the feature selection step, TF-IDF weighting is applied to all terms
• A tweet is limited to 140 characters, so it does not contain many words
• After all preprocessing steps and the feature selection criteria, some tweets are left with few or no features, so we specified three term count threshold values
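
For reference, one common form of TF-IDF weighting is sketched below. The slides do not say which TF-IDF variant was used, so the raw-count term frequency and the `log(N / df)` inverse document frequency here are assumptions, as is the function name.

```python
import math
from collections import Counter


def tf_idf(docs):
    """Return one {term: weight} dict per document.

    tf is the raw count of the term in the document and
    idf = log(N / df), where df is the number of documents containing it.
    """
    n_docs = len(docs)
    df = Counter()
    for tokens in docs:
        df.update(set(tokens))
    weights = []
    for tokens in docs:
        tf = Counter(tokens)
        weights.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return weights


weights = tf_idf([["goal", "goal", "team"], ["team", "phone"]])
```

A term that occurs in every document, such as "team" in this toy input, receives a weight of zero, while a term concentrated in one document is weighted up.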
Testing Phase
• These three term count threshold values are:
  • >2 (greater than two): tweets must have at least 2 terms
  • >3 (greater than three): tweets must have at least 3 terms
  • >4 (greater than four): tweets must have at least 4 terms
• These three different test data sets are used separately in the testing phase
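
Building the three test sets can be sketched as a filter on the number of remaining terms per tweet. The sketch follows the "at least N terms" wording and takes the minimum as an explicit parameter; if the thresholds are meant as strictly greater than N, the comparison would be `>` instead of `>=`. The function name and toy tweets are assumptions.

```python
def filter_by_term_count(tweets, min_terms):
    """Keep tweets (token lists) that have at least `min_terms` terms."""
    return [tokens for tokens in tweets if len(tokens) >= min_terms]


tweets = [
    ["goal"],
    ["goal", "team"],
    ["goal", "team", "match"],
    ["phone", "chip", "software", "market"],
]

# One test set per threshold; each stricter threshold keeps fewer tweets.
test_sets = {m: filter_by_term_count(tweets, m) for m in (2, 3, 4)}
```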
Testing Phase
# of User Tweets
                               Term Count Threshold Values
                               >2      >3      >4
# of tweets of Normal Users    627     285     107
# of tweets of Bots           1056     473     197
• For both user types, the number of tweets decreases as the term count threshold value increases
Testing Phase
• In the testing phase we use:
  • 2 classifiers: SVMs and MNNB
  • 3 different term count threshold values: >2, >3 and >4
  • 2 different types of user tweets: tweets of bots and tweets of normal users
• F-measure is used to evaluate classification performance
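
The F-measure for a single category is the harmonic mean of precision and recall, sketched below. How the study averages the measure over the four categories is not stated in the slides, so the per-category form, the function name and the toy labels are assumptions.

```python
def f_measure(true_labels, predicted_labels, category):
    """F1 score for one category: harmonic mean of precision and recall."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true = ["sports", "sports", "technology", "technology"]
pred = ["sports", "technology", "technology", "technology"]
f_sports = f_measure(true, pred, "sports")
f_tech = f_measure(true, pred, "technology")
```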
Experimental Results & Discussion
F-Measure (MNNB || SVMs) by term count threshold value
                          >2                >3                >4
Tweets of Bots            0.692 || 0.866    0.737 || 0.909    0.78 || 0.937
Tweets of Normal Users    0.633 || 0.777    0.671 || 0.804    0.757 || 0.868
• The classification performance on bots’ tweets is higher than on normal users’ tweets
• Tweets of bots are more categorical than tweets of normal users
• MNNB outperforms SVMs in terms of classification performance at each threshold value
• Classification performance increases as the term count threshold value increases (for tweets of both user types)
• This indicates that tweets which have more terms give better results (for tweets of both user types)
Conclusion
• In this study:
  • We wanted to evaluate how well normal users and bots reflect their categories
  • We used RSS news feeds to check whether users’ content is up to date
  • We examined the classification results, which show that the content of bots reflects its categories better than the content of normal users, and that tweets of bots are more up to date than tweets of normal users
References
• Aslan, O., “Revealing An Analysis of News On Microblogging Systems”, Master’s Thesis, Boğaziçi University, 2010
• Shamma, D. A., L. Kennedy, and E. F. Churchill, “Tweet the debates: understanding community annotation of uncollected sources”, WSM ’09: Proceedings of the first SIGMM workshop on Social media, pp. 3-10, ACM, New York, NY, USA, 2009
• Akman, D. S., “Revealing Microblogger Interests By Analyzing Contributions”, Master’s Thesis, Boğaziçi University, 2010
• Yurtsever, E., “Sweettweet: A Semantic Analysis For Microblogging Environments”, Master’s Thesis, Boğaziçi University, 2010
• Vieweg, S., A. L. Hughes, K. Starbird, and L. Palen, “Microblogging during two natural hazards events: what Twitter may contribute to situational awareness”, in Mynatt, E. D., D. Schoner, G. Fitzpatrick, S. E. Hudson, K. Edwards, and T. Rodden (editors), CHI, pp. 1079-1088, ACM, 2010
• Güç, B., “Information Filtering on Micro-blogging Services”, Master’s Thesis, Swiss Federal Institute of Technology Zürich, 2010
• Leopold, E., and J. Kindermann, “Text categorization with support vector machines. How to represent texts in input space?”, Machine Learning 46, pp. 423-444, 2002
• Yang, Y., and X. Liu, “A Re-Examination of Text Categorization Methods”, In Proc. 22nd Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, 1999
Thank You