
Content Mining of Microblogs

M. Özgür Cingiz, Assoc. Prof. Dr. Banu DİRİ
Yıldız Technical University
Computer Engineering Department

Content
• Aim & Scope
• Social Networks and Microbloggers
• Related Works
• Proposed System
    • Training Phase
    • Testing Phase
• Experimental Results and Discussion

Aim & Scope
• Microbloggers’ contents are evaluated with respect to how well they reflect their categories
• The category of each microblogger is known in advance
• Users’ contents are also checked for up-to-dateness by using RSS news feeds
• Two types of user content are used as test data
    • Contents of normal users
    • Contents of bots

Social Networks
• After the emergence of the Web 2.0 concept, people can no longer be regarded as simple content readers, since they can also contribute content as writers
• Web 2.0 introduced concepts such as social networks, blogs and microblogs
• Users share their opinions, feelings, images, favorite videos and other contributions as microblog content

Social Networks
• Users keep in touch with one another in social networks
• Their contents and fields of interest connect users to each other
• Microblogs are one of the most popular kinds of social network (Twitter, Tumblr, identi.ca, Jaiku)
• Microblogs limit the number of characters per post

Microblogs
• In this work we utilize Twitter data
• User content on Twitter is known as a tweet
• A tweet is limited to 140 characters
• According to 2012 statistics
    • Twitter has 465 million registered users
    • Users generate 175 million tweets each day
• This enormous amount of raw data is very attractive for researchers

Related Works
• Ece extracted word-hashtag, user-hashtag and word-user relations from tweets to discover users’ common interest areas
• Emre used the content of normal users and the content of bots in his work. He found that the contents of bots are more categorical than those of normal users
• Duygu extracted categorical features from the tweets of 150 users and used these features to build a social network

Related Works
• Okay examined patterns in 1250 news items and found that 80% of them contain the N-V-N (noun-verb-noun) pattern. He looked for this pattern in tweets to discover news in tweets
• Baris used microbloggers’ contents and text classification techniques to measure how well users fit their categories

What does a user want on Twitter?
• Users follow other users according to their fields of interest
• A follower expects other users to post content about their category
• A follower expects other users to post recent content about their category
• This work aims to determine how well users reflect their category
• Contents are also evaluated for up-to-dateness

Proposed System Structure

Proposed System Structure
Why do we choose RSS news feeds as training data?
• News providers such as BBC, CNN and others supply the category of their RSS news feeds
• We want to investigate up-to-date tweets, so we look for current tweets by using RSS feeds
• An RSS news feed entry is a summary of the news, so we get the few important terms and eliminate the less distinctive terms of the news
• RSS news feeds have more reliable content than tweets. Tweets may not be as informative as RSS news feeds

RSS News Feeds
• RSS (Rich Site Summary) is a web feed format
• An RSS document is an XML file that contains a number of discrete news items
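As an illustration of the "XML file with discrete news items" idea, the following minimal sketch reads the items of an RSS 2.0 document with Python's standard library. The sample feed and the function name `parse_items` are hypothetical and are not taken from the presented system.

```python
import xml.etree.ElementTree as ET

# Hypothetical RSS 2.0 document with two news items (illustrative only).
SAMPLE_RSS = """<?xml version="1.0"?>
<rss version="2.0">
  <channel>
    <title>Example Sports Feed</title>
    <item>
      <title>Team wins final</title>
      <description>The home team won the cup final on penalties.</description>
    </item>
    <item>
      <title>Transfer confirmed</title>
      <description>The club confirmed the signing of a new striker.</description>
    </item>
  </channel>
</rss>"""


def parse_items(rss_text):
    """Return a list of (title, description) pairs, one per <item> element."""
    root = ET.fromstring(rss_text)
    items = []
    for item in root.iter("item"):
        title = item.findtext("title", default="")
        description = item.findtext("description", default="")
        items.append((title.strip(), description.strip()))
    return items


for title, description in parse_items(SAMPLE_RSS):
    print(title, "|", description)
```

Each pair could then serve as one training document whose category is the one announced by the feed provider.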

RSS News Feeds

Training Phase
• We used 2105 RSS news feeds in the training phase
• Four different categories are used to form the training model
• These categories are Sports, Technology, Economy and Entertainment
    • 543 RSS feeds for Sports
    • 470 RSS feeds for Technology
    • 548 RSS feeds for Economy
    • 544 RSS feeds for Entertainment

Training Phase
Preprocessing
• First, we remove punctuation from the RSS news feeds
• The second step is tokenization
    • In this study, words are used as tokens (terms)
• In previous text classification work, features were evaluated separately according to their linguistic labels. Nouns and verbs were found to be more distinctive, so we used only nouns and verbs to decrease the size of the feature space
• The same term can appear in different forms, so lemmatization is applied to all terms
    • drink, drank, drunk → drink
• Stop words, which have no distinctive value for classification, are eliminated
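The preprocessing steps above can be sketched as follows. This is a simplified illustration, not the system's implementation: the stop-word list and the lemma lookup table are tiny hypothetical stand-ins, and the noun/verb selection step is omitted because it needs a part-of-speech tagger.

```python
import string

# Tiny illustrative resources (assumptions); a real system would use a full
# stop-word list and a proper lemmatizer.
STOP_WORDS = {"the", "a", "an", "of", "and", "to", "in", "is"}
LEMMAS = {"drank": "drink", "drunk": "drink", "drinks": "drink"}


def preprocess(text):
    """Apply punctuation removal, tokenization, lemmatization, stop-word removal."""
    # 1) Remove punctuation characters.
    no_punct = text.translate(str.maketrans("", "", string.punctuation))
    # 2) Tokenize: lowercase and split on whitespace, so words are the tokens.
    tokens = no_punct.lower().split()
    # 3) Lemmatize with the lookup table; unknown words are left unchanged.
    lemmas = [LEMMAS.get(token, token) for token in tokens]
    # 4) Drop stop words.
    return [token for token in lemmas if token not in STOP_WORDS]


print(preprocess("He drank the water, and she drinks tea!"))
```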

Training Phase
• The TF-IDF weighting method is applied to all terms
• After all preprocessing steps, the training model contains 7212 features, so feature reduction is necessary
• We used two different feature selection methods to find the best feature subset
    • These are Information Gain and Chi-Square Statistics
    • Both were applied using the Weka tool
• We tried different threshold values in the feature selection phase
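The slides do not state which TF-IDF variant is used. The sketch below assumes the common form weight(t, d) = tf(t, d) · log(N / df(t)), where N is the number of documents and df(t) is the number of documents containing term t. The function name `tf_idf` is illustrative.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: list of token lists. Returns one {term: weight} dict per document."""
    n_docs = len(documents)
    # Document frequency: in how many documents each term occurs.
    df = Counter()
    for tokens in documents:
        df.update(set(tokens))
    weights = []
    for tokens in documents:
        tf = Counter(tokens)  # raw term frequency inside this document
        weights.append({t: c * math.log(n_docs / df[t]) for t, c in tf.items()})
    return weights


docs = [["goal", "match", "goal"], ["match", "market"]]
for doc_weights in tf_idf(docs):
    print(doc_weights)
```

Under this variant, a term that occurs in every document gets weight 0, which matches the intuition that such a term does not help separate categories.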

Training Phase
• The highest F-measure value, 95.2%, is obtained by using Chi-Square Statistics as the feature selection method and Multinomial Naive Bayes as the classifier
• The 7212 features are reduced to 1277 features by feature selection
• Only these 1277 features, gathered from the training data, are used as the feature set in this study
• After all these steps, Multinomial Naive Bayes and Support Vector Machines are used as classifiers
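To make the chi-square selection step concrete, here is a minimal sketch of the standard one-term, one-category statistic χ² = N (AD − CB)² / ((A+C)(B+D)(A+B)(C+D)). Here A and B count documents containing the term inside and outside the category, and C and D count documents without the term inside and outside the category. This is an illustration under that assumption, not necessarily the computation Weka performs, and the function names are hypothetical.

```python
def chi_square(a, b, c, d):
    """Chi-square statistic of a 2x2 term/category contingency table."""
    n = a + b + c + d
    denominator = (a + c) * (b + d) * (a + b) * (c + d)
    if denominator == 0:
        return 0.0  # term or category is constant, so it carries no information
    return n * (a * d - c * b) ** 2 / denominator


def chi_square_scores(documents, labels, category):
    """Score every term against one category; documents is a list of token lists."""
    vocabulary = {term for tokens in documents for term in tokens}
    scores = {}
    for term in vocabulary:
        a = b = c = d = 0
        for tokens, label in zip(documents, labels):
            has_term = term in tokens
            in_category = label == category
            if has_term and in_category:
                a += 1
            elif has_term:
                b += 1
            elif in_category:
                c += 1
            else:
                d += 1
        scores[term] = chi_square(a, b, c, d)
    return scores


docs = [["goal", "news"], ["goal", "news"], ["market", "news"], ["market", "news"]]
labels = ["sports", "sports", "economy", "economy"]
print(chi_square_scores(docs, labels, "sports"))
```

Terms would then be ranked by score and those below a chosen threshold discarded, which is one way to read the reduction from 7212 to 1277 features.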

Training Phase

Feature Set of Proposed System

Testing Phase
• After forming the training model, tweets of 26 normal users and tweets of 27 bots are used as test data (6671 tweets for the testing phase)
• Category information of Twitter users is obtained from the wefollow.com application (we take the same categories for classification: sports, entertainment, technology and economy)
• How can we know whether a user is a bot or a normal user?
    • After examining a user’s tweets, if the user contacts other users, we categorize the user as a normal user; otherwise we categorize the user as a bot

Category         # of Normal Users    # of Bots
Sports           7                    11
Entertainment    7                    5
Technology       6                    5
Economy          7                    5
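The bot/normal rule above can be sketched in a few lines. The slides do not define what counts as "contacting other users"; the sketch assumes that an @mention or reply in any tweet is such a contact. The pattern and the function name are illustrative.

```python
import re

# Assumption: an "@name" token that does not follow a word character
# (so e-mail addresses are ignored) signals contact with another user.
MENTION = re.compile(r"(?<!\w)@\w+")


def is_normal_user(tweets):
    """Return True if any tweet mentions another user, else False (bot)."""
    return any(MENTION.search(tweet) for tweet in tweets)


print(is_normal_user(["@alice great match today!", "Goal!"]))
print(is_normal_user(["Breaking: markets close higher"]))
```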

Testing Phase

Testing Phase
• Removal of punctuation, tokenization, selection of features by their linguistic information, stemming and elimination of stop words are the preprocessing steps that we also applied to tweets
• Hyperlinks to images and videos are also removed from tweets
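Hyperlink removal can be sketched with a regular expression. The pattern below is an assumption that treats any whitespace-delimited token starting with http:// or https:// as a hyperlink.

```python
import re

URL = re.compile(r"https?://\S+")


def strip_links(tweet):
    """Remove hyperlinks and collapse the leftover whitespace."""
    return " ".join(URL.sub(" ", tweet).split())


print(strip_links("Great goal http://example.com/video watch it"))
```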

Testing Phase
• We want to check the up-to-dateness of tweets with respect to their category, so
    • After all preprocessing is applied to the tweets, not every word in a tweet is considered as a feature for checking up-to-dateness
    • If a word in a tweet is not in the training feature set, the word is eliminated
    • We look in the tweets only for features from the training feature set
• As a result
    • We can eliminate abbreviations and meaningless words in tweets
    • We can check the up-to-dateness of tweets (according to current news)
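Restricting tweets to the training feature set amounts to a vocabulary filter. A minimal sketch, with a hypothetical three-term feature set standing in for the 1277 selected features:

```python
def keep_training_features(tokens, feature_set):
    """Keep only the tokens that also occur in the training feature set."""
    return [token for token in tokens if token in feature_set]


# Hypothetical feature set learned from current RSS news feeds.
features = {"goal", "match", "market"}
print(keep_training_features(["omg", "goal", "lol", "match"], features))
```

Because the feature set comes from current news, a tweet that keeps many terms after this filter shares vocabulary with recent news in its category.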

Testing Phase
• After the feature selection step, TF-IDF weighting is applied to all terms
• A tweet is limited to 140 characters, so it does not contain many words
• After all preprocessing steps and feature selection criteria, some tweets are left with few or no features, so
    • We specified three term count threshold values

Testing Phase
• The three term count threshold values are
    • >2 (greater than two): Tweets must have at least 2 terms.
    • >3 (greater than three): Tweets must have at least 3 terms.
    • >4 (greater than four): Tweets must have at least 4 terms.
• These three test data sets are used separately in the testing phase
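Building the three test sets can be sketched as a simple filter on the number of terms left in each tweet. The slide labels the thresholds ">2, >3, >4" while describing them as "at least 2, 3, 4 terms"; the sketch takes the "at least" reading as an assumption and keeps the minimum as a parameter so that the other reading only needs different values.

```python
def filter_by_term_count(tweets, min_terms):
    """Keep only tweets (token lists) with at least `min_terms` remaining terms."""
    return [tokens for tokens in tweets if len(tokens) >= min_terms]


def build_test_sets(tweets, thresholds=(2, 3, 4)):
    """Return one filtered test set per threshold value."""
    return {k: filter_by_term_count(tweets, k) for k in thresholds}


tweets = [["goal"], ["goal", "match"], ["goal", "match", "team"],
          ["market", "bank", "rate", "stock"]]
for threshold, subset in build_test_sets(tweets).items():
    print(threshold, len(subset))
```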

Testing Phase
# of User Tweets

                               Term Count Threshold Values
                               >2      >3      >4
# of tweets of Normal Users    627     285     107
# of tweets of Bots            1056    473     197

• For both user types, the number of tweets decreases as the term count threshold value increases

Testing Phase
In the testing phase we use
• 2 classifiers: SVMs and MNNB (Multinomial Naive Bayes)
• 3 different term count threshold values: >2, >3 and >4
• 2 different types of user tweets: tweets of bots and tweets of normal users
• F-measure for evaluating classification performance
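For reference, the F-measure of one category is the harmonic mean of precision P and recall R, F = 2PR / (P + R). The sketch below computes it per category from true and predicted labels. How the per-category values are averaged into a single figure is not stated in the slides, so no averaging is shown.

```python
def f_measure(true_labels, predicted_labels, category):
    """Per-category F-measure (harmonic mean of precision and recall)."""
    pairs = list(zip(true_labels, predicted_labels))
    tp = sum(1 for t, p in pairs if t == category and p == category)
    fp = sum(1 for t, p in pairs if t != category and p == category)
    fn = sum(1 for t, p in pairs if t == category and p != category)
    if tp == 0:
        return 0.0  # no correct prediction for this category
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


true = ["sports", "sports", "economy", "economy"]
pred = ["sports", "economy", "economy", "economy"]
print(f_measure(true, pred, "sports"), f_measure(true, pred, "economy"))
```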

Experimental Results & Discussion

F-Measure (MNNB || SVMs)    Term Count Threshold Values
                            >2                >3                >4
Tweets of Bots              0.692 || 0.866    0.737 || 0.909    0.78 || 0.937
Tweets of Normal Users      0.633 || 0.777    0.671 || 0.804    0.757 || 0.868

• Classification performance on bots’ tweets is higher than on normal users’ tweets
• Tweets of bots are more categorical than tweets of normal users
• MNNB outperforms SVMs in terms of classification performance at each threshold value
• Classification performance increases as the term count threshold value increases (this holds for tweets of both user types)
• This indicates that tweets with more terms give better results (this holds for tweets of both user types)

Conclusion
In this study
• We evaluated how well normal users and bots reflect their categories
• We used RSS news feeds to check whether users’ content is up to date
• The classification results show that the content of bots reflects their categories better than the content of normal users, and that tweets of bots are more up to date than tweets of normal users


References
• Aslan, O., “Revealing An Analysis of News On Microblogging Systems”, Master’s Thesis, Boğaziçi University, 2010
• Shamma, D. A., L. Kennedy, and E. F. Churchill, “Tweet the debates: understanding community annotation of uncollected sources”, WSM’09: Proceedings of the first SIGMM workshop on Social media, pp. 3-10, ACM, New York, NY, USA, 2009
• Akman, D. S., “Revealing Microblogger Interests By Analyzing Contributions”, Master’s Thesis, Boğaziçi University, 2010
• Yurtsever, E., “Sweettweet: A Semantic Analysis For Microblogging Environments”, Master’s Thesis, Boğaziçi University, 2010
• Vieweg, S., A. L. Hughes, K. Starbird, and L. Palen, “Microblogging during two natural hazards events: what twitter may contribute to situational awareness”, in Mynatt, E. D., D. Schoner, G. Fitzpatrick, S. E. Hudson, K. Edwards, and T. Rodden (editors), CHI, pp. 1079-1088, ACM, 2010
• Güç, B., “Information Filtering on Micro-blogging Services”, Master’s Thesis, Swiss Federal Institute of Technology Zürich, 2010
• Leopold, E., and Kindermann, J., “Text categorization with support vector machines. How to represent texts in input space?”, Machine Learning 46, pp. 423-444, 2002
• Yang, Y., and Liu, X., “A Re-Examination of Text Categorization Methods”, in Proc. 22nd Ann. Int. ACM SIGIR Conference on Research and Development in Information Retrieval, pp. 42-49, 1999


Thank You
