twitter analysis

Performing sentiment analysis on Twitter data (2011 Norway attacks)

Team:

Aparna Dhanashri Jayaprakash – 50094768
Himanshu Yadav – 50093151
Inder Puneet Singh – 50094241
Sabah Abdul Mannan Khan – 50094894
Vidya Mulukutla – 50095830

Upload: himanshu-yadav

Post on 28-Jul-2015


Analysis of Twitter Data Set

Introduction

Big Data is increasingly relevant in today’s digitalized world and is used in many different domains. With social media so pervasive, it makes sense to use it to generate data sets for analysis in areas ranging from politics to entertainment. We chose Twitter as our data source because it has a wide user base that includes ordinary people as well as prominent figures from media, film, sports and politics. Many analytical results can be derived from a popular and widely used social media platform like Twitter, and we processed the data it generated using Apache Hadoop and Hive. To gauge the reactions of users who responded to significant events in July 2011, we performed a sentiment analysis. Sentiment analysis, also known as opinion mining, is the process of extracting subjective information from text through natural language processing, computational linguistics and text analysis. Two important and sharply contrasting events took place in July 2011, and we carried out a comparison analysis of them. The events are described below:

The Norway attacks of 2011 were the deadliest attacks in the country since the Second World War. Two sequential attacks took place within a span of two hours on 22nd July 2011. The first was a car bomb explosion at the executive government headquarters in Oslo that killed eight people and injured around 209. The second was a shooting on the island of Utøya, at a summer camp organized by the youth division of the ruling party. A gunman gained access to the camp and opened fire on the participants. This attack claimed 69 lives and seriously injured 110 people. The perpetrator, Anders Behring Breivik, was sentenced to 21 years’ imprisonment.


Amy Winehouse was a hugely popular British singer and songwriter. Her work was acclaimed both critically and commercially, and she won multiple Grammy Awards. Her sudden death from alcohol poisoning on 23rd July 2011 shocked millions of fans worldwide and sent online social networks into a frenzy.

Hypothesis

We set out to evaluate how users from different geographical locations reacted to the two stories on Twitter. Our hypothesis was that the Norway attacks would affect the public more than the death of Amy Winehouse and would therefore garner more tweets, hashtags and retweets, since it was the graver event: an attack in which many lives were lost and even more people were critically injured. We compared the two events using sentiment analysis.

Technology

For our implementation, we used Apache Hadoop deployed on Amazon EC2 instances to process the data. The Hadoop master ran on an m1.large instance, whereas the Hadoop slaves ran on m1.small instances. We chose the M1 general-purpose instance types primarily for their low cost; they suit applications that need only moderate CPU performance.

Apache Hive was used to analyze, summarize and query the data using a SQL-like language known as HiveQL.

Data Preparation

Data Selection

The extracted data was segregated into different tables for convenience of analysis. One of the tables from the Norway attacks event, giving the number of occurrences of each hashtag, is shown below:

Hashtag            Count
Oslo               466
Norway             396
tcot               308
oslo               244
p2                 234
SAVEAMERICANOW     214
news               124
blamethemuslims    111
norway             110
breakingnews       93
isles              93
fb                 90
islanders          88
cnn                82
Utoya              74
teaparty           61
osloexpl           55
News               55
prayfornorway      55
tlot               36
Breivik            34
socialmedia        34
politics           32
NFL                32
utoya              27
PrayForNorway      27
Utøya              27
CNN                26
Islam              24
oslobomb           24

Data Cleaning:

Contrary to our expectation that the data set would be limited to one specific time period of, say, one year, the information extracted from it spanned many years, so no single time period held a high density of information. This meant that we first had to find events that occurred within a specific time period. In addition, because the data in the data set was acquired from a variety of sources, it often contained a lot of redundant data, which makes the deletion of duplicate information mandatory before any analysis can be conducted.

Because we were dealing with huge data sets, we partitioned the data to make the analysis easier and to improve query performance. Another important aspect of data cleaning is geotagging locations. This needs attention because the same address can be written in various forms. For example, Bangalore, Bangalore Karnataka and Bangalore Karnataka India are all different ways to write the same location. To perform an accurate analysis, the location needs to be normalized and converted into a single format. We did this with Google’s Geocoding API, which provides a straightforward way to convert an address into geographic coordinates (latitude and longitude) that can be used for map positioning.
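The location normalization step can be sketched roughly as follows. This is a minimal illustration, not the code used in the project: the helper names build_geocode_url and extract_coordinates are hypothetical, the API key is a placeholder, and SAMPLE_RESPONSE is a hand-written stand-in with approximate coordinates rather than real API output. It assumes the JSON form of the Geocoding API, where a successful response carries status "OK" and the coordinates sit under results[0].geometry.location.

```python
import json
from urllib.parse import urlencode

# JSON endpoint of the Google Geocoding API.
GEOCODE_ENDPOINT = "https://maps.googleapis.com/maps/api/geocode/json"


def build_geocode_url(address, api_key):
    """Build a request URL for a free-form location string (hypothetical helper)."""
    return GEOCODE_ENDPOINT + "?" + urlencode({"address": address, "key": api_key})


def extract_coordinates(response_text):
    """Return (lat, lng) from a Geocoding JSON response, or None if nothing matched."""
    payload = json.loads(response_text)
    if payload.get("status") != "OK" or not payload.get("results"):
        return None
    location = payload["results"][0]["geometry"]["location"]
    return (location["lat"], location["lng"])


# Hand-written sample response with approximate values, for illustration only.
SAMPLE_RESPONSE = json.dumps({
    "status": "OK",
    "results": [{"geometry": {"location": {"lat": 12.97, "lng": 77.59}}}],
})

if __name__ == "__main__":
    # Each spelling of the same place becomes its own request.
    for variant in ("Bangalore", "Bangalore Karnataka", "Bangalore Karnataka India"):
        print(build_geocode_url(variant, "YOUR_API_KEY"))
    print(extract_coordinates(SAMPLE_RESPONSE))
```

If the different spellings of a place resolve to the same coordinates, tweets can be grouped by coordinate pair rather than by the raw location string.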

Challenges faced during Implementation:

Some of the hindrances that we encountered with the extracted data are:

Duplicate files:

The extracted data contained a huge number of repeated files with the same content. This is a major nuisance, because the files with unique content must be filtered out through additional processing, which is also very time consuming.
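One way to filter out files with identical content is to compare content hashes. The sketch below is an assumption about how such a filter could look, not a description of the processing actually used; the function names file_digest and unique_files are hypothetical.

```python
import hashlib


def file_digest(path, chunk_size=1 << 20):
    """Return the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def unique_files(paths):
    """Keep the first path seen for each distinct file content, in input order."""
    seen = set()
    kept = []
    for path in paths:
        key = file_digest(path)
        if key not in seen:
            seen.add(key)
            kept.append(path)
    return kept
```

Reading in chunks keeps memory use flat even for large extract files, and only the paths returned by unique_files need to be loaded for analysis.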

Parsing data:

Parsing is difficult and can fail for various reasons, for example when the Twitter data contains text in many languages. Another cause is a JSON structure that was closed incorrectly, which prevents data beyond that point from being read.

Complete data not recovered:

This issue concerns the failure to recover the complete data when extracting through Apache Hive. As we were dealing with huge data sets, a lot of extra programming and debugging was required to repair the situation. Parsing exceptions were raised, which we patched by locating the erroneous files.
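Locating the erroneous records can be sketched as a tolerant parser that records each failure instead of stopping at the first one. This is a minimal illustration under the assumption that the data is stored as one JSON object per line; the report does not state the file layout, and parse_tweets is a hypothetical name.

```python
import json


def parse_tweets(lines, source="<input>"):
    """Parse one JSON tweet per line and collect failures instead of aborting.

    Returns (tweets, errors), where each error is (source, line_number, message).
    """
    tweets, errors = [], []
    for line_number, line in enumerate(lines, start=1):
        line = line.strip()
        if not line:
            continue  # skip blank lines
        try:
            tweets.append(json.loads(line))
        except json.JSONDecodeError as exc:
            # Remember where the bad record is so the file can be inspected later.
            errors.append((source, line_number, exc.msg))
    return tweets, errors
```

A record whose JSON structure was closed incorrectly ends up in the error list with its line number, while the records after it are still read.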

Analysis

After the data selection and data cleaning process, we selected tables that represented various aspects of the analysis for the two events, Amy Winehouse and the Norway attacks: a comparison analysis of the two events along with a sentiment analysis for each. The following aspects help in analyzing the events:

Data Distribution, Hashtags count table, URLs count table, Tweet sentiment, and Famous tweeters.
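The hashtag, URL and user mention counts can be sketched in a few lines. The project computed them with Hive; the Python version below only illustrates the counting idea, the regular expressions are simplifying assumptions about what counts as a hashtag or mention, and count_entities is a hypothetical name.

```python
import re
from collections import Counter

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")
URL = re.compile(r"https?://\S+")


def count_entities(texts):
    """Count hashtags, user mentions and URLs across an iterable of tweet texts."""
    hashtags, mentions, urls = Counter(), Counter(), Counter()
    for text in texts:
        urls.update(URL.findall(text))
        # Remove URLs first so that URL fragments are not counted as hashtags.
        without_urls = URL.sub(" ", text)
        hashtags.update(HASHTAG.findall(without_urls))
        mentions.update(MENTION.findall(without_urls))
    return hashtags, mentions, urls
```

The counting is case-sensitive, which matches the hashtag table for the Norway attacks, where Oslo and oslo appear as separate rows. Calling hashtags.most_common(10) gives a top-ten list like the ones plotted in the charts.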

Event 1: Amy Winehouse

[Figure: Number of tweets per day for Event 1, 7/22/2011 to 7/31/2011. Vertical axis: No of Tweets, 0 to 25000.]

[Figure: URL share count for Event 1. URLs shown:
http://t.co/0IGT940
http://t.co/kLYO5t5
http://huff.to/oDwgHC
http://t.co/BtIzsiW
http://t.co/CahfKYh
http://on.msnbc.com/4dpW6f
http://nyp.st/qYGM9L
http://bit.ly/oapSdd
http://t.co/TkKR8Qm
http://n.pr/nnu5XS]

[Figure: Hashtag count for Event 1. Hashtags shown: RIP, amywinehouse, gonetoosoon, AmyWinehouse, music, news, nowplaying, singer, rip, Amy. Count axis: 0 to 600.]

[Figure: User mention count for Event 1. Accounts shown: SkyNewsBreak, YouTube, BreakingNews, HuffingtonPost, Reuters, NewYorkPost, iamshortymack, RollingStone, HotNewHipHop, mashable. Count axis: 0 to 450.]

Event 2: Norway attacks

[Figure: Number of tweets per day for Event 2, 7/22/2011 to 7/31/2011. Vertical axis: No of Tweets, 0 to 8000.]

[Figure: URL share count for Event 2, with individual shares between 2% and 7%. URLs shown:
http://on.mash.to/nViorD
http://bisi.pl/31b
http://bit.ly
http://budurl.com/2tl2
http://t.co/dPHb33j
http://bit.ly/qd41UN
http://apne.ws/qvdeXV
http://bit
http://t.co/AyS26mV
http://twitpic.com/5tzsmx
http://t.co/dXABr5T
http://apne.ws/qi7CM5
http://bit.ly/r6qXrY
http://bbc.in/oKHzCP
http://t.co/UHz8Y6f
http://bit.ly/qQGMEn
http://t.co
http://usat.ly/qiuUEr
http://nyti.ms/ok8QFs
http://bit.ly/qqAjb7
http://on.wsj.com/r8SJC0
http://qnlink.com/nr
http://bit.ly/pReaVF
http://on.mash.to/qPCslP
http://t.co/IdJqY2g
http://ti.me/qoySXA
http://on.mash.to/pbbEdm
http://bit.ly/reyW5l]

[Figure: Hashtag count for Event 2. Hashtags shown: Oslo, tcot, p2, news, norway, isles, islanders, Utoya, osloexpl, prayfornorway, Breivik, politics, utoya, Utøya, Islam. Count axis: 0 to 500.]

[Figure: User mention count for Event 2. Accounts shown: BreakingNews, Reuters, CBSNews, YouTube, HuffingtonPost, YahooNews, StateDept, mpoppel, ggreenwald, SenatorSanders. Count axis: 0 to 450.]

Comparison Analysis

Amy Winehouse died on 23rd July 2011, whereas the Norway attacks took place on 22nd July 2011. As the charts show, the number of tweets for event 1 peaked on the day of the event and then dropped steeply over the week until it finally died down. For the Norway attacks, in contrast, the number of tweets was highest on the day of the event and over the next couple of days, and then declined gradually. It is interesting to note that event 1 drew more than 20000 tweets on the day it occurred. Despite being of a more serious nature, event 2 saw far fewer tweets on the day of its occurrence.
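The per-day distribution behind this comparison can be sketched as a simple bucketing of tweets by calendar day. The sketch assumes that each tweet carries a created_at string in the classic Twitter API format (for example "Fri Jul 22 14:30:00 +0000 2011"); the report does not confirm the field layout, and the function names are hypothetical.

```python
from collections import Counter
from datetime import datetime

# Timestamp layout of the classic Twitter API "created_at" field (assumed here).
TWITTER_TIME_FORMAT = "%a %b %d %H:%M:%S %z %Y"


def tweets_per_day(created_at_values):
    """Count tweets per calendar day from created_at strings."""
    counts = Counter()
    for value in created_at_values:
        day = datetime.strptime(value, TWITTER_TIME_FORMAT).date()
        counts[day] += 1
    return counts


def peak_day(counts):
    """Return (day, count) for the day with the most tweets."""
    return max(counts.items(), key=lambda item: item[1])
```

Running tweets_per_day separately on the tweets of each event gives the two series that the charts plot, and peak_day picks out the day on which each event drew the most tweets.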

Sentiment Analysis

The sentiments of the tweets about the two events, classified as positive, negative or neutral, are visualized over the period from 07/22/2011 to 07/31/2011. The graphs below depict them:


Event 1: Amy Winehouse

[Figure: Tweet count by sentiment for Event 1, 20-Jul-11 to 1-Aug-11. Series: Positive Tweet, Negative Tweet, Neutral Tweet. Vertical axis: Tweet Count, 0 to 12000.]

Overall, Event 1 drew the most neutral tweets and the fewest positive tweets.

Event 2: Norway Attacks

Overall, Event 2 also drew the most neutral tweets and the fewest positive tweets. Interestingly, the number of negative tweets exceeded both the neutral and the positive tweets on the days following the event.

[Figure: Tweet count by sentiment for Event 2, 20-Jul-11 to 1-Aug-11. Series: Positive, Negative, Neutral. Vertical axis: Tweet Count, 0 to 8000.]
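The report does not state how each tweet was labeled positive, negative or neutral. One common approach is a word-list (lexicon) classifier, sketched below purely as an assumption; the word lists are tiny, hypothetical examples and the function names are not from the project.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny word lists for illustration only.
POSITIVE_WORDS = {"love", "hope", "brave", "beautiful", "support"}
NEGATIVE_WORDS = {"sad", "tragic", "horrible", "terrible", "shocked"}


def classify(text):
    """Label a tweet by comparing its positive and negative word counts."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE_WORDS for w in words) - sum(w in NEGATIVE_WORDS for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"  # no sentiment words, or an even balance


def sentiment_counts(texts):
    """Count how many tweets fall into each sentiment class."""
    return Counter(classify(text) for text in texts)
```

With a classifier of this kind, any tweet that contains no listed word is labeled neutral, so plain news headlines and link shares fall into the neutral class.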


Conclusion

Managing huge amounts of data is becoming more convenient with the advent of distributed file systems. They make it possible to store and analyze huge volumes of data, which can help assess the significance of a particular event over a period of time.

The analysis contradicts our initial hypothesis and leads us to conclude that the Amy Winehouse event was at least as popular on Twitter as an event as grave as the Norway attacks, if not more so. The retweets that the events generated help in determining the most discussed issues among Twitter users. It is surprising that a celebrity death can take precedence over an attack on a nation. One possible explanation is that people are cautious about commenting on sensitive issues and choose to refrain from expressing their views. The sentiment analysis supports this: with the graphs showing that neutral tweets were the most common for both events, it can be interpreted that most people are reserved in their opinions and take a neutral stand on a public platform where most activity is scrutinized, especially on an issue as delicate as the Norway attacks.

