group 10 baudm assignment 3


Upload: somalkant

Post on 05-Dec-2015


Page 1: Group 10 BAUDM Assignment 3

Word Cloud Analysis

Assignment III

Ansuman Chattopadhyay - 14PGP003

Praveen Kumar J R - 14PGP057

Robin Singh - 14PGP059

Vaibhav Bhatia - 14PGP100

S. Sreenvas - 14PGP119


Text analysis of Tweets

Tool - R

Rationale - R is a scripting language that can be used quite effectively for word cloud formation and analysis. We used version 3.2.1.

Procedure

We first installed the packages required to build a word cloud.

The following script was run to achieve the same:

install.packages("ROAuth")
install.packages("bitops")
install.packages("digest")
install.packages("rjson")
install.packages("NLP")
install.packages("twitteR")
install.packages("stringr")
install.packages("ggplot2")
install.packages("tm")
install.packages("RColorBrewer")
install.packages("wordcloud")
install.packages("RCurl")
install.packages("httpuv")
install.packages("plyr")
install.packages("RJSONIO")
install.packages("httr")
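Re-running the install script downloads every package again. As a minimal sketch (not part of the original script), the same list can be checked against what is already installed so that only the missing packages are fetched. The helper name missing_packages is our own and is only for illustration.

```r
# Packages named in the install script
pkgs <- c("ROAuth", "bitops", "digest", "rjson", "NLP", "twitteR",
          "stringr", "ggplot2", "tm", "RColorBrewer", "wordcloud",
          "RCurl", "httpuv", "plyr", "RJSONIO", "httr")

# Hypothetical helper: return the wanted packages that are not yet installed
missing_packages <- function(wanted, installed = rownames(installed.packages())) {
  setdiff(wanted, installed)
}

to_install <- missing_packages(pkgs)
to_install

# Uncomment to install whatever is missing (needs network access to a CRAN mirror)
# if (length(to_install) > 0) install.packages(to_install)
```

The install call is left commented out so that the check can be run safely on any machine.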

The packages were then loaded so that their functions could be called.


To find the predominant terms used on Twitter, we first have to connect to our Twitter account through the R script.

The following code does that job:

# Download Certificate File

download.file(url = "http://curl.haxx.se/ca/cacert.pem",destfile="cacert.pem")

# Set SSL certs globally

options(RCurlOptions = list(cainfo = system.file("CurlSSL", "cacert.pem", package = "RCurl")))

The next important step is to set the API keys from our account. The following code does just that:

# Set API key and API secret from the Twitter developer site
reqURL <- "https://api.twitter.com/oauth/request_token"
accessURL <- "https://api.twitter.com/oauth/access_token"
authURL <- "https://api.twitter.com/oauth/authorize"

# Generate the access token after creating the app in Twitter, replace with your values
# The values below do not work
apiKey <- "sb0mWFVbEFNtJnBQO0fWRUcV"
apiSecret <- "7XRvv9FrrL77Z2mHcecF9pygon4GjHtRw49J5RQA3jHWBVpY7"

oauthKey <- "2853123974-OUVIt05vqZRQXYjalZE0kWdoy6ubJyFFWvEzmU"
oauthSecret <- "mJGmEk45558v3xOTWacX28179fzqnBQgwf1jAJhexdqm"
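The script above only stores the keys; the report does not show the call that hands them to Twitter. The sketch below is one possible way to do it, assuming the setup_twitter_oauth() function of the twitteR package. The helper read_credential, the wrapper connect_to_twitter and the four environment variable names are our own assumptions, used so that real keys do not have to be typed into the script.

```r
# Hypothetical helper: read one credential from an environment variable
# and stop with a clear message if it is not set
read_credential <- function(name) {
  value <- Sys.getenv(name, unset = "")
  if (!nzchar(value)) stop("Environment variable not set: ", name)
  value
}

# Sketch of the handshake; it only works with valid credentials,
# network access and the twitteR package installed
connect_to_twitter <- function() {
  twitteR::setup_twitter_oauth(
    consumer_key    = read_credential("TWITTER_API_KEY"),
    consumer_secret = read_credential("TWITTER_API_SECRET"),
    access_token    = read_credential("TWITTER_OAUTH_KEY"),
    access_secret   = read_credential("TWITTER_OAUTH_SECRET")
  )
}

# connect_to_twitter()  # uncomment once the four variables are set
```

Defining the handshake as a function means nothing is sent to Twitter until connect_to_twitter() is actually called.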


We are searching for the hashtag “#Android”, so:

# Search tweets for Twitter trends
tweets = searchTwitter("#Android", n=100)

The above code creates an object called tweets, which holds up to 100 tweets returned by a Twitter search for “#Android”.

We then convert the tweets into a table, called a data frame. From the text column of that data frame we create a collection of documents, called a corpus.

Significance of a corpus

A corpus is significant because it is the object on which the text transformations and the analysis of the data set are performed.

The following code illustrates this:

# Converting tweets to a data frame
tweets = do.call("rbind", lapply(tweets, as.data.frame))
dim(tweets)
# Building the corpus
corpus = Corpus(VectorSource(tweets$text))
corpus[[3]]
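The do.call("rbind", lapply(...)) idiom can be tried without a Twitter connection. The sketch below uses three made-up tweets (plain lists standing in for the objects that searchTwitter returns) to show how each item becomes a one-row data frame and how the rows are then stacked. The sample texts and screen names are invented for illustration.

```r
# Made-up stand-ins for the tweets returned by searchTwitter()
mock_tweets <- list(
  list(text = "Loving the new #Android update!",   screenName = "user_a"),
  list(text = "Best #Android games this week",     screenName = "user_b"),
  list(text = "#Android vs iOS: which is better?", screenName = "user_c")
)

# Turn each item into a one-row data frame, then stack the rows
rows <- lapply(mock_tweets, function(t) as.data.frame(t, stringsAsFactors = FALSE))
tweets_df <- do.call("rbind", rows)

dim(tweets_df)      # number of rows (tweets) and columns (fields)
tweets_df$text[3]   # text of the third tweet
```

The text column of such a data frame is what gets passed to VectorSource when the corpus is built.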

Before building a word cloud in R, we first need to prepare the source text. We converted the text to lower case, removed punctuation and stop words, and finally stemmed the corpus.

# Lower Case

corpus = tm_map(corpus, content_transformer(tolower))
corpus[[1]]


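# Remove punctuation
corpus = tm_map(corpus, removePunctuation)
corpus[[2]]

# Remove stop words
# Preview the first few English stop words
stopwords("english")[1:10]
# The text is already lower case, so the search term is removed as "android"
corpus = tm_map(corpus, removeWords, c("android", stopwords("english")))
corpus[[1]]
# Stemming (stemDocument relies on the SnowballC package)
corpus = tm_map(corpus, stemDocument)
corpus[[1]]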
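To see what these cleaning steps do to a single tweet, the sketch below repeats lower-casing, punctuation removal and stop word removal in base R on two made-up sentences. It is an illustration only, not the tm implementation: the stop word list is a short invented one, the function name clean_text is our own, and stemming is left out.

```r
texts <- c("Loving the new #Android update!",
           "The best Android games, this week.")

# Illustrative stop word list (much shorter than the one in the tm package)
stop_words <- c("android", "the", "this", "a", "an", "of")

# Hypothetical helper: clean one piece of text
clean_text <- function(x, stop_words) {
  x <- tolower(x)                                  # lower case
  x <- gsub("[[:punct:]]", "", x)                  # remove punctuation
  words <- strsplit(x, "[[:space:]]+")[[1]]        # split into words
  words <- words[nzchar(words) & !(words %in% stop_words)]  # drop stop words
  paste(words, collapse = " ")
}

cleaned <- vapply(texts, clean_text, character(1),
                  stop_words = stop_words, USE.NAMES = FALSE)
cleaned
# "loving new update" "best games week"
```

Because the text is lower-cased before the stop words are removed, the search term has to appear in the stop word list in lower case for it to be dropped.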

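The last and most important step is forming the word cloud:

# Build the document-term matrix (rows are tweets, columns are terms)
myDTM = DocumentTermMatrix(corpus, control = list(minWordLength=1))
m = as.matrix(myDTM)
# Total frequency of each term, most frequent first
v = sort(colSums(m), decreasing=TRUE)
myNames = names(v)
myNames
d = data.frame(word=myNames, freq=v)
# Plot the terms that occur at least 4 times
wordcloud(d$word, d$freq, min.freq=4)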
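The frequency table that feeds the word cloud can be reproduced on a small scale in base R. The sketch below counts words across three made-up, already cleaned documents and keeps those above a minimum frequency, which is what the min.freq argument does when the cloud is drawn. The sample documents and the threshold of 2 are assumptions chosen for illustration.

```r
# Made-up documents standing in for the cleaned, stemmed tweets
docs <- c("androidgam gameinsight play",
          "androidgam new phone",
          "gameinsight androidgam")

# Split each document into words
tokens <- strsplit(docs, " ", fixed = TRUE)

# Count every word across all documents and sort, most frequent first
freq <- sort(table(unlist(tokens)), decreasing = TRUE)

d <- data.frame(word = names(freq), freq = as.integer(freq),
                stringsAsFactors = FALSE)

# Keep the words that a cloud with a minimum frequency of 2 would draw
d[d$freq >= 2, ]
```

In a word cloud, the words with the highest counts in such a table are drawn largest.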

Output


CONCLUSION

We can conclude from the word cloud that the most prominent terms concerning Android on Twitter are androidgam, which is probably the stemmed form of Android gaming, and something called Gameinsight.