Download - R tutorial

An introduction to R

Version 1.4

November 2014

Prof. Richard Vidgen Management Systems

Hull University Business School E: [email protected]

http://datasciencebusiness.wordpress.com

Aims and use of this presentation

This presentation provides an introduction to R for anyone who wants to get an idea of what R is and what it can do. Although no prior experience of R is needed, a basic understanding of data analysis (e.g., multiple regression) is assumed, as is a basic technical competence (e.g., installing software and managing directory structures). If you are already using SPSS you will get a feel for how it compares with R. It’s a work in progress and will be updated based on experience and feedback.

Docendo discimus (Latin: "by teaching, we learn").

Contents

•  Predictive analytics and analytics tools
•  An overview of R
•  Installing R and an R IDE (integrated development environment)
•  R syntax and data types
•  Multiple regression in R and SPSS
•  The twitteR package
•  The Rfacebook package

Resources

•  The R source and data files for this tutorial can be accessed at:
–  http://datasciencebusiness.wordpress.com

•  R_intro.R
•  R_regression.R
•  R_twitter.R
•  R_twitter_sentiment.R
•  sentiment.R
•  R_facebook.R
•  insurance.csv
•  twitterHull.csv
•  positive-words.txt
•  negative-words.txt

WARNING: when R packages are updated by developers things can break and your R programs may stop working. The code in the files above is tested regularly with the latest packages and updated as necessary

Predictive analytics

Better decisions - predictive analytics

•  A predictive model that calculates strawberry purchases based on:
–  Weather forecast
–  Store temperature
–  Freezer sensor data
–  Remaining stock per shelf life
–  Sales transaction point of sale feeds
–  Web searches, social mentions

http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do

Predictive analytics

•  For example, what data might help us predict which students will drop out?
–  Assessment grades at University
–  Prior education attainment
–  Social background
–  Distance of home from University
–  Friendship circles and networks (e.g., sports club memberships)
–  Attendance at lectures and tutorials
–  Interaction in lectures and tutorials
–  Time spent on campus
–  Time spent in library
–  Number of accesses to electronic learning resources
–  Text books purchased
–  Engagement in subject-related forums
–  Sentiment of social media posts
–  Etc.

http://www.slideshare.net/datasciencelondon/big-data-sorry-data-science-what-does-a-data-scientist-do

Some of the techniques data scientists use

•  Classification
•  Clustering
•  Association rules
•  Decision trees
•  Regression
•  Genetic algorithms
•  Neural networks and support vector machines
•  Machine learning
•  Natural language processing
•  Sentiment analysis
•  Artificial intelligence
•  Time series analysis
•  Simulations
•  Social network analysis

Technologies for data analysis: usage rates

King, J., & R. Magoulas (2013). Data Science Salary Survey. O’Reilly Media.

R and Python programming languages come above Excel

Enterprise products bottom of the heap

Data scientist as “bricoleur”

“In the practical arts and the fine arts, bricolage (French for "tinkering") is the construction or creation of a work from a diverse range of things that happen to be available, or a work created by such a process.” Wikipedia

The R environment

What is R?

R is an open source computer language used for data manipulation, statistics, and graphics.

History of R

•  1976 – Bell Labs develops S, a language for data analysis; released commercially as S-plus

•  1990s – R written and released as open source by (R)oss Ihaka and (R)obert Gentleman

•  1997 – The Comprehensive R Archive Network (CRAN) launched

•  August 2014 – CRAN repository contains 5789 user-contributed packages

Benefits of R

•  It’s free!

•  Runs on multiple platforms (Windows, Unix, MacOS)

•  Validation/replication of analyses (assumes commented code and documentation)

•  Long term efficiency (using the same code for multiple projects)

SPSS* vs R

SPSS
•  Limited ability for data scientist to change the environment
•  Data scientist relies on algorithms developed by SPSS
•  Problem-solving constrained by SPSS developers
•  Must pay for using the constrained algorithms

R
•  Can use functions made by a global community of statistics researchers or create their own
•  Almost unlimited in their ability to change their environment
•  Can do things SPSS users cannot even dream of
•  Get all this for free

*or any other proprietary closed software system

Install R from here: http://www.r-project.org

The R console

1. Type in commands, select the text and run with (cmd + return) or menu option: edit | execute

2. Output appears here

R integrated development environments (IDEs)

•  Some free IDEs
–  Revolution R Enterprise
–  Architect
–  R Studio
   •  Most widely used R IDE
   •  It’s simple and intuitive
   •  Used to build this tutorial

Revolution Analytics

http://www.revolutionanalytics.com

Architect

http://www.openanalytics.eu

R Studio

Install R Studio from here: http://www.rstudio.com

R Studio

Type code here

Results appear here

Packages, plots, files

Environment and history

The R language

Basic grammar of R

object = function(arguments)

Guess what this does

Z <- read.table("MyFile.txt")

Two ways of doing it

= is the same as <-

Getting help

help(getwd)
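As a rough illustration of the grammar and the two assignment forms above, the sketch below uses only base R. The object names (total, total2, avg) are made up for this example and do not appear in the tutorial files.

```r
# object <- function(arguments): call a function and store the result
total <- sum(1, 2, 3)     # arrow assignment
total2 = sum(1, 2, 3)     # equals assignment does the same at the top level
identical(total, total2)  # compare the two results

# inside a function call, = names an argument rather than creating an object
avg <- mean(x = c(2, 4, 6))

# help(sum) or ?sum opens the documentation for a function
```

Most R style guides prefer <- for assignment and keep = for naming arguments, which avoids confusing the two roles.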

Reading the slides

# Comments are in blue
<- Code is in green
Output is in black

Set the working directory

# For the tutorial, load all the R program and data files provided (see Resources slide)
# into a directory of your choice
# Set your working directory to this directory, e.g., for a Mac
setwd("/Users/ … somewhere on your computer … /R_tutorial")
# and for Windows (R is case-sensitive, so the function is setwd, not Setwd)
setwd("C:/ … somewhere on your computer … /R_tutorial")
# List the files in the directory
list.files()

> list.files()
 [1] "insurance.csv"         "negative-words.txt"    "positive-words.txt"
 [4] "R_facebook.R"          "R_intro.R"             "R_regression.R"
 [7] "R_twitter_sentiment.R" "R_twitter.R"           "sentiment.R"
[10] "twitterHull.csv"

Data types and data structures

Data types
•  Numeric
•  Character
•  Logical

Data structures
•  Vectors
•  Lists
•  Multi-dimensional
–  Matrices
–  Dataframes

Vectors and data classes

# this is a comment
num.var <- c(1, 2, 3, 4) # numeric vector
char.var <- c("1", "2", "3", "4") # character vector
log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector

> class(num.var)
[1] "numeric"
> class(char.var)
[1] "character"
> class(log.var)
[1] "logical"

Values can be combined into vectors using the c() function

Vectors have a class which determines how functions treat them

Vectors and data classes

> mean(num.var)
[1] 2.5
> mean(char.var)
[1] NA
Warning message:
In mean.default(char.var) :
  argument is not numeric or logical: returning NA

Can calculate mean of a numeric vector, but not of a character vector
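If a character vector actually holds numbers stored as text, one common fix is to convert it first. The sketch below is a minimal base R example; the name num.from.char is made up for illustration.

```r
char.var <- c("1", "2", "3", "4")      # character vector, as on the slide
num.from.char <- as.numeric(char.var)  # convert the text digits to numbers
class(num.from.char)                   # check the class after conversion
mean(num.from.char)                    # the mean can now be calculated
```

Text that cannot be read as a number becomes NA, with a warning, so it is worth checking the converted vector before analysing it.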

Lists

# create a list - a collection of vectors
employees <- c("John", "Sunil", "Anna")
yearsService <- c(3, 2, 6)
empDetails <- list(employees, yearsService)
class(empDetails)
empDetails

> class(empDetails)
[1] "list"
> empDetails
[[1]]
[1] "John"  "Sunil" "Anna"

[[2]]
[1] 3 2 6

Dataframes

DF <- data.frame(x=1:5, y=letters[1:5], z=letters[6:10])

> DF # data.frame with 3 columns and 5 rows
  x y z
1 1 a f
2 2 b g
3 3 c h
4 4 d i
5 5 e j

A data.frame is a list of vectors, each of the same length
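The statement above can be checked directly. This is a small sketch in base R that rebuilds the same DF and inspects it as a list; nothing here comes from the tutorial's data files.

```r
DF <- data.frame(x = 1:5, y = letters[1:5], z = letters[6:10])

is.list(DF)          # a data.frame is stored as a list of columns
sapply(DF, length)   # every column (vector) has the same length
DF$x                 # extract one column as a vector
DF[DF$x > 3, ]       # keep only the rows where x is greater than 3
```

Because each column is a vector, everything that works on vectors (mean(), class(), and so on) works on a single column such as DF$x.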

Multiple regression in R

insurance.csv

•  insurance.csv contains medical expenses for patients enrolled in a healthcare plan

•  The data file contains 1,338 cases with features of the patient as well as the total medical expenses charged to the patient’s healthcare plan for the calendar year

•  There are no missing values (these would be shown as NA in R indicating empty or null)

insurance.csv

Variable    Description

age         an integer indicating the age of the beneficiary

sex         either “male” or “female”

bmi         body mass index (BMI), which gives an indication of how over or under-weight a person is. BMI is calculated as weight in kilograms divided by height in metres squared. An ideal BMI is in the range 18.5 to 24.9

children    an integer showing the number of children/dependents covered by the plan

smoker      “yes” or “no”

region      the beneficiary’s place of residence, divided into four regions: “northeast”, “southeast”, “southwest”, or “northwest”

This example is taken from Lantz (2013)

Read the data

insurance <- read.csv("insurance.csv", stringsAsFactors = TRUE)
head(insurance)

  age    sex    bmi children smoker    region   charges
1  19 female 27.900        0    yes southwest 16884.924
2  18   male 33.770        1     no southeast  1725.552
3  28   male 33.000        3     no southeast  4449.462
4  33   male 22.705        0     no northwest 21984.471
5  32   male 28.880        0     no northwest  3866.855
6  31 female 25.740        0     no southeast  3756.622

Working with data

•  There are specific functions for reading from (and writing to) Excel, SPSS, SAS, etc.

•  However, the simplest way is to export and import files in csv (comma separated values) format – the lingua franca of data
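The csv round trip described above can be sketched in a few lines of base R. The data frame below is made up for illustration (it is not taken from insurance.csv), and a temporary file is assumed so that nothing is written to the working directory.

```r
# small made-up data frame (hypothetical values)
patients <- data.frame(age = c(19, 18, 28),
                       smoker = c("yes", "no", "no"))

csv_file <- tempfile(fileext = ".csv")                     # temporary file for the demo
write.csv(patients, csv_file, row.names = FALSE)           # export to csv
patients2 <- read.csv(csv_file, stringsAsFactors = FALSE)  # import it again

patients2
```

Setting row.names = FALSE stops R writing its row numbers as an extra, unnamed column, which other packages would otherwise read as data.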

Explore the data

summary(insurance$charges)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   1122    4740    9382   13270   16640   63770

table(insurance$region)

northeast northwest southeast southwest
      324       325       364       325

cor(insurance[c("age", "bmi", "children", "charges")])
               age       bmi   children    charges
age      1.0000000 0.1092719 0.04246900 0.29900819
bmi      0.1092719 1.0000000 0.01275890 0.19834097
children 0.0424690 0.0127589 1.00000000 0.06799823
charges  0.2990082 0.1983410 0.06799823 1.00000000

Visualise the data

hist(insurance$charges)

Visualise the data

pairs(insurance[c("age", "bmi", "children", "charges")])

Visualise the data - better

library(psych)
pairs.panels(insurance[c("age", "bmi", "children", "charges")], hist.col="yellow")

Installing packages

•  pairs.panels is a function in the psych package, which needs to be installed before it can be loaded with library(psych)
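One way to install a package only when it is missing is sketched below. The helper name ensure_package is hypothetical (it is not part of R or of the tutorial files); install.packages() and requireNamespace() are standard R functions.

```r
# hypothetical helper: install a package only if it is not already available
ensure_package <- function(pkg) {
  if (!requireNamespace(pkg, quietly = TRUE)) {
    install.packages(pkg)   # downloads from CRAN; needs an internet connection
  }
  requireNamespace(pkg, quietly = TRUE)   # TRUE if the package can now be loaded
}

ensure_package("stats")    # "stats" ships with R, so nothing is downloaded
# ensure_package("psych")  # uncomment to install psych for pairs.panels()
```

In R Studio the same thing can be done from the Packages pane; a package is installed once and then loaded with library() in each session.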

Multiple regression 1

ins_model1 <- lm(charges ~ age + children + bmi, data = insurance)
summary(ins_model1)

Residuals:
   Min     1Q Median     3Q    Max
-13884  -6994  -5092   7125  48627

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -6916.24    1757.48  -3.935 8.74e-05 ***
age           239.99      22.29  10.767  < 2e-16 ***
children      542.86     258.24   2.102   0.0357 *
bmi           332.08      51.31   6.472 1.35e-10 ***
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 11370 on 1334 degrees of freedom
Multiple R-squared: 0.1201, Adjusted R-squared: 0.1181
F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16

11.8% of the variation in insurance charges is explained by the model (adjusted R-squared = 0.1181)

Multiple regression 2

ins_model2 <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance)
summary(ins_model2)

Residuals:
     Min       1Q   Median       3Q      Max
-11304.9  -2848.1   -982.1   1393.9  29992.8

Coefficients:
                Estimate Std. Error t value Pr(>|t|)
(Intercept)     -11938.5      987.8 -12.086  < 2e-16 ***
age                256.9       11.9  21.587  < 2e-16 ***
children           475.5      137.8   3.451 0.000577 ***
bmi                339.2       28.6  11.860  < 2e-16 ***
sexmale           -131.3      332.9  -0.394 0.693348
smokeryes        23848.5      413.1  57.723  < 2e-16 ***
regionnorthwest   -353.0      476.3  -0.741 0.458769
regionsoutheast  -1035.0      478.7  -2.162 0.030782 *
regionsouthwest   -960.0      477.9  -2.009 0.044765 *
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 6062 on 1329 degrees of freedom
Multiple R-squared: 0.7509, Adjusted R-squared: 0.7494
F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16

74.9% of the variation in insurance charges is explained by the model (adjusted R-squared = 0.7494)
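The "variation explained" figures quoted for the two models are adjusted R-squared values. As a hedged sketch of where such a number comes from, the example below recomputes R-squared by hand. It assumes R's built-in mtcars data set as a stand-in for insurance.csv, and the variable names (fit, ss_res, ss_tot, r2, adj_r2) are illustrative only.

```r
# fit a small model on the built-in mtcars data (a stand-in for insurance.csv)
fit <- lm(mpg ~ wt + hp, data = mtcars)

ss_res <- sum(residuals(fit)^2)                    # unexplained variation
ss_tot <- sum((mtcars$mpg - mean(mtcars$mpg))^2)   # total variation
r2 <- 1 - ss_res / ss_tot                          # share of variation explained

# the adjusted version penalises the number of predictors (p) for sample size (n)
n <- nrow(mtcars)
p <- 2
adj_r2 <- 1 - (1 - r2) * (n - 1) / (n - p - 1)

c(r2 = r2, adj_r2 = adj_r2)
```

The same values are reported by summary(fit) as "Multiple R-squared" and "Adjusted R-squared".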

Dummy variables

•  R coded character vectors as factors (stringsAsFactors = TRUE) and automatically analysed them as dummy variables

•  In SPSS these need to be coded by hand

SPSS - create dummy variables

•  Recode sex
–  0 = female
–  1 = male

•  Recode smoker
–  1 = yes
–  0 = no

•  Recode region
–  Number of dummies = number of groups – 1
–  = 4 – 1 = 3
–  northeast = 0, 0, 0
–  northwest = 1, 0, 0
–  southeast = 0, 1, 0
–  southwest = 0, 0, 1
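The region coding above is the same scheme R applies to a factor automatically. A minimal sketch using base R's model.matrix() shows the dummy columns it builds; the four-value region vector is made up for illustration, one row per group.

```r
# one made-up observation per region
region <- factor(c("northeast", "northwest", "southeast", "southwest"))

# model.matrix() expands the factor; drop the intercept to leave the dummies
dummies <- model.matrix(~ region)[, -1]
rownames(dummies) <- as.character(region)

dummies   # northeast, the first level, is the baseline with all zeros
```

The first factor level (alphabetically, northeast) becomes the reference group, which is why the regression output lists only regionnorthwest, regionsoutheast and regionsouthwest.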

Recode

Paste the syntax

Data set with all dummy coded variables created*

*after quite a bit of work!

Mining Twitter with R

twitteR

•  Install the twitteR package to access the Twitter API

•  Before you can access Twitter from R you have to:
–  Sign up for a Twitter developer account
–  Create a Twitter app and copy the authentication details

Twitter authentication

For details of how to authenticate and set up twitteR: http://thinktostart.wordpress.com/2013/05/22/twitter-authentification-with-r/


Twitter authentication

#############################################
# 1 - Authenticate with Twitter API
#############################################
library(twitteR)
library(ROAuth)
api_key <- "xxxxxxxxxxxxxxxxxxYsnu5NM"
api_secret <- "wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05"
access_token <- "581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o"
access_token_secret <- "tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp"
setup_twitter_oauth(api_key, api_secret, access_token, access_token_secret)

The keys and tokens come from your Twitter application settings

Retrieve Twitter account details

############################################################
# 2 - Get details of Hull Uni Twitter account
############################################################
twitacc <- getUser('UniOfHull')
twitacc$getDescription()
twitacc$getFollowersCount()
friends = twitacc$getFriends(10) # limit the number of friends returned to 10
friends

> twitacc$getDescription()
[1] "Our official Twitter feed featuring the latest news and events from the University of Hull"
> twitacc$getFollowersCount()
[1] 20025
> friends = twitacc$getFriends(10)
> friends
$`2644622095`
[1] "HullNursing"

$`1536529304`
[1] "HullSimulation"

Trace the network

net <- getUser(friends[8]) # Sheridansmith1 is friend no. 8
net$getDescription()
net$getFollowersCount()
net$getFriends(n=10)

> net$getDescription()
[1] "Sister of @damiandsmith of that there band @_TheTorn :) x"
> net$getFollowersCount()
[1] 502167
> net$getFriends(n=10)
$`88598283`
[1] "overnightstv"

$`423373477`
[1] "IsleLoseIt"

$`19650489`
[1] "donneriron"

Get tweets and write to file

############################################################
# 3 - Search Twitter
############################################################
Twitter.list <- searchTwitter('@UniOfHull', n=500) # limited to 500 tweets for demo
Twitter.df = twListToDF(Twitter.list)
write.csv(Twitter.df, file='twitter.csv', row.names=FALSE)

Tweet data read into Excel

Tweet text @UniOfHull

Analyse the tweets: the tm package

# read the twitter data into a data frame (use previously stored .csv to
# replicate the twitter analysis presented here)
tweet_raw <- read.csv("twitterHull.csv", stringsAsFactors = FALSE)
# remove the non-text characters
tweet_raw$text <- gsub("[^[:alnum:]///' ]", "", tweet_raw$text)
# build a corpus, which is a collection of text documents
# VectorSource specifies that the source is a character vector
library(tm)
myCorpus <- Corpus(VectorSource(tweet_raw$text))

Inspect the corpus

# examine the first three documents in the corpus
inspect(myCorpus[1:3])

[[1]]
Have you been to the UniOfHull popup campus in Leeds yet It's at the White Cloth Gallery Aire Street for all your Clearing2014 queries

[[2]]
RT UniOfHull In Newcastle and looking for Clearing2014 advice Head to the UniOfHull popup campus at Newcastle Arts Centre on Westgate

[[3]]
RT Bethanyn96 Got into uniofhull so happy

Clean corpus and create a wordcloud

# clean up the corpus using tm_map(), starting from myCorpus
# (tolower is wrapped in content_transformer() so the result stays a corpus in tm 0.6 and later)
corpus_clean <- tm_map(myCorpus, content_transformer(tolower))
corpus_clean <- tm_map(corpus_clean, removeNumbers)
corpus_clean <- tm_map(corpus_clean, removeWords, stopwords())
corpus_clean <- tm_map(corpus_clean, removePunctuation)
corpus_clean <- tm_map(corpus_clean, stripWhitespace)
# remove any words not wanted in the wordcloud
corpus_clean <- tm_map(corpus_clean, removeWords, "uniofhull")
# create a wordcloud
library(wordcloud)
wordcloud(corpus_clean, min.freq = 10, random.order = FALSE, colors=brewer.pal(8, "Dark2"))

August 2014

Sentiment analysis

•  What is the sentiment of the tweets?

•  How does the number of positive words compare with the number of negative words?

•  The number of positive words minus the number of negative words gives a rough indication of the “sentiment” of the tweet

•  The positive and negative words are taken from a word list developed by Hu and Liu:
–  http://www.cs.uic.edu/~liub/FBS/opinion-lexicon-English.rar
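The scoring rule above (positive matches minus negative matches) can be sketched in a few lines of base R. The short word lists and the helper name simple_score are made up for illustration; they stand in for the Hu and Liu files and are not the tutorial's own scoring function.

```r
# tiny made-up word lists standing in for the Hu and Liu lexicon
pos.words <- c("happy", "great", "good")
neg.words <- c("bad", "awful", "late")

# hypothetical helper: count positive matches and subtract negative matches
simple_score <- function(text) {
  words <- strsplit(tolower(text), "[^a-z]+")[[1]]   # lower-case, split on non-letters
  sum(words %in% pos.words) - sum(words %in% neg.words)
}

simple_score("Got into uniofhull so happy")       # one positive word
simple_score("Bad weather and the bus was late")  # two negative words
```

A score above zero suggests a positive tweet and a score below zero a negative one; the method ignores negation and sarcasm, so it is only a rough indication.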

Sentiment analysis

# load libraries
library(plyr)
library(ggplot2)
# load the score.sentiment() function – see Appendix A for code
source('sentiment.R')
# read the tweets saved previously
hull.tweets <- read.csv("twitterHull.csv", stringsAsFactors = FALSE)
# read the lists of pos and neg words from Hu & Liu
hu.liu.pos = scan('positive-words.txt', what='character', comment.char=';')
hu.liu.neg = scan('negative-words.txt', what='character', comment.char=';')

Sentiment analysis

# extract the text of the tweets and pass to the sentiment function for
# scoring (see Appendix A for R code), using the Hu & Liu word lists
hull.text = hull.tweets$text
hull.scores = score.sentiment(hull.text, hu.liu.pos, hu.liu.neg, .progress='text')
# make a histogram of the scores
ggplot(hull.scores, aes(x=score)) +
  geom_histogram(binwidth=1, colour="black", fill="lightblue") +
  xlab("Sentiment score") +
  ylab("Frequency") +
  ggtitle("TWITTER: Hull University Sentiment Analysis")

Number of positive word matches minus the number of negative word matches

Writing the data out for text analysis

•  Many text analysis packages require each comment to be in a separate file

•  If the data is in Excel or SPSS it will be cumbersome to generate the files manually

•  Write R code instead

Create an output directory

# create an output directory for the txt files if it does not exist
mainDir <- getwd()
subDir <- "outputText"
if (file.exists(subDir)) {
  setwd(file.path(mainDir, subDir))
} else {
  dir.create(file.path(mainDir, subDir))
  setwd(file.path(mainDir, subDir))
}

Loop to write the files

# find out how many rows in the data.frame
tweets = nrow(Twitter.df)
# loop to write the txt files
for (tweet in 1:tweets) {
  tweetText = Twitter.df[tweet, 1]
  filename = paste("output", tweet, ".txt", sep = "")
  writeLines(tweetText, con = filename)
}

Note that R has iterators that often remove the need to write loops, see the “apply” family of functions
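As a hedged sketch of the "apply" approach, the example below writes one file per text without an explicit for loop. It assumes a small made-up character vector (tweet_text) in place of the downloaded tweets and writes to a temporary directory rather than the working directory.

```r
# made-up stand-in for the tweet text column
tweet_text <- c("first tweet", "second tweet", "third tweet")

out_dir <- tempfile("outputText")   # temporary directory for the demo
dir.create(out_dir)

# build all the file names at once, then write one file per tweet
files <- file.path(out_dir, paste0("output", seq_along(tweet_text), ".txt"))
invisible(mapply(writeLines, tweet_text, files))

# sapply applies a function to each element and collects the results
sapply(tweet_text, nchar, USE.NAMES = FALSE)
```

mapply() walks along the texts and the file names in parallel, calling writeLines() once per pair, which is what the for loop does by hand.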

Accessing Facebook with R

Rfacebook

## loading libraries
library(Rfacebook)
library(igraph)
# get your token from 'https://developers.facebook.com/tools/explorer/'
# make sure you give permissions to access your list of friends
# set your FB token
token <- "xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx"

For details of how to authenticate and use Rfacebook: http://pablobarbera.com/blog/archives/3.html

Get the access token here: https://developers.facebook.com/tools/explorer

Get the data and graph the network

# download adjacency matrix for network of Facebook friends
my_network <- getNetwork(token, format="adj.matrix")
# friends who are friends with me alone
singletons <- rowSums(my_network)==0
# graph the network
my_graph <- graph.adjacency(my_network[!singletons,!singletons])
layout <- layout.drl(my_graph, options=list(simmer.attraction=0))
plot(my_graph, vertex.size=2,
     #vertex.label=NA,
     vertex.label.cex=0.5,
     edge.arrow.size=0, edge.curved=TRUE, layout=layout)

Facebook network graph

Ok, it’s ugly – there are plenty more social network analysis and graphing packages in R to try, or you can write a bit of code to export the adjacency matrix to another package, e.g., UCINET/Netdraw, Pajek, Gephi

Visualization in Gephi

write.graph(graph = my_graph, file = 'fb.gml', format = 'gml')

To find out what functions a package supports, what they do and how to call them, see the documentation

Beg, borrow, steal code!

•  Don’t bother writing code from scratch

•  If you want to know how to do something then Google it

•  There will likely be a solution that you can scrape off the screen and modify for your own purposes

•  For example
–  How would you remove rows from a dataframe with missing values (NA)?
–  Try “r how to remove missing values” in Google
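For the missing-values question above, a search would most likely turn up base R's na.omit() and complete.cases(). The sketch below uses a small made-up data frame; the names df, complete and same are illustrative only.

```r
# made-up data frame with two missing scores
df <- data.frame(id = 1:4, score = c(10, NA, 7, NA))

complete <- na.omit(df)            # drop every row that contains an NA
same <- df[complete.cases(df), ]   # equivalent, using a logical row index

nrow(df)         # rows before
nrow(complete)   # rows after
```

complete.cases() is handy when only some columns matter, for example df[complete.cases(df$score), ] keeps rows where score is present regardless of other columns.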

What’s next?

•  Access data held in SQL databases and store your own data, e.g., MySQL, and the packages

•  Read, write, manipulate Excel spreadsheets using xls and XLConnect

•  Access maps, e.g., Google Maps, and overlay location data (e.g., Tweets) on a map

•  Screen scrape Web sites that don’t have APIs (e.g., Google Scholar)

Suggested further reading and resources

•  Lantz, B., (2013). Machine Learning with R. Packt Publishing. (highly recommended)

•  Miller, T., (2014). Modeling Techniques in Predictive Analytics: Business Problems and Solutions. Pearson Education.

•  R Reference Card 2.0
–  http://cran.r-project.org/doc/contrib/Baggott-refcard-v2.pdf

•  R Reference Card for Data Mining
–  http://cran.r-project.org/doc/contrib/YanchangZhao-refcard-data-mining.pdf

•  R-bloggers for news and tutorials
–  http://www.r-bloggers.com

Appendices

Appendix A The score.sentiment() function

https://github.com/jeffreybreen/twitter-sentiment-analysis-tutorial-201107

#'
#' score.sentiment() implements a very simple algorithm to estimate
#' sentiment, assigning an integer score by subtracting the number
#' of occurrences of negative words from that of positive words.
#'
#' @param sentences vector of text to score
#' @param pos.words vector of words of positive sentiment
#' @param neg.words vector of words of negative sentiment
#' @param .progress passed to <code>laply()</code> to control the progress bar
#' @returnType data.frame
#' @return data.frame of text and corresponding sentiment scores
#' @author Jeffrey Breen [email protected]
score.sentiment = function(sentences, pos.words, neg.words, .progress='none')
{
  require(plyr)
  require(stringr)

  # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us
  # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply:
  scores = laply(sentences, function(sentence, pos.words, neg.words) {

    # remove the non-text characters
    sentence <- gsub("[^[:alnum:]///' ]", "", sentence)

    # clean up sentences with R's regex-driven global substitute, gsub():
    sentence = gsub('[[:punct:]]', '', sentence)
    sentence = gsub('[[:cntrl:]]', '', sentence)
    sentence = gsub('\\d+', '', sentence)
    # and convert to lower case:
    sentence = tolower(sentence)

    # split into words. str_split is in the stringr package
    word.list = str_split(sentence, '\\s+')
    # sometimes a list() is one level of hierarchy too much
    words = unlist(word.list)

    # compare our words to the dictionaries of positive & negative terms
    pos.matches = match(words, pos.words)
    neg.matches = match(words, neg.words)

    # match() returns the position of the matched term or NA
    # we just want a TRUE/FALSE:
    pos.matches = !is.na(pos.matches)
    neg.matches = !is.na(neg.matches)

    # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum():
    score = sum(pos.matches) - sum(neg.matches)

    return(score)
  }, pos.words, neg.words, .progress=.progress)

  scores.df = data.frame(score=scores, text=sentences)
  return(scores.df)
}
