Download - R tutorial
![Page 1: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/1.jpg)
An introduction to R
Version 1.4
November 2014
Prof. Richard Vidgen Management Systems
Hull University Business School E: [email protected]
h=p://datasciencebusiness.wordpress.com
![Page 2: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/2.jpg)
Aims and use of this presentaDon This presentaDon provides an introducDon to R for anyone who wants to get an idea of what R is and what it can do. Although no prior experience of R is needed, a basic understanding of data analysis (e.g., mulDple regression) is assumed, as is a basic technical competence (e.g., installing soKware and managing directory structures). If you are already using SPSS you will get a feel for how it compares with R. It’s a work in progress and will be updated based on experience and feedback.
Docendo discimus, (LaDn "by teaching, we learn").
![Page 3: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/3.jpg)
Contents
• PredicDve analyDcs and analyDcs tools • An overview of R • Installing R and an R IDE (integrated development environment)
• R syntax and data types • MulDple regression in R and SPSS • The Twi=eR package • The Rfacebook package
![Page 4: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/4.jpg)
Resources • The R source and data files for this tutorial can be accessed at: – h=p://datasciencebusiness.wordpress.com
• R_intro.R • R_regression.R • R_twi=er.R • R_twi=er_senDment.R • senDment.R • R_facebook.R • insurance.csv • twi=erHull.csv • posiDve-‐words.txt • negaDve-‐words.txt
WARNING: when R packages are updated by developers things can break and your R programs may stop working. The code in the files above is tested regularly with the latest packages and updated as necessary
![Page 5: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/5.jpg)
Predictive analytics
![Page 6: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/6.jpg)
Be=er decisions -‐ predicDve analyDcs
• A predicDve model that calculates strawberry purchases based on: – Weather forecast – Store temperature – Freezer sensor data – Remaining stock per shelf life – Sales transacDon point of sale feeds – Web searches, social menDons
h=p://www.slideshare.net/datasciencelondon/big-‐data-‐sorry-‐data-‐science-‐what-‐does-‐a-‐data-‐scienDst-‐do
![Page 7: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/7.jpg)
PredicDve analyDcs • For example, what data might help us predict which students will drop
out? – Assessment grades at University – Prior educaDon a=ainment – Social background – Distance of home from University – Friendship circles and networks (e.g., sports club memberships) – A=endance at lectures and tutorials
– InteracDon in lectures and tutorials – Time spent on campus – Time spent in library – Number of accesses to electronic learning resources
– Text books purchased – Engagement in subject-‐related forums – SenDment of social media posts – Etc.
![Page 8: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/8.jpg)
h=p://www.slideshare.net/datasciencelondon/big-‐data-‐sorry-‐data-‐science-‐what-‐does-‐a-‐data-‐scienDst-‐do
![Page 9: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/9.jpg)
Some of the techniques data scienDsts use
• ClassificaDon • Clustering • AssociaDon rules • Decision trees • Regression • GeneDc algorithms • Neural networks and
support vector machines
• Machine learning
• Natural language processing
• SenDment analysis
• ArDficial intelligence
• Time series analysis
• SimulaDons
• Social network analysis
![Page 10: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/10.jpg)
Technologies for data analysis: usage rates
King, J., & R. Magoulas (2013). Data Science Salary Survey. O’Reilly Media.
R and Python programming languages come above Excel
Enterprise products bo=om of the heap
![Page 11: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/11.jpg)
Data scienDst as “bricoleur” “In the pracDcal arts and the fine arts, bricolage (French for "Dnkering") is the construcDon or creaDon of a work from a diverse range of things that happen to be available, or a work created by such a process.” Wikipedia
![Page 12: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/12.jpg)
The R environment
![Page 13: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/13.jpg)
What is R? R is an open source computer language used for data manipulaDon, staDsDcs, and graphics.
![Page 14: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/14.jpg)
History of R
• 1976 – Bell Labs develops S, a language for data analysis; released commercially as S-‐plus
• 1990s – R wri=en and released as open source by (R)oss Ihaka and (R)obert Gentleman
• 1997 – The Comprehensive R Archive Network (CRAN) launched
• August 2014 – CRAN repository contains 5789 user-‐contributed packages
![Page 15: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/15.jpg)
Benefits of R
• It’s free! • Runs on mulDple plaqorms (Windows, Unix, MacOS)
• ValidaDon/replicaDon of analyses (assumes commented code and documentaDon)
• Long term efficiency (using the same code for mulDple projects)
![Page 16: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/16.jpg)
SPSS* vs R
SPSS • Limited ability for data
scienDst to change the environment
• Data scienDst relies on algorithms developed by SPSS
• Problem-‐solving constrained by SPSS developers
• Must pay for using the constrained algorithms
R • Can use funcDons made by
a global community of staDsDcs researchers or create their own
• Almost unlimited in their ability to change their environment
• Can do things SPSS users cannot even dream of
• Get all this for free
*or any other proprietary closed soKware system
![Page 17: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/17.jpg)
h=p://www.r-‐project.org Install R from here
![Page 18: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/18.jpg)
The R console
![Page 19: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/19.jpg)
2. Output appears here
1. Type in commands, select the text and run with (cmd + return) or menu opDon: edit | execute
![Page 20: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/20.jpg)
R integrated development environments (IDEs)
• Some free IDEs – RevoluDon R Enterprise – Architect – R Studio
• Most widely used R IDE • It’s simple and intuiDve • Used to build this tutorial
![Page 21: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/21.jpg)
RevoluDon analyDcs
h=p://www.revoluDonanalyDcs.com
![Page 22: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/22.jpg)
Architect
h=p://www.openanalyDcs.eu
![Page 23: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/23.jpg)
R Studio
h=p://www.rstudio.com Install R Studio from here
![Page 24: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/24.jpg)
R Studio
Type code here
Results appear here
Packages, plots, files
Environment and history
![Page 25: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/25.jpg)
The R language
![Page 26: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/26.jpg)
Basic grammar of R
object = funcDon(arguments)
![Page 27: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/27.jpg)
Guess what this does
Z <-‐ read.table(“MyFile.txt”)
![Page 28: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/28.jpg)
Two ways of doing it
= is the same as
<-‐
![Page 29: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/29.jpg)
Gezng help help(getwd)
![Page 30: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/30.jpg)
Reading the slides
# Comments are in blue <-‐ Code is in green Output is in black
![Page 31: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/31.jpg)
Set the working directory
# For the tutorial, load all the R program and data files provided (see Resources slide) # into a directory of your choice # Set your working directory to this directory, e.g., for a Mac setwd("/Users/ … somewhere on your computer … /R_tutorial") # and for Windows Setwd("C:/ … somewhere on your computer … /R_tutorial") # List the files in the directory list.files() > list.files() [1] "insurance.csv" "negaDve-‐words.txt" "posiDve-‐words.txt" [4] "R_facebook.R" "R_intro.R" "R_regression.R" [7] "R_twi=er_senDment.R" "R_twi=er.R" "senDment.R" [10] "twi=erHull.csv"
![Page 32: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/32.jpg)
Data types and data structures
Data types Numeric Character Logical
Data structures Vectors Lists MulD-‐dimensional
Matrices Dataframes
![Page 33: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/33.jpg)
Vectors and data classes
# this is a comment!num.var <- c(1, 2, 3, 4) # numeric vector!char.var <- c("1", "2", "3", "4") # character vector!log.var <- c(TRUE, TRUE, FALSE, TRUE) # logical vector!
> class(num.var)![1] "numeric"!> class(char.var)![1] "character"!> class(log.var)![1] "logical"!
Values can be combined into vectors using the c() funcDon
Vectors have a class which determines how funcDons treat them
![Page 34: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/34.jpg)
Vectors and data classes
> mean(num.var)![1] 2.5!> mean(char.var)![1] NA!Warning message:!In mean.default(char.var) :! argument is not numeric or logical: returning NA!
Can calculate mean of a numeric vector, but not of a character vector
![Page 35: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/35.jpg)
Lists # create a list -‐ a collecDon of vectors employees <-‐ c("John", "Sunil", "Anna") yearsService <-‐ c(3, 2, 6) empDetails <-‐ list(employees, yearsService) class(empDetails) empDetails
> class(empDetails) [1] "list" > empDetails [[1]] [1] "John" "Sunil" "Anna" [[2]] [1] 3 2 6
![Page 36: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/36.jpg)
Dataframes
DF <- data.frame(x=1:5, y=letters[1:5], z=letters[6:10])!
> DF # data.frame with 3 columns and 5 rows! x y Z!1 1 a f!2 2 b g!3 3 c h!4 4 d i!5 5 e j!
A data.frame is a list of vectors, each of the same length
![Page 37: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/37.jpg)
Multiple regression in R
![Page 38: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/38.jpg)
insurance.csv
• insurance.csv contains medical expenses for paDents enrolled in a healthcare plan
• The data file contains 1,338 cases with features of the paDent as well as the total medical expenses charged to the paDent’s healthcare plan for the calendar year
• There are no missing values (these would be shown as NA in R indicaDng empty or null)
![Page 39: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/39.jpg)
insurance.csv Variable Descrip=on
age an integer indicaDng the age of the beneficiary
sex Either “male” or “female”
bmi body mass index (BMI), which gives an indicaDon of how over or under-‐weight a person is. BMI is calculated as weight in kilograms divideD by height in metres squared. An ideal BMI is in the range 18.5 to 24.9
children an integer showing the number of children/dependents covered by the plan
smoker “yes” or “no”
region the beneficiary’s place of residence, divided into four regions: “northeast”, “southeast”, “southwest”, or “northwest”.
This example is taken from Lantz (2013)
![Page 40: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/40.jpg)
Read the data insurance <-‐ read.csv("insurance.csv",
stringsAsFactors = TRUE) head(insurance)
age sex bmi children smoker region charges!1 19 female 27.900 0 yes southwest 16884.924!2 18 male 33.770 1 no southeast 1725.552!3 28 male 33.000 3 no southeast 4449.462!4 33 male 22.705 0 no northwest 21984.471!5 32 male 28.880 0 no northwest 3866.855!6 31 female 25.740 0 no southeast 3756.622!
![Page 41: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/41.jpg)
Working with data
• There are specific funcDons for reading from (and wriDng to) excel, SPSS, SAS, etc.
• However, the simplest way is to export and import files in csv (comma separated values) format – the lingua franca of data
![Page 42: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/42.jpg)
Explore the data summary(insurance$charges)! Min. 1st Qu. Median Mean 3rd Qu. Max. ! 1122 4740 9382 13270 16640 63770 !
table(insurance$region)!
northeast northwest southeast southwest ! 324 325 364 325 !
cor(insurance[c("age", "bmi", "children", "charges")])! age bmi children charges!age 1.0000000 0.1092719 0.04246900 0.29900819!bmi 0.1092719 1.0000000 0.01275890 0.19834097!children 0.0424690 0.0127589 1.00000000 0.06799823!charges 0.2990082 0.1983410 0.06799823 1.00000000!
![Page 43: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/43.jpg)
Visualise the data
hist(insurance$charges)
![Page 44: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/44.jpg)
Visualise the data pairs(insurance[c("age", "bmi", "children", "charges")])
![Page 45: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/45.jpg)
Visualise the data -‐ be=er library(psych) pairs.panels(insurance[c("age", "bmi", "children", "charges")], hist.col="yellow”)
![Page 46: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/46.jpg)
Installing packages
• pairs.panels is a funcDon in the psych package, which needs to be installed:
![Page 47: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/47.jpg)
MulDple regression 1 ins_model1 <- lm(charges ~ age + children + bmi, data = insurance)!
Residuals:! Min 1Q Median 3Q Max !-13884 -6994 -5092 7125 48627 !!Coefficients:! Estimate Std. Error t value Pr(>|t|) !(Intercept) -6916.24 1757.48 -3.935 8.74e-05 ***!age 239.99 22.29 10.767 < 2e-16 ***!children 542.86 258.24 2.102 0.0357 * !bmi 332.08 51.31 6.472 1.35e-10 ***!---!Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1!!Residual standard error: 11370 on 1334 degrees of freedom!Multiple R-squared: 0.1201, !Adjusted R-squared: 0.1181 !F-statistic: 60.69 on 3 and 1334 DF, p-value: < 2.2e-16!
11.8% of variaDon in insurance charges is explained by the model
![Page 48: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/48.jpg)
SPSS
![Page 49: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/49.jpg)
![Page 50: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/50.jpg)
MulDple regression 2 ins_model2 <- lm(charges ~ age + children + bmi + sex + smoker + region, data = insurance)!Residuals:! Min 1Q Median 3Q Max !-11304.9 -2848.1 -982.1 1393.9 29992.8 !!Coefficients:! Estimate Std. Error t value Pr(>|t|) !(Intercept) -11938.5 987.8 -12.086 < 2e-16 ***!age 256.9 11.9 21.587 < 2e-16 ***!children 475.5 137.8 3.451 0.000577 ***!bmi 339.2 28.6 11.860 < 2e-16 ***!sexmale -131.3 332.9 -0.394 0.693348 !smokeryes 23848.5 413.1 57.723 < 2e-16 ***!regionnorthwest -353.0 476.3 -0.741 0.458769 !regionsoutheast -1035.0 478.7 -2.162 0.030782 * !regionsouthwest -960.0 477.9 -2.009 0.044765 * !---!Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1!!Residual standard error: 6062 on 1329 degrees of freedom!Multiple R-squared: 0.7509, !Adjusted R-squared: 0.7494 !F-statistic: 500.8 on 8 and 1329 DF, p-value: < 2.2e-16!
74.9% of variaDon in insurance charges is explained by the model
![Page 51: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/51.jpg)
Dummy variables
• R coded character vectors as factors and automaDcally analysed them as dummy variables – stringsAsFactors = TRUE
• In SPSS these need to be coded by hand
![Page 52: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/52.jpg)
SPSS -‐ create dummy variables • Recode sex
• 0 = female • 1 = male
• Recode smoker – 1 = yes – 0 = no
• Recode region – Number of dummies = number of groups – 1 – = 4 – 1 = 3
• northeast = 0, 0, 0 • northwest = 1, 0, 0 • southeast = 0, 1, 0 • southwest = 0, 0, 1
![Page 53: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/53.jpg)
Recode
![Page 54: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/54.jpg)
Recode
![Page 55: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/55.jpg)
Paste the syntax
![Page 56: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/56.jpg)
Data set with all dummy coded variables created*
*aKer quite a bit of work!
![Page 57: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/57.jpg)
![Page 58: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/58.jpg)
Mining Twitter with R
![Page 59: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/59.jpg)
Twi=eR
• Install the Twi=eR package to access the Twi=er API
• Before you can access Twi=er from R you have to: – Sign up for a Twi=er developer account – Create a Twi=er app and copy the authenDcaDon details
![Page 60: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/60.jpg)
Twi=er authenDcaDon
For details of how to authenDcate and set up Twi=eR: h=p://thinktostart.wordpress.com/2013/05/22/twi=er-‐authenDficaDon-‐with-‐r/
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx
![Page 61: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/61.jpg)
Twi=er authenDcaDon ############################################# # 1 -‐ AuthenDcate with twi=er API ############################################# library(twi=eR) library(ROAuth) api_key <-‐ ”xxxxxxxxxxxxxxxxxxYsnu5NM" api_secret <-‐ "wm5kU4xxxxxxxxxxxxxxxxxxxxxxxxxxxxQmMyzuBRbATklN05" access_token <-‐ "581xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxFJwaiccnAAzScISQlp4o" access_token_secret <-‐ "tqHnnDDxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxnhscT4sTp" setup_twi=er_oauth(api_key,api_secret,access_token,access_token_secret)
From your Twi=er applicaDon sezngs
![Page 62: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/62.jpg)
Retrieve Twi=er account details ############################################################ # 2 -‐ Get details of Hull Uni Twi=er account ############################################################ twitacc <-‐ getUser('UniOfHull') twitacc$getDescripDon() twitacc$getFollowersCount() friends = twitacc$getFriends(10) # limit the number of friends returned to 10 friends
> twitacc$getDescripDon() [1] "Our official Twi=er feed featuring the latest news and events from the University of Hull" > twitacc$getFollowersCount() [1] 20025 > friends = twitacc$getFriends(10) > friends $`2644622095` [1] "HullNursing" $`1536529304` [1] "HullSimulaDon”
![Page 63: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/63.jpg)
Trace the network net <-‐ getUser(friends[8]) # Sheridansmith1 is friend no. 8 net$getDescripDon() net$getFollowersCount() net$getFriends(n=10)
> net$getDescripDon() [1] "Sister of @damiandsmith of that there band @_TheTorn :) x" > net$getFollowersCount() [1] 502167 > net$getFriends(n=10) $`88598283` [1] "overnightstv" $`423373477` [1] "IsleLoseIt" $`19650489` [1] "donneriron"
![Page 64: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/64.jpg)
![Page 65: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/65.jpg)
Get tweets and write to file
############################################################ # 3 -‐ Search Twi=er ############################################################ Twi=er.list <-‐ searchTwi=er('@UniOfHull', n=500) # limited to 500 tweets for demo Twi=er.df = twListToDF(Twi=er.list) write.csv(Twi=er.df, file='twi=er.csv', row.names=F)
![Page 66: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/66.jpg)
Tweet data read into Excel
![Page 67: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/67.jpg)
Tweet text @UniOfHull
![Page 68: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/68.jpg)
Analyse the tweets: the tm package
# read the twi=er data into a data frame (use previously stored .csv to # replicate the twi=er analysis presented here) tweet_raw <-‐ read.csv(“twi=erHull.csv", stringsAsFactors = FALSE) # remove the non-‐text characters tweet_raw$text <-‐ gsub("[^[:alnum:]///' ]", "", tweet_raw$text) # build a corpus, which is a collecDon of text documents # VectorSource specifies that the source is a character vector library(tm) myCorpus <-‐ Corpus(VectorSource(tweet_raw$text))
![Page 69: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/69.jpg)
Inspect the corpus # examine the sms corpus inspect(myCorpus[1:3]) [[1]] Have you been to the UniOfHull popup campus in Leeds yet It's at the White Cloth Gallery Aire Street for all your Clearing2014 queries [[2]] RT UniOfHull In Newcastle and looking for Clearing2014 advice Head to the UniOfHull popup campus at Newcastle Arts Centre on Westgate [[3]] RT Bethanyn96 Got into unio�ull so happy
![Page 70: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/70.jpg)
Clean corpus and create a wordcloud # clean up the corpus using tm_map() corpus_clean <-‐ tm_map(corpus_clean, PlainTextDocument) corpus_clean <-‐ tm_map(myCorpus, tolower) corpus_clean <-‐ tm_map(corpus_clean, removeNumbers) corpus_clean <-‐ tm_map(corpus_clean, removeWords, stopwords()) corpus_clean <-‐ tm_map(corpus_clean, removePunctuaDon) corpus_clean <-‐ tm_map(corpus_clean, stripWhitespace) # remove any words not wanted in the wordcloud corpus_clean <-‐ tm_map(corpus_clean, removeWords, "unio�ull") # create a wordcloud library(wordcloud) wordcloud(corpus_clean, min.freq = 10, random.order = FALSE, colors=brewer.pal(8, "Dark2"))
![Page 71: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/71.jpg)
August 2014
![Page 72: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/72.jpg)
SenDment analysis
• What is the senDment of the tweets? • How does the number of posiDve words compare with the number of negaDve words?
• The number of posiDve words minus the number of negaDve words gives a rough indicaDon of the “senDment” of the tweet
• The posiDve and negaDve words are taken form a word list developed by Hu and Liu: – h=p://www.cs.uic.edu/~liub/FBS/opinion-‐lexicon-‐English.rar
![Page 73: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/73.jpg)
SenDment analysis
# load libraries library(plyr) library(ggplot2) # load the score.senDment() funcDon – see appendix A for code source( 'senDment.R' ) # read the tweets saved previously hull.tweets <-‐ read.csv(“twi=erHull.csv", stringsAsFactors = FALSE) # read the lists of pos and neg words from Hu & Liu hu.liu.pos = scan('posiDve-‐words.txt', what='character', comment.char=';') hu.liu.neg = scan('negaDve-‐words.txt', what='character', comment.char=';')
![Page 74: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/74.jpg)
SenDment analysis # extract the text of the tweets and pass to the senDment funcDon for # scoring (see Appendix A for R code) hull.text = hull.tweets$text hull.scores = score.senDment(hull.text, pos.words, neg.words, .progress='text') # make a histogram of the scores ggplot(hull.scores, aes(x=score)) + geom_histogram(binwidth=1, colour="black", fill="lightblue") + xlab("SenDment score") + ylab("Frequency") + ggDtle("TWITTER: Hull University SenDment Analysis")
![Page 75: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/75.jpg)
Number of posiDve word matches minus the number of negaDve word matches
![Page 76: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/76.jpg)
WriDng the data out for text analysis
• Many text analysis packages require each comment to be in a separate file
• If the data is in Excel or SPSS it will be cumbersome to generate the files manually
• Write R code instead
![Page 77: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/77.jpg)
Create an output directory # create an output directory for the txt files if it does not exist mainDir <-‐ getwd() subDir <-‐ "outputText" if (file.exists(subDir)){ setwd(file.path(mainDir, subDir)) } else { dir.create(file.path(mainDir, subDir)) setwd(file.path(mainDir, subDir)) }
![Page 78: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/78.jpg)
Loop to write the files
# find out how many rows in the data.frame tweets = nrow(Twi=er.df) # loop to write the txt files for (tweet in 1:tweets) { tweetText = Twi=er.df[tweet, 1] filename = paste("output", tweet, ".txt", sep = "") writeLines(tweetText, con = filename) }
Note that R has iterators that oKen remove the need to write loops, see the “apply” family of funcDons
![Page 79: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/79.jpg)
![Page 80: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/80.jpg)
Accessing Facebook with R
![Page 81: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/81.jpg)
Rfacebook
## loading libraries library(Rfacebook) library(igraph) # get your token from 'h=ps://developers.facebook.com/tools/explorer/' # make sure you give permissions to access your list of friends # set your FB token token <-‐ “xxxxxxxxxxxxxxCAACEdEose0cBALkyqIxxxxxxxxxxxxxxx”
For details of how to authenDcate and use Rfacebook: h=p://pablobarbera.com/blog/archives/3.html Get the access token here: h=ps://developers.facebook.com/tools/explorer
![Page 82: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/82.jpg)
Get the data and graph the network
# download adjacency matrix for network of Facebook friends my_network <-‐ getNetwork(token, format="adj.matrix") # friends who are friends with me alone singletons <-‐ rowSums(my_network)==0 # graph the network my_graph <-‐ graph.adjacency(my_network[!singletons,!singletons]) layout <-‐ layout.drl(my_graph,opDons=list(simmer.a=racDon=0)) plot(my_graph, vertex.size=2, #vertex.label=NA, vertex.label.cex=0.5, edge.arrow.size=0, edge.curved=TRUE,layout=layout)
![Page 83: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/83.jpg)
Facebook network graph
Ok, it’s ugly – there are plenty more social network analysis and graphing packages in R to try, or you can write a bit of code to export the adjacency matrix to another package, e.g., UCINET/Netdraw, Pajek, Gephi
![Page 84: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/84.jpg)
VisualizaDon in Gephi
write.graph(graph = my_graph, file = '�.gml', format = 'gml')
![Page 85: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/85.jpg)
To find out what funcDons a package supports, what they do and how to call them see the documentaDon
![Page 86: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/86.jpg)
Beg, borrow, steal code!
• Don’t bother wriDng code from scratch • If you want to know how to do something then Google it
• There will likely be a soluDon that you can scrape off the screen and modify for your own purposes
• For example – How would you remove rows from a dataframe with missing values (NA)?
– Try “r how to remove missing values” in Google
![Page 87: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/87.jpg)
What’s next?
• Access data held in SQL databases and store your own data, e.g., MySQL, and the packages
• Read, write, manipulate Excel spreadsheets using xls and XLConnect
• Access maps, e.g., Google Maps, and overlay locaDon data (e.g., Tweets) on a map
• Screen scrape Web sites that don’t have APIs (e.g., Google Scholar)
![Page 88: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/88.jpg)
Suggested further reading and resources
• Lantz, B., (2013). Machine Learning with R. Packt Publishing. (highly recommended)
• Miller, T., (2014). Modeling Techniques in Predic;ve Analy;cs: Business Problems and Solu;ons. Pearson EducaDon.
• R Reference Card 2.0 – h=p://cran.r-‐project.org/doc/contrib/Baggo=-‐refcard-‐v2.pdf
• R Reference Card for Data Mining – h=p://cran.r-‐project.org/doc/contrib/YanchangZhao-‐refcard-‐data-‐mining.pdf
• R-‐bloggers for news and tutorials – h=p://www.r-‐bloggers.com
![Page 89: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/89.jpg)
Appendices
![Page 90: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/90.jpg)
#' #' score.sentiment() implements a very simple algorithm to estimate #' sentiment, assigning a integer score by subtracting the number #' of occurrences of negative words from that of positive words. #' #' @param sentences vector of text to score #' @param pos.words vector of words of postive sentiment #' @param neg.words vector of words of negative sentiment #' @param .progress passed to <code>laply()</code> to control of progress bar. #' @returnType data.frame #' @return data.frame of text and corresponding sentiment scores #' @author Jefrey Breen [email protected]
Appendix A The score.senDment() funcDon
h=ps://github.com/jeffreybreen/twi=er-‐senDment-‐analysis-‐tutorial-‐201107
![Page 91: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/91.jpg)
score.sentiment = function(sentences, pos.words, neg.words, .progress='none') { require(plyr) require(stringr) # we got a vector of sentences. plyr will handle a list or a vector as an "l" for us # we want a simple array of scores back, so we use "l" + "a" + "ply" = laply: scores = laply(sentences, function(sentence, pos.words, neg.words) { # remove the non-text characters sentence <- gsub("[^[:alnum:]///' ]", "", sentence) # clean up sentences with R's regex-driven global substitute, gsub(): sentence = gsub('[[:punct:]]', '', sentence) sentence = gsub('[[:cntrl:]]', '', sentence) sentence = gsub('\\d+', '', sentence) # and convert to lower case: sentence = tolower(sentence)
![Page 92: R tutorial](https://reader031.vdocument.in/reader031/viewer/2022020217/54b7ac494a79590e468b45de/html5/thumbnails/92.jpg)
# split into words. str_split is in the stringr package word.list = str_split(sentence, '\\s+') # sometimes a list() is one level of hierarchy too much words = unlist(word.list) # compare our words to the dictionaries of positive & negative terms pos.matches = match(words, pos.words) neg.matches = match(words, neg.words) # match() returns the position of the matched term or NA # we just want a TRUE/FALSE: pos.matches = !is.na(pos.matches) neg.matches = !is.na(neg.matches) # and conveniently enough, TRUE/FALSE will be treated as 1/0 by sum(): score = sum(pos.matches) - sum(neg.matches) return(score) }, pos.words, neg.words, .progress=.progress ) scores.df = data.frame(score=scores, text=sentences) return(scores.df) }