Basic introduction into R
TRANSCRIPT
Quantitative research methods
Data analysis workflow
Statistical Software
Installing R and RStudio
Getting help
Introduction into R
Part 1A
Richard L. Zijdeman
2016-06-15
1 Quantitative research methods
2 Data analysis workflow
3 Statistical Software
4 Installing R and RStudio
5 Getting help
Quantitative research methods
Why
To answer descriptive and explanatory questions on populations
Workflow: PTE
problem (research question)
theory (hypothesis)
empirical test
... with loops between T-E and P-T-E
Research Questions
descriptive (to what extent ...)
comparative (comparing two entities)
trend (comparison over time)
explanatory (focus on mechanism at hand)
Theory
deductive reasoning
explanans
general mechanism
condition
explanandum (hypothesis)
Empirical test
sample vs. population
random vs. stratified samples
testing technique, e.g.:
T-test, correlation, regression
Software required for faster analysis
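As a rough illustration of the three testing techniques named above, the sketch below runs each of them on R's built-in mtcars data. The dataset and the variables used (mpg, hp, am) are assumptions made for this example only. They are not part of the original slides.

```r
# Illustrative only: mtcars and the chosen variables are assumptions

# T-test: do automatic (am = 0) and manual (am = 1) cars differ in mean mpg?
tt <- t.test(mpg ~ am, data = mtcars)
print(tt)

# Correlation between horse power and miles per gallon
r <- cor(mtcars$hp, mtcars$mpg)
print(r)

# Regression: miles per gallon explained by horse power
fit <- lm(mpg ~ hp, data = mtcars)
print(coef(fit))
```

Each call takes a single line, which is the point of using software for the empirical test.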
Data analysis workflow
Empirical testing has its own workflow
Grolemund & Wickham, 2016, Creative Commons Attribution-NonCommercial-NoDerivs 4.0.
Statistical Software
The dangers of analysing with spreadsheets (e.g. MS Excel)
tempting to input and clean data and analyse in the same sheet
difficult to track cleaning rules
defaults mess up your data (e.g. 01200 -> 1200)
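The leading-zero example (01200 -> 1200) can be reproduced in R as well, which may help to show why an explicit, scripted rule is safer than a silent default. The sketch below uses a small made-up dataset. The column names id and code are hypothetical.

```r
# Hypothetical example data: codes that start with a zero
csv_text <- "id,code\n1,01200\n2,00345"

# Default import guesses a numeric type and drops the leading zeros
as_number <- read.csv(text = csv_text)
print(as_number$code)

# Declaring the column as character keeps the codes intact
as_text <- read.csv(text = csv_text, colClasses = c(code = "character"))
print(as_text$code)
```

Unlike a spreadsheet default, the rule "read code as text" is now written down and can be checked by others.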
Why use syntax (scripting)
Efficiency (really)
Quality (error checking)
Replicability
Communication
R
R is open source, which is good and bad:
anybody can contribute (check, improve, create code)
free of charge
but: R depends on collective action
cannot ‘demand’ support
sprawl of packages
RStudio
browser for R
provides easy access to:
scripts
data
plots
manual
Installing R and RStudio
Download R
Instructions via http://www.r-project.org
Choose a CRAN mirror
http://cran.r-project.org/mirrors.html
close, but active too!
Romania hasn’t gone (yet!)
Click on ‘Download R for Windows’
Follow usual installation procedure
Double click on R
You should now have a working session!
Close the session, do not save workspace image
Packages and libraries
base R (core product)
additional packages
CRAN repository
spread through ‘mirrors’
choose a local, but active mirror
GitHub
packages not on CRAN
development versions of CRAN libraries
RStudio
RStudio is found on http://www.rstudio.com
Download the version for your OS (e.g. Windows)
http://www.rstudio.com/products/rstudio/download/
Install by double clicking on the downloaded file
Start RStudio by double clicking on the icon
You do not need to start R before starting RStudio
Getting help
Built-in help: “?”
?[function] / ?[package]
e.g. “?plot” or “?graphics”
check the index for user guides and vignettes
CRAN website
Manuals
R FAQ
R Journal
Online communities
Stack Overflow
Instance of Stack Exchange
Reputation-based Q&A
Specific lists for packages, e.g.:
ggplot2
R-sig-mixed-models
Asking a question, getting an answer
Search the web: others must have had this problem too
If you raise a question:
be polite
be concise
short background
reproducible example
describe your efforts so far
Introducing RStudio and R
Introducing base R
Data visualization using ggplot2
Introduction into R
Part 1B
Richard L. Zijdeman
2016-06-15
1 Introducing RStudio and R
2 Introducing base R
3 Data visualization using ggplot2
Introducing RStudio and R
RStudio
RStudio is sort of a ‘viewer’ on R
helps to organize input and output:
editor (upper left)
console (lower left)
environment (upper right)
output (lower right)
R script
series of commands to manipulate data
always save your script, NEVER change your data
original data + script = reproducible research
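The rule "original data + script" can be sketched as follows. This is only one possible layout. The file names, the temporary directory and the cleaning rule are hypothetical and serve to make the example run on its own.

```r
# Hypothetical raw file, created here only so the sketch is self-contained
raw_file <- file.path(tempdir(), "raw_data.csv")
write.csv(data.frame(id = 1:3, age = c(34, -1, 27)), raw_file, row.names = FALSE)

# Step 1: read the original data; the file itself is never edited
raw <- read.csv(raw_file)

# Step 2: record every cleaning rule as a command
clean <- raw
clean$age[clean$age < 0] <- NA # assumed rule: negative ages are treated as missing

# Step 3: save the result under a new name, leaving the raw file untouched
clean_file <- file.path(tempdir(), "clean_data.csv")
write.csv(clean, clean_file, row.names = FALSE)
```

Rerunning the script on the untouched raw file always gives the same cleaned data.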
Packages
Build your R system using packages
‘Base R’ is basic. Add packages for your specific needs
Packages are found on servers, called ‘mirrors’
Make sure to select a mirror first
https://cran.r-project.org/mirrors.html
## To set the mirror for the current session, type
## (add the line to your .Rprofile to make it permanent):
options(repos = structure(c(CRAN = "http://cran.xl-mirror.nl")))
## replace http://... with your favorite mirror
Packages for book (see 1.4.2)
pkgs <- c(
  "broom", "dplyr", "ggplot2", "jpeg", "jsonlite",
  "knitr", "Lahman", "microbenchmark", "png", "pryr",
  "purrr", "rcorpora", "readr", "stringr", "tibble",
  "tidyr"
)
install.packages(pkgs)
R Session
contains scripts, data, functions
can be saved as a ‘workspace image’
prefer not to:
sessions are usually cluttered
only useful if running the script takes time
Suggested tweak:
Options: uncheck “Restore .RData into workspace at startup”
Options: Save workspace to .RData on exit, select ‘never’
Introducing base R
base R: assignment and print()
‘attach’ values to an object (e.g. a variable)
x <- 5
y <- 4
z <- x * y
print(z)
## [1] 20
base R: assignment and print() (II)
Try and imagine the potential of assignment
x <- c(4, 3, 2, 1, 0, 27, 34, 35) # 'c' for concatenate values
y <- -1
z <- x * y
print(z)
## [1] -4 -3 -2 -1 0 -27 -34 -35
base R: data.frame
basically a table
contains columns (variables)
contains rows (cases)
“flat table” in Kees’ terminology
my.df <- data.frame(x, z)
str(my.df) # show STRucture
## 'data.frame': 8 obs. of 2 variables:
## $ x: num 4 3 2 1 0 27 34 35
## $ z: num -4 -3 -2 -1 0 -27 -34 -35
There’s much more, but let’s keep that for tomorrow
Data visualization using ggplot2
Visualizing your data
Not just for analyses!
Data quality
representativeness
missing data
plot() in base R
library(help = "datasets") # all datasets in R
?mtcars # show help on mtcars dataset
df <- mtcars # mtcars is a data frame, not a function: no parentheses
str(mtcars) # display STRucture of an object
plot(mtcars$hp, mtcars$mpg)
plot(df)
plot() is like . . .
plot() is like LaTeX:
Forge it in any way you want
Heterogeneous approach though
Takes quite some time to get it right
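To give an impression of this piece-by-piece approach, here is a minimal sketch that dresses up the hp versus mpg plot. The title, labels, colours and the added regression line are choices made for this illustration only.

```r
# Each element is set through its own argument or a separate function call
plot(mtcars$hp, mtcars$mpg,
     main = "Horse power versus fuel use", # title
     xlab = "Horse power",                 # x-axis label
     ylab = "Miles per gallon",            # y-axis label
     pch = 19,                             # solid circles
     col = "blue")                         # point colour

# Add a dashed regression line on top of the existing plot
abline(lm(mpg ~ hp, data = mtcars), lty = 2)

# The legend is yet another separate call
legend("topright", legend = "linear fit", lty = 2)
```

Arguments such as pch, lty and col each have their own conventions, which is part of why getting a base R plot right takes time.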
ggplot() as alternative
ggplot is but one of many graph packages
ggplot is nice because of:
similar approach to various types of graphs
easy build-up for basic graphs
can get quite complex too
(but cannot do it all)
ggplot() and the canvas metaphor
ggplot() consists of two elements
canvas
(multiple) layers of paint
mapping and geom layers
ggplot() consists of two elements
canvas:
data
mapping (aesthetic)
(multiple) layers of paint
geom layers
ggplot(data = <DATASET>,
       mapping = aes(x = <X-VAR>, y = <Y-VAR>)) +
  geom_<TYPE>()
our first ggplot
install.packages("ggplot2")
library(ggplot2)
df <- mtcars
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point()
geom_ features
?geom_point
install.packages("ggplot2")
library(ggplot2)
df <- mtcars
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(fill = "white", colour = "blue",
             shape = 21, size = 4)
Adding characteristics to your plot
Add variables to explain a pattern
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt), size = 4)
NB: notice the difference?
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = wt, size = 4))
Multiple geom’s
Add variables to explain a pattern
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = as.factor(am)),
             size = 6) + # increase size bc overlap
  geom_point(aes(shape = as.factor(vs)),
             size = 3)
# vs: engine shape, V-shaped (0) or straight (1)
Adding facets
Facets help reduce complexity
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(aes(colour = as.factor(am)),
             size = 4) +
  facet_wrap(~ vs)
Things to consider with geom(_point)
fill only works where shape actually can be filled
consider order of geoms
mind overlap:
decrease size
use alpha
use ‘open’ shapes
geom_jitter
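The remedies for overlap listed above might look as follows on the mtcars data. The specific values for size, alpha and jitter width, as well as the choice of cyl and gear for the jitter example, are assumptions for illustration.

```r
library(ggplot2)

df <- mtcars

# 1. Smaller, semi-transparent points: overlapping points look darker
p_alpha <- ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(size = 2, alpha = 0.4)

# 2. 'Open' shape (shape 1 is a hollow circle), so points behind stay visible
p_open <- ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point(shape = 1, size = 3)

# 3. Jitter: add a little random noise to separate identical values
p_jitter <- ggplot(data = df, aes(x = cyl, y = gear)) +
  geom_jitter(width = 0.2, height = 0.2)

print(p_alpha)
print(p_open)
print(p_jitter)
```

Jitter changes the plotted positions slightly, so it suits discrete variables such as cyl and gear better than precise measurements.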
ggplot and titles
Various ways to add titles to axes and stuff
Can get quite complex
Here’s the basics
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Nice graph", x = "Horse Power",
       y = "Miles per Gallon")
Themes and size
ggplot(data = df, aes(x = hp, y = mpg)) +
  geom_point() +
  labs(title = "Nice graph", x = "Horse Power",
       y = "Miles per Gallon") +
  theme_bw(base_size = 16)
Much more to learn
not just about ggplot()
axes
legend (guides)
geoms
also about dataviz in general
general do’s and don’ts
which problem fits which graph
it’s a science! (Graph theory)
Data wrangling
bit about NA
Introduction into R
Part 2A, 2B
Richard L. Zijdeman
2016-06-16
1 Data wrangling
2 bit about NA
Data wrangling
Grolemund & Wickham, 2016, Creative Commons
Attribution-NonCommercial-NoDerivs 4.0.
dplyr package
# install.packages("dplyr") # 1 time only
library(dplyr)
install.packages("nycflights13")
library(nycflights13)
print(flights)
tibble or data_frame vs data.frame
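str(mtcars)
class(mtcars)
mtcars_tbl <- as_tibble(mtcars) # as_data_frame() is the older, deprecated name
str(mtcars_tbl) # inspect the converted object, not the original
class(mtcars_tbl)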
filter
filter(mtcars, am == 1, vs == 0)
some.cars <- filter(mtcars, am == 1, vs == 0)
some.cars
(some.cars2 <- filter(mtcars, am == 1, vs == 0))
filter and using or
filter(mtcars, gear == 3 | gear == 4)
# !! not like this:
filter(mtcars, gear == 3 | 4)
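Why the second form fails can be checked in base R without any package. The small gear vector below is made up for this illustration.

```r
gear <- c(3, 4, 5, 4, 3) # hypothetical values for illustration

# '==' is evaluated before '|', and any non-zero number counts as TRUE,
# so this reads as (gear == 3) | TRUE and is TRUE for every element
gear == 3 | 4

# Correct: repeat the comparison on both sides of '|'
gear == 3 | gear == 4

# Shorter alternative with %in%
gear %in% c(3, 4)
```

Inside filter() the last two conditions keep only cars with 3 or 4 gears, while the first keeps every row.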
bit about NA
Arrange
arrange(flights, dep_time)
arrange(flights, year, month, day) # ascending order
arrange(flights, desc(day))
# NB: missing values come at end
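A few base R lines may help to see how missing values (NA) behave, including in sorting. The vector x is made up for this sketch, and sort() is used here only as a stand-in to mimic what arrange() does with missing values.

```r
x <- c(5, NA, 2, 8) # hypothetical vector with one missing value

NA > 3                # comparing with NA gives NA, not TRUE or FALSE
is.na(x)              # test for missing values with is.na(), not with == NA
mean(x)               # NA: one unknown value makes the mean unknown
mean(x, na.rm = TRUE) # remove missing values before computing

sort(x, na.last = TRUE)                    # ascending, NA placed last
sort(x, decreasing = TRUE, na.last = TRUE) # descending, NA still last
```

This is also why the earlier examples pass na.rm = TRUE to mean().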
Select
df <- select(flights, year, month, day)
names(flights)
df <- select(flights, tailnum:dest)
df <- select(flights, -(tailnum:dest))
df
df <- select(flights, starts_with("arr_"))
df <- select(flights, ends_with("e"))
df <- select(flights, contains("a"))
rename
df <- rename(flights, Y_ear = year)
df <- mutate(flights, year1 = year+1)
select(df, year, year1)
df <- mutate(flights, year1 = year + 1, year2 = year1+1)
select(df, contains("year"))
df <- transmute(flights, year1 = year + 1, year2 = year1+1)
# only maintains the newly created variables
group_by
by_day <- group_by(flights, year, month, day)
summarise(by_day)
cars <- mtcars
cars <- as_tibble(mtcars) # as_data_frame() is the older, deprecated name
summarise(cars, mean_hp = mean(hp, na.rm = TRUE))
mean(cars$hp, na.rm = TRUE)
the pipe: %>%
cars_grp <- group_by(cars, carb)
class(cars)
class(cars_grp)
summarise(cars_grp, mmpg = mean(mpg, na.rm = TRUE))
cars_grp_sum <- summarise(cars_grp,
                          mmpg = mean(mpg, na.rm = TRUE),
                          count = n())
cars_grp_sum
library(ggplot2) # needed for ggplot(), if not already loaded
p <- ggplot(cars_grp_sum,
            aes(x = carb, y = mmpg,
                label = carb)) +
  geom_point(aes(size = count)) +
  geom_text(colour = "cyan")
p
# the same summary, written with the pipe
cars_grp_sum2 <- cars %>%
  group_by(carb) %>%
  summarise(mmpg = mean(mpg, na.rm = TRUE),
            count = n())
ggplot(cars_grp_sum2, aes(x = carb, y = mmpg, label = carb)) +
  geom_point(aes(size = count)) +
  geom_text(colour = "cyan") +
  labs(title = "figure with %>%")
more pipe, adding a filter
cars_grp_sum3 <- cars %>%
  group_by(carb) %>%
  summarise(mmpg = mean(mpg, na.rm = TRUE),
            count = n()) %>%
  filter(count > 3)
ggplot(cars_grp_sum3, aes(x = carb, y = mmpg, label = carb)) +
  geom_point(aes(size = count)) +
  geom_text(colour = "cyan") +
  labs(title = "figure with %>% and count > 3")
Session management
Basic data manipulation
Introduction into R
Part 3A
Richard L. Zijdeman
2016-06-17
1 Session management
2 Basic data manipulation
Session management
Maintaining your workspace
Grolemund & Wickham, 2016, Creative Commons
Attribution-NonCommercial-NoDerivs 4.0.
Setting up a session
clear your Environment
check sessionInfo() for loaded packages
detach obsolete packages under ‘other attached packages’
set your working directory (“\\” on Windows and “/” for Linux/Mac)
load libraries (install new ones)
load your data
Example session setup
rm(list = ls())
sessionInfo() # check for 'other attached packages'
detach("package:nycflights13", unload = TRUE)
setwd("/Users/RichardZ/Dropbox/Summer school 2016/Richard Zijdeman/")
getwd() # to see whether you're in the right directory
dir() # shows what's in your directory
Loading your data
read.table() (generic function)
read.csv()
library(foreign) # e.g. SPSS and Stata
library(readxl) # fast excel-package
Reading in data
Different functions for different files:
Base R: read.table() (read.csv())
foreign package: read.spss(), read.dta(), read.dbf()
readxl
alternative packages:
xlsx (Java required)
gdata (Perl-based)
openxlsx package: read.xlsx()
read.csv()
file: your file, including directory
header: variable names or not?
sep: separator
read.csv default: “,”
read.csv2 default: “;”
skip: number of rows to skip
nrows: total number of rows to read
stringsAsFactors
encoding (e.g. “latin1” or “UTF-8”)
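The arguments above can be tried out on a small file. The sketch below writes a made-up semicolon-separated file to a temporary directory, so the file name, its contents and the note line on top are all hypothetical.

```r
# Hypothetical file, written here only so the sketch is self-contained
csv_file <- file.path(tempdir(), "example_semicolon.csv")
writeLines(c("exported from some database", # a note line to be skipped
             "name;age",
             "Anna;34",
             "Piet;27",
             "Kees;51"),
           csv_file)

people <- read.csv(csv_file,
                   header = TRUE,            # first line read holds variable names
                   sep = ";",                # semicolon instead of the default ","
                   skip = 1,                 # skip the note line
                   nrows = 2,                # read only the first two data rows
                   stringsAsFactors = FALSE, # keep text as character
                   encoding = "UTF-8")
str(people)

# read.csv2() has sep = ";" (and dec = ",") as defaults
people2 <- read.csv2(csv_file, skip = 1, stringsAsFactors = FALSE)
str(people2)
```

Reading only a few rows with nrows is a quick way to check the settings before loading a large file.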
read_excel from readxl package
path: your file, including directory
sheet: name or number of sheet
col_names: col names in 1st row?
col_types: specify type
na: what’s the sign for missing values
skip: how many rows to skip before data starts
Example session loading your csv data
# setwd() to set your working directory
hmar100 <- read.csv("./Datafiles_HSN/HSN_marriages.csv",
                    stringsAsFactors = FALSE,
                    encoding = "latin1",
                    header = TRUE,
                    nrows = 100) # just first 100 rows
Example session loading your excel data
# setwd() to set your working directory
install.packages("readxl")
library("readxl")
hmar <- read_excel("./Datafiles_HSN/HSN_marriages_awful.xlsx",
                   col_names = TRUE,
                   skip = 3) # empty lines not counted!!!
Basic data manipulation
Change case of text
tolower()
toupper()
tolower("CaN we pleASe jUSt have LOWER cases?")
names(hmar) <- tolower(names(hmar))
length()
Used to count how many instances there are
length(names(hmar))
# shows number of variables in hmar
Basic statistical techniques
Introduction into R
Part 3B
Richard L. Zijdeman
2016-06-17
1 Basic statistical techniques
Basic statistical techniques
Box and whisker plot
Distribution of data
Median: 50% of the cases above and below
Box: 1st and 3rd quartile
Interquartile range (IQR): Q3-Q1
Outliers (Tukey, 1977):
x < Q1 - 1.5*IQR
x > Q3 + 1.5*IQR
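The outlier rule can be worked through by hand in base R. The ages below are invented for this sketch, and quantile() is used with its default method, which is one of several ways to compute quartiles.

```r
x <- c(18, 21, 22, 23, 24, 25, 26, 27, 29, 58) # hypothetical ages

q1 <- quantile(x, 0.25, names = FALSE) # first quartile
q3 <- quantile(x, 0.75, names = FALSE) # third quartile
iqr <- q3 - q1                         # same as IQR(x)

lower <- q1 - 1.5 * iqr # lower fence
upper <- q3 + 1.5 * iqr # upper fence

# Values outside the fences are flagged as outliers
outliers <- x[x < lower | x > upper]
print(outliers)
```

With these values the fences are 15.5 and 33.5, so only the age of 58 is flagged.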
p <- ggplot(hmar, aes(sign_groom, age_groom))
p + geom_boxplot()
hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(is.na(sign_groom))))
p <- ggplot(hmar, aes(sign_groomD, age_groom))
p + geom_boxplot()
hmar <- mutate(hmar, sign_groomD = (sign_groom == "h" & !(is.na(sign_groom))))
p <- ggplot(hmar, aes(sign_groomD, age_groom))
p + geom_boxplot() + geom_jitter(shape = 24, width = 0.2)
library(stats)
var.test(age_groom ~ sign_groomD, data = hmar)
t.test(age_groom ~ sign_groomD, data = hmar)
# NB: always check for variances
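One way to let the variance check feed into the t-test is sketched below on the built-in mtcars data, since the HSN file is not available here. The 0.05 threshold and the variable names vt, equal_var and tt are assumptions of this example.

```r
# Illustration on mtcars (an assumption; the slides use the HSN marriages file)
vt <- var.test(mpg ~ am, data = mtcars) # F test: are the two variances equal?
print(vt$p.value)

# Assumed decision rule for this sketch: treat variances as equal if p >= 0.05
equal_var <- vt$p.value >= 0.05

# var.equal = FALSE (the default) gives the Welch test,
# var.equal = TRUE gives the classical pooled-variance t-test
tt <- t.test(mpg ~ am, data = mtcars, var.equal = equal_var)
print(tt)
```

Leaving var.equal at its default is the more cautious choice when in doubt, because the Welch test does not assume equal variances.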
A small PTE project
Look at the variables in the HSN files
Think of a research question
Provide a general mechanism and hypothesis
Plot your results