r for beginners - wordpress.com · 3/25/2019 r for beginners...

3/25/2019 R for beginners

file:///X:/PIs/Seba/Teaching/R/R_important_commands_list_Seba/R_important_commands_website.html 1/135

R for beginnersSebastian Korb ([email protected] (mailto:[email protected]))25 March, 2019

IntroWhy do R (and not SPSS) ?

Free and widely usedCombine code, analyses, plots, and textDriving a car vs. riding a busLarge variety and flexibility of graphs in R:

e.g. show word frequency with a word cloudMake interactive graphs

How to set up R:PackagesHow to install old packages

Know your version of RHow to master the basics of the R languageHow to cite RR MarkdownCommands, objects and functionsScriptsComment generouslyEnvironmentMy list of R commands

Get the working directorySet the working directoryCreate objectsClear objectsCheck what objects are in your EnvironmentCreate a new folderList files in a folderData typesRead data in R

mailto:[email protected]



read.delimBuilt-in data

Saving dataExplore data (descriptive statistics)

Checking normalityChecking homogeneity of variance

Indexing elements in a vector or dataframeCoerce to data.framePlotting (with base graphics)

Scatter plot (2 variables)The par callScatter plot (more variables)Line plotSunflower plotBox plotMosaic plotPie chartsBarplotHistogramQQ plot

Plotting with ggplot2Bar and line plots using base graphicsBar and line plots with ggplot2ggplot2’s terminologyScatter plots with ggplot2Scatter plots with plot() vs. ggplot()Bar plot with errorbarsBox plot with data points (grouped by 1 factor)Box plot with data points (grouped by 2 factors)Violin and box plots combined

Pirate plotsArranging several plots in a figureSaving your plots and graphsFlow control

for() loops



if… statementif…elseif…else ififelse()

Create new variable names in loopCreate a functionTake input from userStatistics with R

Descriptives (ddply)CorrelationCorrplots and heatmapsStudent’s t-test

One-sample t-testIndependent t-testDependent (paired-sample) t-test

Mann-Whitney U & Wilcoxon signed ranks testsSimple regressionMultiple regressionANOVA (one-way, two-way, within, between, mixed, rep measures)

One-way ANOVArepeated measures ANOVA

IntroWelcome to this non-exhaustive list and description of the basics of programming in R, and of some of the most useful commands to work withdatasets of the kind you would get in a typical psychology experiment (e.g. responses, reaction times, etc.).

For me it is useful as a place where I can quickly look things up.

For you it might serve the same purpose. If instead you are new to R or to programming in general, this doc (and the links in it) can be a goodplace to get started.

Be aware that although this doc is primarily about R, you will sometimes find references to and comparisons with the Matlab language (anotherprogram I and psychologists/neuroscientists in general like to use).

Good luck!!



Sebastian Korb, Ph.D. Faculty of Psychology, University of Vienna, Austria My webpage (https://sebkorb.wordpress.com/) For comments please write to sebkorb1[at]gmail.com

Why do R (and not SPSS) ?Free and widely usedR is completely free, and there is a large community using it. World’s best statisticians and programmers constantly improve old functions andcreate and distribute new R functions and packages. If you have a question about how to do anything in R, you most likely will find an answer to itthrough a Google search.

Combine code, analyses, plots, and textUsing RStudio (an editor that interfaces and helps with R) and its RMarkdown function, you can combine R code, analyses, plots, and written textinto 1 nice-looking document. This allows to easily share all your data manipulation and analysis steps, making them transparent, shareable, andreproducible

Driving a car vs. riding a busWanting to compare R and SPSS, a certain Greg Snow wrote the following on stackoverflow (https://stackoverflow.com/a/3790270):

Busses are very easy to use, you just need to know which bus to get on, where to get on, and where to get off (and you need to pay your fare).Cars on the other hand require much more work, you need to have some type of map or directions (even if the map is in your head), you need toput gas in every now and then, you need to know the rules of the road (have some type of drivers licence). The big advantage of the car is that itcan take you a bunch of places that the bus does not go and it is quicker for some trips that would require transfering between busses.

Using this analogy programs like SPSS are busses, easy to use for the standard things, but very frustrating if you want to do something that isnot already preprogrammed.

R is a 4-wheel drive SUV (though environmentally friendly) with a bike on the back, a kayak on top, good walking and running shoes in thepasenger seat, and mountain climbing and spelunking gear in the back.

R can take you anywhere you want to go if you take time to learn how to use the equipment, but that is going to take longer than learning wherethe bus stops are in SPSS.

Large variety and flexibility of graphs in R:e.g. show word frequency with a word cloud

https://sebkorb.wordpress.com/

https://stackoverflow.com/a/3790270



see examples in the R-graph gallery (https://www.r-graph-gallery.com/wordcloud/), and description of the wordcloud2 package here (https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html)

Several word clouds were included in this review article:

To create your own word clouds, you need to install the wordcloud2 package

library(wordcloud2) head(demoFreq) # some word frequency count

## word freq ## oil oil 85 ## said said 73 ## prices prices 48 ## opec opec 42 ## mln mln 31 ## the the 26

# Now let's make the corresponding word cloud. Notice how frequent words like 'oil' appear larger compared to less # frequent words like 'the' wordcloud2(data = demoFreq)

https://www.r-graph-gallery.com/wordcloud/

https://cran.r-project.org/web/packages/wordcloud2/vignettes/wordcloud.html



Make interactive graphs

Use the package Plotly (https://plot.ly/r/getting-started/) to create interactive graphs, which you can also embed in your website. This allows you to better explore, understand, and explain your data.

install.packages("plotly")

Let’s make interactive boxplots with the dataset “midwest” from the package “ggplot2”. It contains demographic information of midwest counties across 5 states (IL, IN, MI, OH, WI). We’ll focus on the percentage of people with a college education.

https://plot.ly/r/getting-started/



library(plotly)

data(midwest) p <- plot_ly(midwest, x = ~percollege, color = ~state, type = "box") p

Moving the cursor over the graph allows you to see the values corresponding to the quartiles and the outliers.

For example, you can see that in Wisconsin Dane county (where the capital Madison lies) has a 43.6 % of college educated people.



as.character(midwest[midwest$state == "WI" & midwest$percollege >= 43, "county"])

## [1] "DANE"

How to set up R:Download R from the “mirror” closest to you (https://cran.r-project.org/mirrors.html)Install the free version of RStudio Desktop (https://www.rstudio.com/products/rstudio/download/#download). This is an Editor and makes itmuch nicer to work with R. RStudio automatically recognizes your R version. But if you want to check just type version in the ConsoleR comes with some basic functions. But very often you will need additional “packages” or functions. Each new package needs to be installedonce.For example, to download the widely used package ggplot from CRAN, the Comprehensive R Archive Network, typeinstall.packages("ggplot2") .

Once installed, you must load a package every time you want to use it. To do so, add library(ggplot2) to (the beginnning of) your script.Alternatively, use the “Packages”" tab in the bottom-right corner of the RStudio interface.

* There is also the possibility to use a function from a package

without loading it, or better by loading it temporarily. For example, to use the chordDiagram function from the package circlize :

https://cran.r-project.org/mirrors.html

https://www.rstudio.com/products/rstudio/download/#download



circlize::chordDiagram(matrix(sample(10), nrow = 4, ncol = 5))

PackagesPackages are sets of functions that some smart R-wizard wrote and made available for everybody.

Most packages can be downloades from the Comprehensive R Archive Network (CRAN) (https://cran.r-project.org/). But actually you don’t need togo to that website and instead can do it from within R or RStudio.

https://cran.r-project.org/



For that hit the “Pachages” tab (probably on the lower right of your RStudio), then “Install”, and enter the name of the desired package. RStudioshould download this automatically. Alternatively use the function install.packages("NAME.OF.YOUR.PACKAGE") .

Now the package has been installed, and will be included in the list of packages in RStudio. However, you still need to load it everytime you wantto use it. For that either click the box next to the package name in the “Packages” tab of RStudio. Or use this command:library(NAME.OF.YOUR.PACKAGE) (no more quotation marks needed this time).

If you want to see a list of all the packages that you have currently loaded, type search() , and to see where on your computer these packagesare stored type searchpaths()

Some of the packages I use most are: ggplot2 for graphs, lme4 for linear mixed models, but also car, Hmisc .

How to install old packagesIf you are not using the latest R version (e.g. because your old OS does not allow it), it might be necessary to install older versions of certain packages. THIS (https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages) link explains you how to do it.

require(devtools) install_version("ggplot2", version = "0.9.1", repos = "http://cran.us.r-project.org")

Know your version of RSometimes a package might not install or run because your R version is too old. To find out which version you are using, type version in theConsole and hit Enter

version

https://support.rstudio.com/hc/en-us/articles/219949047-Installing-older-versions-of-packages



## _ ## platform i386-w64-mingw32 ## arch i386 ## os mingw32 ## system i386, mingw32 ## status ## major 3 ## minor 5.1 ## year 2018 ## month 07 ## day 02 ## svn rev 74947 ## language R ## version.string R version 3.5.1 (2018-07-02) ## nickname Feather Spray

How to master the basics of the R languageUse Quick-R (https://www.statmethods.net/)Read “An Introduction to R” (https://cran.r-project.org/doc/manuals/r-release/R-intro.html)Read “YaRrr! The Pirate???s Guide to R” (https://bookdown.org/ndphillips/YaRrr/) (Highly recommended!!)Have a look at this short list of the most useful R commands (http://personality-project.org/r/r.commands.html)Check out a series of cheatsheets (https://www.rstudio.com/resources/cheatsheets/) for various R and RStudio functionalitiesFind more links to manuals and tutorials on my Web Page (https://sebkorb.wordpress.com/r-links/)Read the book “Discovering Statistics Using R” (https://us.sagepub.com/en-us/nam/discovering-statistics-using-r/book236067%20) by AndyFieldUse the help in R itself: type ? followed by the name of the function, e.g. ?mean or (with space) ? mean , or get to the start of the onlinehelp with help.start()

https://www.statmethods.net/

https://cran.r-project.org/doc/manuals/r-release/R-intro.html

https://bookdown.org/ndphillips/YaRrr/

http://personality-project.org/r/r.commands.html

https://www.rstudio.com/resources/cheatsheets/

https://sebkorb.wordpress.com/r-links/

https://us.sagepub.com/en-us/nam/discovering-statistics-using-r/book236067%20



How to cite RR Core Team. (2016). R: A language and environment for statistical computing. Vienna, Austria: R Foundation for Statistical Computing.Retrieved from < http://www.R-project.org (http://www.R-project.org) >Also describe which packages you used (or at least the most important ones) e.g. “The data were analyzed with linear mixed effects models(LMMs) using the lmer() function of the lme4 package in R (R Core Team, 2016)”

R MarkdownI have used R Markdown (http://rmarkdown.rstudio.com/) to create this file.

In a nutshell, R Markdown allows you to write a neat looking .html or .pdf file containing both text and R code (or code in other languages like Bashor Python).

The code can be executed and you can have the results show up in your .html file.

I highly recommend checking it out and using it for teaching (you can also create presentations - similar to what you would do with Powerpoint) orto create a data analysis report that can be shared with you coworkers.

Some places where you might find help to learn R Markdown on the internet. Here (http://markdown-guide.readthedocs.io/) is one useful resource,and here (https://gist.github.com/jeromyanglim/2716336) another one.

http://www.r-project.org/

http://rmarkdown.rstudio.com/

http://markdown-guide.readthedocs.io/

https://gist.github.com/jeromyanglim/2716336



Commands, objects and functionsAn object is anything that you create in R and that can contain numbers, text, plots, or statiscal models. Per convention objects are created by writing the name of the object followed by "<-" and then whatever you want to put into it. It does alsowork if you replace the "<-" with "=" , but R users will almost certainly look down on you.we can create objects in R with commandscommands are typically made of 2 parts:

objectsfunctions

and separated by a left-pointing arrow ( <- ), although you could also use an equal sign ( = )

# for example, if I wanted to put the year 2018 into an object (also called variable) with the name thisyear, I'd write thisyear <- 2018 # this automatically creates it and one way to check is by typing just the name thisyear # R is case sensitive, so be careful with the big and small caps!

## [1] 2018



ScriptsEverything you can do in the Console you can also do in a Script

Using scripts allows you to save your commands, and reuse them on later occasions. This also allows to save time be adapting parts of old scripts to analyze new data

To open a new script, go to File > New File > Script … or use the shortcut (on a Mac it is: cmd + Shift + Enter)

To execute the entire script, hit the RUN button

To execute 1 line, put the cursor somewhere in that line and hit crtl + Enter (windows) or cmd + Enter (Mac)

To execute more than 1 line select them (holding down left mouse button) and then hit crtl/cmd + Enter



Comment generouslyI strongly advise to put a lot of comments into your R code. This will make it easier to understand for your colleagues with whom you might share it,but also for you when you look at it after some months or years.

In R, everything appearing after a hashtag (#) is interpreted as a comment, and therefore not executed as code.

In contrast to other programming languages (e.g. Python), you however cannot write comments that span several lines. To do that, write # at thebeginning of each new line.

EnvironmentIn RStudio you’ll see an Environment window (probably on the right)This is the workspace, where every object of that session is saved

thik of it as your working memorywhen you quit R (cmd + Q on a Mac), you are asked if you want to save the Workspace

this creates a .RData file with all the objects of your workspace and the contents of the Console windowHowever, it is not really necessary to save the workspace, if you have a script that can reproduce it anytime. Just make sure to save thescript, and you’ll be fine.

My list of R commandsGet the working directoryBy default R goes to the directory in which R is saved on your computer.

Verify with:

getwd() # find out which folder on your computer R is looking at

## [1] "X:/PIs/Seba/Teaching/R/R_important_commands_list_Seba"

Set the working directory



Type setwd("/DISK/FOLDER/") to change working directory to Or use the mouse to go to Session > Set Working Directory

Create objectsAn object is anything that you creates and that can contain numbers, text, plots, or statiscal models. Per convention objects are created by writingthe name of the object followed by <- and then whatever you want to put into it. It does also work if you replace the <- with = (as it is done inMatlab), but R users will almost certainly look down on you.

# for example, if I wanted to put the year 2017 into an object (also called variable) with the name thisyear, I'd write thisyear <- 2017 # this automatically creates it and one way to check is by typing just the name thisyear # R is case sensitive, so be careful with the big and small caps!

## [1] 2017

Clear objectsWhen I start a new R script I typically first delete everything in the Workspace (in RStudio it is called Environment) For that I use:

rm(list = ls())

If instead you want to delete only a specific object, say the variable a :

a <- 1 # let's create the variable with name 'a' and fill it with the integer 1 rm(a) # or remove several variables at once (however without emptying the Workspace) rm(a, b, c)

Check what objects are in your EnvironmentOf course you could do this simply by hitting the Environment tab…. …but here is how you do it by writing in the Console:

a <- 1 # let's create variable a again ls()



## [1] "a"

Create a new folderMyfolder <- "/Users/seba/TESTFOLDER/" dir.create(Myfolder)

List files in a folderTo list all the files in a specific directory, you can use the function list.files()

list.files(Myfolder) # if you don't write anything it lists what is in the working directory

You can also look for files with a specific extension

# To only list files containing the .html extension Sourcedir = Myfolder list.files(path = Sourcedir, pattern = ".html")

Or you might want to be even more specific, and indicate an extension and some text in the middle of the name

# To select .html files that also contain 'course' in the name You create 'wildcards' with `.*` Sourcedir = Myfolder list.files(path = Sourcedir, pattern = ".*course.*.html")

You come to the same solution using the function Sys.glob() , which allows to do wildcard expansion (also known al “globbing”) on file paths

Sys.glob("*course*.html")

Data typesThere are several types of data R can handle (e.g. see here (https://www.statmethods.net/input/datatypes.html) for a list).

https://www.statmethods.net/input/datatypes.html



The data types you will be working with most of the time are these:

Vectors contain series of numbers (double), texts (characters or strings), or TRUE/FALSE answers (logical)

# create a numeric vector. use c() to combine values. a <- c(0, NA, 3.2, -15, sqrt(4)) # NA means not available; sqrt() computes the square root # create a character vector b <- c("Hello", "World", "how", "are", "you") # the quotes (single or double both work) indicate a text # create a logical vector c <- c(TRUE, FALSE, TRUE, FALSE, FALSE) # must all be caps!

Data frames are like Excel’s Worksheets, or SPSS datasets. Each column can contain different things (numeric, character, etc.)

# create a data frame with the vectors a, b,and c that we just created my_first_dataframe <- data.frame(a, b, c) # we could also change the column names (currently they are a, b, c) colnames(my_first_dataframe) <- c("some_numbers", "my_sentence", "logic_test") # now let's have a look at it my_first_dataframe

## some_numbers my_sentence logic_test ## 1 0.0 Hello TRUE ## 2 NA World FALSE ## 3 3.2 how TRUE ## 4 -15.0 are FALSE ## 5 2.0 you FALSE

You might have noticed, that I put underscores ( _ ) instead of spaces in the column names. The reason is that this makes working with themeasier, trust me.

Lists are ordered collections of things, which can be wildly different (strings, numeric vectors, matrices, scalars, etc.). They are highlyflexible

# create a list of 5 elements my_first_list <- list(aName = "whatever", aNum = 3.7, NumericVector = a, CharacterVector = b, Matrix = my_first_dataframe)

Factors store nominal values, which is useful for statistical analyses. More precisely, factors store nominal values in one vector with integers(e.g. with each number corresponding to a unique condition in an experiment), and an internal vector of character strings (the original



values) mapped to these integers

# the 2nd column in the just created data frame "my_first_dataframe"" is a factor # R has decided so by itself, since that column does not contain numbers or logical data # One way to find out is this: str(my_first_dataframe)

## 'data.frame': 5 obs. of 3 variables: ## $ some_numbers: num 0 NA 3.2 -15 2 ## $ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5 ## $ logic_test : logi TRUE FALSE TRUE FALSE FALSE

# compare to the list "my_first_list" str(my_first_list)

## List of 5 ## $ aName : chr "whatever" ## $ aNum : num 3.7 ## $ NumericVector : num [1:5] 0 NA 3.2 -15 2 ## $ CharacterVector: chr [1:5] "Hello" "World" "how" "are" ... ## $ Matrix :'data.frame': 5 obs. of 3 variables: ## ..$ some_numbers: num [1:5] 0 NA 3.2 -15 2 ## ..$ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5 ## ..$ logic_test : logi [1:5] TRUE FALSE TRUE FALSE FALSE

** Change a factor to numeric or character It is possible to change a factor into a numeric vector (if it contains numbers), or into a vector filled withcharacters (if it contains text).

# we add a new column to my_first_dataframe, call it 'my_sentence_char', and fill it with a character version of 'my # sentence' (i.e. we transform from factor to character) my_first_dataframe$my_sentence_char <- as.character(my_first_dataframe$my_sentence) # then we check again str(my_first_dataframe)



## 'data.frame': 5 obs. of 4 variables: ## $ some_numbers : num 0 NA 3.2 -15 2 ## $ my_sentence : Factor w/ 5 levels "are","Hello",..: 2 4 3 1 5 ## $ logic_test : logi TRUE FALSE TRUE FALSE FALSE ## $ my_sentence_char: chr "Hello" "World" "how" "are" ...

Read data in Rread.delim(this is perfect for reading delimited files, defaulting to the TAB character for the delimiter) but see also read.table() , read.csv() , read.spss()

Here (https://www.datacamp.com/community/tutorials/r-data-import-tutorial) is a useful tutorial by DataCamp on how to import different types ofdata into R

First you need to tell R where to find the data. You can do this by setting the working directory to the right folder using setwd() But I actually have the habit of first defining a working directory in my script, and then concatenating its name with the name of the file I want toread in.

SourceDir <- "/Users/seba/TESTFOLDER/" Mydata <- read.delim(paste(SourceDir, "MyFile.txt", sep = ""), header = TRUE) # the function *paste* concatenates objects, in this case SourceDir and 'MyFile.txt' with sep='' I asked it to do so # without adding anything in between # if your data is a tab-delimited .txt file, you can indicate the separator Mydata <- read.table(paste(SourceDir, "MyFile.txt", sep = ""), header = TRUE, sep = "\t")

Built-in dataR comes with some data of its own (in the package datasets), and the same is true for many additional packages. On this web page(http://www.sthda.com/english/wiki/r-built-in-data-sets) some of the more common datasets are described. To know all the data your R has built intype data() . To know the data in a specific package type try(data(package = "datasets") )

In the package “datasets” you’ll find the data mtcars, which is one of the better known built-in datasets. Load it with data(mtcars)

https://www.datacamp.com/community/tutorials/r-data-import-tutorial

http://www.sthda.com/english/wiki/r-built-in-data-sets



data(mtcars) # load into R the built-in dataset called 'mtcars' # this creates the object mtcars in the Environment

To get some information about this dataset type help(mtcars) or ?mtcars

You can also look at the actual data, either by typing mtcars in the Console (this will write all the data), or by double clicking with the left mousebutton on mtcars in the Environment tab. This will open a new window, which will look very similar to an Excel Worksheet. The same can beobtained by typing View(mtcars) in the Console.

Saving dataYou can save as

tab-delimited .txt file with write.table(mtcars, "mydata.txt", sep="\t") (this can be opened by every program, and can be read inExcel)

If you do not indicate a path, it will be saved in the working directory (remember that you can use getwd() to know, and setwd() to change theworking directory)

I personally prefer defining a “Savedir” folder at the beginning of my script, and then use that when saving

Savedir <- "/Users/seba/my_R" dir.create(Savedir) # if you need to create the folder (it will not be overwritten if it exists) write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), sep = "\t") # this works almost fine. But better to exclude the row names using row.names=FALSE: write.table(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), sep = "\t", row.names = FALSE)

You can also save directly to an .xlsx spreadsheet using write.xlsx()

library(xlsx) write.xlsx(mtcars, paste(Savedir, "/this_is_my_dataframe.xlsx", sep = ""), row.names = FALSE)

Save to SPSS using write.foreign() (but honestly this didn’t work for me, so I recommend importing the .txt file in SPSS instead)



library(foreign) rownames(mtcars) <- c() write.foreign(mtcars, paste(Savedir, "/this_is_my_dataframe.txt", sep = ""), paste(Savedir, "/this_is_my_dataframe.sps", sep = ""), package = "SPSS")

Explore data (descriptive statistics)class(mtcars) # the mtcars object is a data.frame, something similar to an Excel spreadsheet

## [1] "data.frame"

dim(mtcars) # get dimensions (rows, columns)

## [1] 32 11

nrow(mtcars) # get the number of rows (specific for data.frames)

## [1] 32

ncol(mtcars) # get the number of columns (specific for data.frames)

## [1] 11

names(mtcars) # get the variablenames

## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" ## [11] "carb"



library(car) some(mtcars) # randomly selects 10 rows (part of the car package)

## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1 ## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1 ## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4 ## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4 ## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4 ## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

head(mtcars) # shows the first 6 rows

## mpg cyl disp hp drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1

tail(mtcars) # shows the last 6 rows

## mpg cyl disp hp drat wt qsec vs am gear carb ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.7 0 1 5 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.9 1 1 5 2 ## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.5 0 1 5 4 ## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.5 0 1 5 6 ## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.6 0 1 5 8 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.6 1 1 4 2



summary(mtcars) # summarizes data per column (min, median, mean, max, etc.)

## mpg cyl disp hp ## Min. :10.40 Min. :4.000 Min. : 71.1 Min. : 52.0 ## 1st Qu.:15.43 1st Qu.:4.000 1st Qu.:120.8 1st Qu.: 96.5 ## Median :19.20 Median :6.000 Median :196.3 Median :123.0 ## Mean :20.09 Mean :6.188 Mean :230.7 Mean :146.7 ## 3rd Qu.:22.80 3rd Qu.:8.000 3rd Qu.:326.0 3rd Qu.:180.0 ## Max. :33.90 Max. :8.000 Max. :472.0 Max. :335.0 ## drat wt qsec vs ## Min. :2.760 Min. :1.513 Min. :14.50 Min. :0.0000 ## 1st Qu.:3.080 1st Qu.:2.581 1st Qu.:16.89 1st Qu.:0.0000 ## Median :3.695 Median :3.325 Median :17.71 Median :0.0000 ## Mean :3.597 Mean :3.217 Mean :17.85 Mean :0.4375 ## 3rd Qu.:3.920 3rd Qu.:3.610 3rd Qu.:18.90 3rd Qu.:1.0000 ## Max. :4.930 Max. :5.424 Max. :22.90 Max. :1.0000 ## am gear carb ## Min. :0.0000 Min. :3.000 Min. :1.000 ## 1st Qu.:0.0000 1st Qu.:3.000 1st Qu.:2.000 ## Median :0.0000 Median :4.000 Median :2.000 ## Mean :0.4062 Mean :3.688 Mean :2.812 ## 3rd Qu.:1.0000 3rd Qu.:4.000 3rd Qu.:4.000 ## Max. :1.0000 Max. :5.000 Max. :8.000

library(Hmisc) # load the Hmisc package (but the describe function also exists in the psych package) describe(mtcars)



## mtcars ## ## 11 Variables 32 Observations ## --------------------------------------------------------------------------- ## mpg ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 25 0.999 20.09 6.796 12.00 14.34 ## .25 .50 .75 .90 .95 ## 15.43 19.20 22.80 30.09 31.30 ## ## lowest : 10.4 13.3 14.3 14.7 15.0, highest: 26.0 27.3 30.4 32.4 33.9 ## --------------------------------------------------------------------------- ## cyl ## n missing distinct Info Mean Gmd ## 32 0 3 0.866 6.188 1.948 ## ## Value 4 6 8 ## Frequency 11 7 14 ## Proportion 0.344 0.219 0.438 ## --------------------------------------------------------------------------- ## disp ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 27 0.999 230.7 142.5 77.35 80.61 ## .25 .50 .75 .90 .95 ## 120.83 196.30 326.00 396.00 449.00 ## ## lowest : 71.1 75.7 78.7 79.0 95.1, highest: 360.0 400.0 440.0 460.0 472.0 ## --------------------------------------------------------------------------- ## hp ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 22 0.997 146.7 77.04 63.65 66.00 ## .25 .50 .75 .90 .95 ## 96.50 123.00 180.00 243.50 253.55 ## ## lowest : 52 62 65 66 91, highest: 215 230 245 264 335 ## --------------------------------------------------------------------------- ## drat ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 22 0.997 3.597 0.6099 2.853 3.007



## .25 .50 .75 .90 .95 ## 3.080 3.695 3.920 4.209 4.314 ## ## lowest : 2.76 2.93 3.00 3.07 3.08, highest: 4.08 4.11 4.22 4.43 4.93 ## --------------------------------------------------------------------------- ## wt ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 29 0.999 3.217 1.089 1.736 1.956 ## .25 .50 .75 .90 .95 ## 2.581 3.325 3.610 4.048 5.293 ## ## lowest : 1.513 1.615 1.835 1.935 2.140, highest: 3.845 4.070 5.250 5.345 5.424 ## --------------------------------------------------------------------------- ## qsec ## n missing distinct Info Mean Gmd .05 .10 ## 32 0 30 1 17.85 2.009 15.05 15.53 ## .25 .50 .75 .90 .95 ## 16.89 17.71 18.90 19.99 20.10 ## ## lowest : 14.50 14.60 15.41 15.50 15.84, highest: 19.90 20.00 20.01 20.22 22.90 ## --------------------------------------------------------------------------- ## vs ## n missing distinct Info Sum Mean Gmd ## 32 0 2 0.739 14 0.4375 0.5081 ## ## --------------------------------------------------------------------------- ## am ## n missing distinct Info Sum Mean Gmd ## 32 0 2 0.724 13 0.4062 0.498 ## ## --------------------------------------------------------------------------- ## gear ## n missing distinct Info Mean Gmd ## 32 0 3 0.841 3.688 0.7863 ## ## Value 3 4 5 ## Frequency 15 12 5 ## Proportion 0.469 0.375 0.156 ## ---------------------------------------------------------------------------



## carb ## n missing distinct Info Mean Gmd ## 32 0 6 0.929 2.812 1.718 ## ## Value 1 2 3 4 6 8 ## Frequency 7 10 3 10 1 1 ## Proportion 0.219 0.312 0.094 0.312 0.031 0.031 ## ---------------------------------------------------------------------------

str(mtcars) # this is what you would get by clicking the >Arrow button next to the mtcars object in the Environment

## 'data.frame': 32 obs. of 11 variables: ## $ mpg : num 21 21 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 ... ## $ cyl : num 6 6 4 6 8 6 8 4 4 6 ... ## $ disp: num 160 160 108 258 360 ... ## $ hp : num 110 110 93 110 175 105 245 62 95 123 ... ## $ drat: num 3.9 3.9 3.85 3.08 3.15 2.76 3.21 3.69 3.92 3.92 ... ## $ wt : num 2.62 2.88 2.32 3.21 3.44 ... ## $ qsec: num 16.5 17 18.6 19.4 17 ... ## $ vs : num 0 0 1 1 0 1 0 1 1 1 ... ## $ am : num 1 1 1 0 0 0 0 0 0 0 ... ## $ gear: num 4 4 4 3 3 3 3 4 4 4 ... ## $ carb: num 4 4 1 1 2 1 4 2 2 4 ...

Checking normalityVerify distribution, and check if variables follow a Normal distribution

A common assumption necessary for many statistical analyses is that the data is approximately normally distributed. There are several ways of testing this.

For visual inspection, use a histogram and the qqPlot (quantile-quantile plot).

For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), andlook to see whether the points lie approximately on a straight line.



If the data is normally distributed, the points will align on the diagonal of the qqplot

The variable “weight” in the “ChickWeight” dataset clearly does NOT follow a normal distribution

# load the 'car'' package to use the function qqPlot() library(car) par(mfrow = c(1, 2)) hist(ChickWeight$weight) qqPlot(ChickWeight$weight)

## [1] 400 399

For a numerical test of normality, use the Shapiro-Wilk test. If the test is signifiacnt (p<.05), then the data is NOT normally distributed.

But keep in mind that tiny variations from normality can lead to a significant Shapiro-Wilk test if the sample size is large (or the dataset containsmany trials). Also, the sample size must be between 3 and 5000. Otherwise the test won’t work!



Thus do not blindly rely on this test, and always look at the histogram and qqplot too.

shapiro.test(ChickWeight$weight)

## ## Shapiro-Wilk normality test ## ## data: ChickWeight$weight ## W = 0.90866, p-value < 2.2e-16

Skewness, kurtosis, and the Shapiro-Wilk test are also obtained if using the stat.desc() function from the pastecs package.

Put norm = TRUE to include the Shapiro-Wilk test

library(pastecs) stat.desc(ChickWeight$weight, basic = FALSE, norm = TRUE)

## median mean SE.mean CI.mean.0.95 var ## 1.030000e+02 1.218183e+02 2.956204e+00 5.806232e+00 5.051223e+03 ## std.dev coef.var skewness skew.2SE kurtosis ## 7.107196e+01 5.834258e-01 9.587845e-01 4.717394e+00 3.372147e-01 ## kurt.2SE normtest.W normtest.p ## 8.309955e-01 9.086615e-01 3.873675e-18

Checking homogeneity of varianceAnother assumption for parametric statistics is the homogeneity of variances (Data from multiple groups have the same variance).

It can be tested with Levene’s test using the function leveneTest() from the car package (centering on the median by default).

If the Levene test is significant, it means that the assumption of homogeneity of variances has been violated.

library(car) leveneTest(ChickWeight$weight, ChickWeight$Diet, center = median)



## Levene's Test for Homogeneity of Variance (center = median) ## Df F value Pr(>F) ## group 3 9.6001 3.418e-06 *** ## 574 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indexing elements in a vector or dataframeA straightforward way to index elements in a vector or matrix (data frame) is by giving the element number inside square brackets. If you want only1 element of a vector, then 1 number in brackets is enough.

# For example, let's create a vector called 'whatever' with 6 numbers: whatever <- c(1, 2, 3, 4, 5, 6) whatever

## [1] 1 2 3 4 5 6

# btw, you get the same by writing it like this whatever <- c(1:6) # the colon tells R to include the numbers 1 and 6, and create all the numbers increasing by 1 in between whatever # let's check again

## [1] 1 2 3 4 5 6

Btw, in Matlab (another program that psychologists and neuroscientists like to use) you could easily create a vector in which numbers increase by2: whatever_2 <- c(1:2:6) but unfortunately this does not work in R (you can do it with the function seq() though)

Now lets extract the 3rd number in the vector “whatever”.

whatever[3] # Can you guess what it will be?

## [1] 3



And the 2nd to 5th element of “whatever”:

whatever[2:5]

## [1] 2 3 4 5

In case you were wondering what they are, the numbers in [square brackets] on the left of the output help you see how many elements there are

With dataframes it works slightly differently. When you work with dataframes, you need to input 2 numbers, i.e. rows and columns (in this order!)separated by comma.

Indexing of a Matlab matrix (https://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html) works similarly, with thedifference that in Matlab you could also do linear indexing, which means going down the columns one by one.

mtcars[, 3] # all the elements (rows) of the 3rd column in the dataframe mtcars

## [1] 160.0 160.0 108.0 258.0 360.0 225.0 360.0 146.7 140.8 167.6 167.6 ## [12] 275.8 275.8 275.8 472.0 460.0 440.0 78.7 75.7 71.1 120.1 318.0 ## [23] 304.0 350.0 400.0 79.0 120.3 95.1 351.0 145.0 301.0 121.0

mtcars[3, ] # all the elements (columns) of the 3rd row in the dataframe mtcars

## mpg cyl disp hp drat wt qsec vs am gear carb ## Datsun 710 22.8 4 108 93 3.85 2.32 18.61 1 1 4 1

mtcars[1:5, c(3, 4)] # The first 5 rows of columns 3 and 4 in the dataframe mtcars

## disp hp ## Mazda RX4 160 110 ## Mazda RX4 Wag 160 110 ## Datsun 710 108 93 ## Hornet 4 Drive 258 110 ## Hornet Sportabout 360 175

https://www.mathworks.com/company/newsletters/articles/matrix-indexing-in-matlab.html



# this gives the last 17 rows and the last 6 columns of the dataframe mtcars, based on the total number of rows and # columns mtcars[(nrow(mtcars)/2):nrow(mtcars), ceiling((ncol(mtcars)/2)):ncol(mtcars)]

## wt qsec vs am gear carb ## Lincoln Continental 5.424 17.82 0 0 3 4 ## Chrysler Imperial 5.345 17.42 0 0 3 4 ## Fiat 128 2.200 19.47 1 1 4 1 ## Honda Civic 1.615 18.52 1 1 4 2 ## Toyota Corolla 1.835 19.90 1 1 4 1 ## Toyota Corona 2.465 20.01 1 0 3 1 ## Dodge Challenger 3.520 16.87 0 0 3 2 ## AMC Javelin 3.435 17.30 0 0 3 2 ## Camaro Z28 3.840 15.41 0 0 3 4 ## Pontiac Firebird 3.845 17.05 0 0 3 2 ## Fiat X1-9 1.935 18.90 1 1 4 1 ## Porsche 914-2 2.140 16.70 0 1 5 2 ## Lotus Europa 1.513 16.90 1 1 5 2 ## Ford Pantera L 3.170 14.50 0 1 5 4 ## Ferrari Dino 2.770 15.50 0 1 5 6 ## Maserati Bora 3.570 14.60 0 1 5 8 ## Volvo 142E 2.780 18.60 1 1 4 2

But a really cool thing in R is that it is very easy to index a column in a dataframe based on its name (since 2013, this is also possible when usingthe “Table” format in Matlab, but not as easy as it is in R).

# for example, I could first check again the names of the columns in mtcars names(mtcars)


# or use colnames() which is basically the same, use rownames() for the row names colnames(mtcars)




# then I can index with the '$' sign the column called 'hp', which describes the power of a motor in horse power mtcars$hp # this gives you all the elements (rows) in the column hp

## [1] 110 110 93 110 175 105 245 62 95 123 123 180 180 180 205 215 230 ## [18] 66 52 65 97 150 150 245 175 66 91 113 264 175 335 109

# notice that if you extract a column from a dataframe and save it in a new object it no longer will be a dataframe! mtcars_hp <- mtcars$hp class(mtcars_hp) # we now got a vector of numeric elements

## [1] "numeric"

typeof(mtcars_hp) # for R numbers are of the type 'double'

## [1] "double"

# Oddly, if you extract a row from a dataframe it is still a dataframe (some things are just inconsistent I guess) mtcars_row3 <- mtcars[3, ] class(mtcars_row3)

## [1] "data.frame"

Coerce to data.frameWe could however also force mtcars_hp to be a dataframe, just as mtcars, from where it originally came from. This we do with as.data.frame()



mtcars_hp <- as.data.frame(mtcars$hp) class(mtcars_hp)

## [1] "data.frame"

## Finding unique values What are all the unique numbers of cylinders that cars in the mtcars dataset have?

unique(mtcars$cyl)

## [1] 6 4 8

Or do it with the table() function, which also tells you how many car models exist for each number of cylinders

table(mtcars$cyl)

## ## 4 6 8 ## 11 7 14

In your own datasets the table() function might come handy to quickly check how many trials you have per participant and per condition!

## Sorting The function order() gives the ranked position of each element when it is applied to a variable

# for example, we can find the ascending rank of the elements of a a

## [1] 0.0 NA 3.2 -15.0 2.0

order(a)

## [1] 4 1 5 3 2



The smallest element of a is -15, and it is in 4th position. The next smallest element is 0, which is the 1st element of a…. and so forth Knowing this, we can use the output of order(a) to reshuffle a:

a[order(a)]

## [1] -15.0 0.0 2.0 3.2 NA

It works similarly with data frames:

# sort data frame mtcars by 1 variable new_order <- order(mtcars$mpg) # get the indexes corresponding to the smallest to largest mpg mtcars_sort <- mtcars[new_order, ] # now put the rows of mtcars into the new order head(mtcars_sort)

## mpg cyl disp hp drat wt qsec vs am gear carb ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 ## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 ## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 ## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8

# sort matrix mtcars by 2 variables: miles per gallon and displacement new_order <- order(mtcars$mpg, mtcars$disp) # use mpg and disp mtcars_sort <- mtcars[new_order, ] head(mtcars_sort)

## mpg cyl disp hp drat wt qsec vs am gear carb ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 ## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 ## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 ## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8



# sort by mpg ascending, and by cyl descending (by putting a minus sign) new_order <- order(mtcars$mpg, -mtcars$cyl) mtcars_sort <- mtcars[new_order, ] head(mtcars_sort)

## mpg cyl disp hp drat wt qsec vs am gear carb ## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4 ## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4 ## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4 ## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4 ## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4 ## Maserati Bora 15.0 8 301 335 3.54 3.570 14.60 0 1 5 8

## Replicate Create a vector filled with 10 times the same number:

c(rep(7, 10))

## [1] 7 7 7 7 7 7 7 7 7 7

Create a vector filled with 5 times the same string:

c(rep("hello", 5))

## [1] "hello" "hello" "hello" "hello" "hello"

## Dealing with missing values (For a really good video tutorial (including hands-on exercises) on missing and special values see here(https://campus.datacamp.com/courses/cleaning-data-in-r/1828?ex=9) )

Missing values are represented by NA (not available), but sometimes also by a dot (“.”) or an empty space.

The function is.na() helps you to find missing data in your dataset

https://campus.datacamp.com/courses/cleaning-data-in-r/1828?ex=9



# let's create a dataframe with 4 rows, 2 columns, and 3 missing values mydata <- data.frame(A = c(2, 5, NA, 7), B = c(NA, 1, NA, 9)) mydata

## A B ## 1 2 NA ## 2 5 1 ## 3 NA NA ## 4 7 9

# then check for missing data is.na(mydata)

## A B ## [1,] FALSE TRUE ## [2,] FALSE FALSE ## [3,] TRUE TRUE ## [4,] FALSE FALSE

# To simply know if there are ANY missing data use any(): any(is.na(mydata))

## [1] TRUE

# For R, TRUE == 1, and FALSE == 0 Therefore, we can count the number of missing data with sum(is.na(mydata))

## [1] 3

Another way to see the number of NAs is summary()

summary(mydata)



## A B ## Min. :2.000 Min. :1 ## 1st Qu.:3.500 1st Qu.:3 ## Median :5.000 Median :5 ## Mean :4.667 Mean :5 ## 3rd Qu.:6.000 3rd Qu.:7 ## Max. :7.000 Max. :9 ## NA's :1 NA's :2

complete.cases() tells you which rows do contain NAs (FALSE), and which ones do not (TRUE)

complete.cases(mydata)

## [1] FALSE TRUE FALSE TRUE

Use na.omit() to remove cases (i.e. rows) with NAs

# the result is a dataframe with only 2 rows and 2 columns na.omit(mydata)

## A B ## 2 5 1 ## 4 7 9

Many functions only work if you indicate to remove missing values, which you do with na.rm = TRUE :

mean(mydata$A) # returns NA

## [1] NA

mean(mydata$A, na.rm = TRUE) # this works

## [1] 4.666667



colMeans(mydata) # returns NA NA

## A B ## NA NA

colMeans(mydata, na.rm = TRUE) # this works

## A B ## 4.666667 5.000000

## Trasforming data to deal with non-normality Many common statistical tests assume that the data is normally distributed, and that the variancesare similar across groups (or conditions). Violation of these assumptions can lead to wrong results.

Luckily, there are several ways to “beat” (transform) data to make it more normal. Many types of data transformations can be used. For anoverview, and a useful tool, see the package bestNormalize() Alternatively, use nonparametric statistics.

For example, RT data is typically right-skewed, but can become more normal if you apply a log() or log10() transformation.

# create right-skewed distribution using rexgauss() from the package retimes rt_dist1 <- rexgauss(5000, 300, 100, 200, positive = T) hist(rt_dist1, xlab = "RT in milliseconds", axes = F, main = "Positive Skewed")



# test for normality -> it is not normal shapiro.test(rt_dist1)

## ## Shapiro-Wilk normality test ## ## data: rt_dist1 ## W = 0.91303, p-value < 2.2e-16



# histogram of log-transformed data -> looks much more normal! hist(log(rt_dist1), xlab = "RT in milliseconds", axes = F, main = "after log() transformation")

# test for normality -> still not normal shapiro.test(log(rt_dist1))



## ## Shapiro-Wilk normality test ## ## data: log(rt_dist1) ## W = 0.9873, p-value < 2.2e-16

Thus although the Shapiro-Wilks remained positive, the data looked much more normally distributed after logarithmic transformation.

Based on visual inspection of the histogram, and keeping in mind that for example ANOVA is quite robust against violating the normalityassumption, one could go on and do statistics on this data (but not everybody might agree).

Also keep in mind that the shapiro.test() function does not work with vectors with more than 5000 elements :(

## Subsetting, splitting and merging data frames ### Subsetting We have already seen how to select only parts of a vector or data frame (e.g. mtcars[1:3, 1:3] selects the first 3 rows and columns)

It is also easy to exclude only a few columns and keep all the rest:

# say you want to remove the 2nd column (cyl) and the 4th column (hp) mtcars_pruned <- mtcars[c(-2, -4)] # put a - in front of the columns you want to remove head(mtcars_pruned)

## mpg disp drat wt qsec vs am gear carb ## Mazda RX4 21.0 160 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 160 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 108 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 258 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 360 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 225 2.76 3.460 20.22 1 0 3 1

# or we could do so by indicating the names of the columns; let's first find them cols_to_remove <- which(colnames(mtcars) == "disp" | colnames(mtcars) == "hp") # the | sign means 'or' # then remove them mtcars_pruned <- mtcars[-cols_to_remove] head(mtcars_pruned)



## mpg cyl drat wt qsec vs am gear carb ## Mazda RX4 21.0 6 3.90 2.620 16.46 0 1 4 4 ## Mazda RX4 Wag 21.0 6 3.90 2.875 17.02 0 1 4 4 ## Datsun 710 22.8 4 3.85 2.320 18.61 1 1 4 1 ## Hornet 4 Drive 21.4 6 3.08 3.215 19.44 1 0 3 1 ## Hornet Sportabout 18.7 8 3.15 3.440 17.02 0 0 3 2 ## Valiant 18.1 6 2.76 3.460 20.22 1 0 3 1

Another useful function is subset()

# to select the cars with 4 cylinders and at least 70 horse power: subset(mtcars, cyl == 4 & hp > 70)

## mpg cyl disp hp drat wt qsec vs am gear carb ## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 ## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2 ## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1 ## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 ## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 ## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

### Combining data frames Use cbind() to put data frames next to each other (add columns), and rbind() below each other (add rows)

mtcars_wide <- cbind(mtcars_pruned, mtcars) colnames(mtcars_wide)

## [1] "mpg" "cyl" "drat" "wt" "qsec" "vs" "am" "gear" "carb" "mpg" ## [11] "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

Or combine them in the other direction

mtcars_long <- rbind(mtcars[1:2, ], mtcars[31:32, ]) # stack the first 2 and the last 2 rows dim(mtcars_long)

## [1] 4 11



But this does not work if the 2 data frames differ in the other dimension (number of variables)

# e.g. you cannot put a data frame with 9 columns on top of one with 11 columns mtcars_doesntwork <- rbind(mtcars_pruned, mtcars) # you get an error

## Error in rbind(deparse.level, ...): numbers of columns of arguments do not match

## Combining datasets with different sizes Use cbind.fill() from package rowr to combine horizontally data frames with different N of rows. E.g. this does not work: cbind(mtcars, iris)

library(rowr) # ATTENTION: if you do not specify 'fill=NA' the rows of the smaller data frame will be repeated! cars_and_flowers_wide <- cbind.fill(mtcars, iris, fill = NA) cars_and_flowers_wide[32:33, 1:12]

## mpg cyl disp hp drat wt qsec vs am gear carb Sepal.Length ## 32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 5.4 ## 33 NA NA NA NA NA NA NA NA NA NA NA 5.2

Use rbind.fill() from the package plyr to combine vertically data frames with different N of columns. Again, this does not work: rbind(mtcars, iris)

library(plyr) cars_and_flowers_long <- rbind.fill(mtcars, iris) cars_and_flowers_long[32:33, 1:12]

## mpg cyl disp hp drat wt qsec vs am gear carb Sepal.Length ## 32 21.4 4 121 109 4.11 2.78 18.6 1 1 4 2 NA ## 33 NA NA NA NA NA NA NA NA NA NA NA 5.1

Now let’s check which cases are complete (have no missing values)

dim(cars_and_flowers_wide)



## [1] 150 16

complete.cases(cars_and_flowers_wide) # (in this case only the first 32 cases are complete)

## [1] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE ## [23] TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE ## [34] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [45] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [56] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [67] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [78] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [89] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [100] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [111] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [122] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [133] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [144] FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Select subset of non-missing cases in a data frame with na.omit()

dim(cars_and_flowers_wide)

## [1] 150 16

cars_and_flowers_wide_noNA = na.omit(cars_and_flowers_wide) dim(cars_and_flowers_wide_noNA)

## [1] 32 16

### Merging data frames Useful for example to combine RTs with questionnaire data (often in separate files)



# load dataset 'iris' data(iris) head(iris)

## Sepal.Length Sepal.Width Petal.Length Petal.Width Species ## 1 5.1 3.5 1.4 0.2 setosa ## 2 4.9 3.0 1.4 0.2 setosa ## 3 4.7 3.2 1.3 0.2 setosa ## 4 4.6 3.1 1.5 0.2 setosa ## 5 5.0 3.6 1.4 0.2 setosa ## 6 5.4 3.9 1.7 0.4 setosa

# split in 2 different datasets (but both contain the 5th column 'Species') iris_1 <- iris[, c(1, 2, 5)] iris_2 <- iris[, c(3, 4, 5)] # then merge them again (order matters) iris_merged <- merge(iris_2, iris_1, by = "Species") head(iris_merged)

## Species Petal.Length Petal.Width Sepal.Length Sepal.Width ## 1 setosa 1.4 0.2 5.1 3.5 ## 2 setosa 1.4 0.2 4.9 3.0 ## 3 setosa 1.4 0.2 4.7 3.2 ## 4 setosa 1.4 0.2 4.6 3.1 ## 5 setosa 1.4 0.2 5.0 3.6 ## 6 setosa 1.4 0.2 5.4 3.9

#### Merge by two variables

iris_3 <- iris[, c(1, 4, 5)] iris_merged <- merge(iris_3, iris_2, by = c("Species", "Petal.Width")) head(iris_merged)



## Species Petal.Width Sepal.Length Petal.Length ## 1 setosa 0.1 4.9 1.4 ## 2 setosa 0.1 4.9 1.1 ## 3 setosa 0.1 4.9 1.5 ## 4 setosa 0.1 4.9 1.4 ## 5 setosa 0.1 4.9 1.5 ## 6 setosa 0.1 4.3 1.4

## Change data frame from LONG to WIDE format In the wide format, a subject’s repeated responses will be in a single row, and each responseis in a separate column.

In the long format, each row is 1 time point x subject. So each subject will have data in multiple rows. Any variables that do not change across time(e.g. subjects’ gender) will have the same value in all the rows.

To change from LONG to WIDE use dcast() from the “reshape2” package

data("ChickWeight") # import the chicken weight data # somehow the order of the levels of the factor $Chick are messed up, so let's change them ChickWeight$Chick <- as.factor(as.numeric(ChickWeight$Chick)) some(ChickWeight)

## weight Time Chick Diet ## 33 163 16 14 1 ## 136 72 8 16 1 ## 145 48 2 4 1 ## 168 41 0 3 1 ## 235 64 4 23 2 ## 358 129 10 37 3 ## 372 144 14 31 3 ## 389 41 0 40 3 ## 510 78 6 42 4 ## 515 147 16 42 4

length(unique(ChickWeight$Time))

## [1] 12



The weight of most chicken was measured at 12 different time points. But some chicken were measured less often. You can verify with the table() function

# see the first 6 rows (chicken). A 0 indicates that no measurement occurred head(table(ChickWeight$Chick, ChickWeight$Time))

## ## 0 2 4 6 8 10 12 14 16 18 20 21 ## 1 1 1 0 0 0 0 0 0 0 0 0 0 ## 2 1 1 1 1 1 1 1 0 0 0 0 0 ## 3 1 1 1 1 1 1 1 1 0 0 0 0 ## 4 1 1 1 1 1 1 1 1 1 1 1 1 ## 5 1 1 1 1 1 1 1 1 1 1 1 1 ## 6 1 1 1 1 1 1 1 1 1 1 1 1

library(reshape2) # now transform to wide format using dcast(): Chicken_wide <- dcast(ChickWeight, # after the name of the dataset put the variables to keep as they are Chick + Diet # then ~ name of variable to split into separate columns ~ Time, # then name of the dependent variable value.var = "weight") head(Chicken_wide)

## Chick Diet 0 2 4 6 8 10 12 14 16 18 20 21 ## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA ## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA ## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA ## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96 ## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98 ## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117



# improve the header colnames(Chicken_wide) # let's check the column names first

## [1] "Chick" "Diet" "0" "2" "4" "6" "8" "10" ## [9] "12" "14" "16" "18" "20" "21"

ts <- rep("t_", 12) # create a vector with 12 times "t_" newnames <- paste(ts, colnames(Chicken_wide)[3:14], sep="") # concatenate "t_" with colnames colnames(Chicken_wide)[3:14] <- newnames head(Chicken_wide)

## Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21 ## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA ## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA ## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA ## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96 ## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98 ## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117

## Change data frame from WIDE to LONG format Use the melt() function from the “reshape2” package

head(Chicken_wide)

## Chick Diet t_0 t_2 t_4 t_6 t_8 t_10 t_12 t_14 t_16 t_18 t_20 t_21 ## 1 1 1 39 35 NA NA NA NA NA NA NA NA NA NA ## 2 2 1 41 45 49 51 57 51 54 NA NA NA NA NA ## 3 3 1 41 49 56 64 68 68 67 68 NA NA NA NA ## 4 4 1 41 48 53 60 65 67 71 70 71 81 91 96 ## 5 5 1 42 51 59 68 85 96 90 92 93 100 100 98 ## 6 6 1 41 47 54 58 65 73 77 89 98 107 115 117



data_long <- melt(Chicken_wide, # id.variables = all the variables to keep w/o splitting id.vars=c("Chick", "Diet"), # measure.vars = the columns to put from wide into long format measure.vars=c(colnames(Chicken_wide)[3:14]), # Name of the destination column variable.name="Time", # column that the measurement came from value.name="weight") head(data_long)

## Chick Diet Time weight ## 1 1 1 t_0 39 ## 2 2 1 t_0 41 ## 3 3 1 t_0 41 ## 4 4 1 t_0 41 ## 5 5 1 t_0 42 ## 6 6 1 t_0 41

If you leave out measure.vars melt() will automatically use all the remaining variables as id.vars. The reverse is true if you leave out id.vars.

## Arithmetic operators



## ## ----------------------------- ## Operator Description ## ---------- ------------------ ## + addition ## ## - subtraction ## ## * multiplication ## ## / division ## ## ^ or ** exponentiation ## ## %% modulus ## ## %/% integer division ## -----------------------------

Examples with arithmetic operators

# divide column hp by column cyl mtcars$hp/mtcars$cyl

## [1] 18.33333 18.33333 23.25000 18.33333 21.87500 17.50000 30.62500 ## [8] 15.50000 23.75000 20.50000 20.50000 22.50000 22.50000 22.50000 ## [15] 25.62500 26.87500 28.75000 16.50000 13.00000 16.25000 24.25000 ## [22] 18.75000 18.75000 30.62500 21.87500 16.50000 22.75000 28.25000 ## [29] 33.00000 29.16667 41.87500 27.25000

# use the modulus to check which numbers in column gear are odd or even mtcars$gear%%2 # 0 is even, 1 is odd

## [1] 0 0 0 1 1 1 1 0 0 0 0 1 1 1 1 1 1 0 0 0 1 1 1 1 1 0 1 1 1 1 1 0

## Logical operators



## ## -------------------------------------- ## Operator Description ## ---------- --------------------------- ## < less than ## ## <= less than or equal to ## ## > greater than ## ## >= greater than or equal to ## ## == exactly equal to ## ## != not equal to ## ## ! not (something) ## ## | or ## ## & and ## ## isTRUE() test if something is true ## --------------------------------------

Examples with logical operators

# find which cars have 2 carburetors mtcars$carb == 2

## [1] FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE TRUE FALSE FALSE ## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE TRUE FALSE FALSE TRUE ## [23] TRUE FALSE TRUE FALSE TRUE TRUE FALSE FALSE FALSE TRUE

# find which cars DO NOT HAVE 2 carburetors mtcars$carb != 2



## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE ## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE ## [23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE

# or !mtcars$carb == 2

## [1] TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE FALSE TRUE TRUE ## [12] TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE TRUE TRUE FALSE ## [23] FALSE TRUE FALSE TRUE FALSE FALSE TRUE TRUE TRUE FALSE

# find cars that have at least 100 hp and less than 8 cylinders (mtcars$hp >= 100 & mtcars$cyl < 8)

## [1] TRUE TRUE FALSE TRUE FALSE TRUE FALSE FALSE FALSE TRUE TRUE ## [12] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE ## [23] FALSE FALSE FALSE FALSE FALSE TRUE FALSE TRUE FALSE TRUE

## Compute means by subject and/or condition Use function aggregate() in package stats

# load data 'ChickWeight', which contains Weight versus age of chicks on different diets data(ChickWeight) head(ChickWeight)

## weight Time Chick Diet ## 1 42 0 1 1 ## 2 51 2 1 1 ## 3 59 4 1 1 ## 4 64 6 1 1 ## 5 76 8 1 1 ## 6 93 10 1 1



Weight_by_Diet <- aggregate(ChickWeight$weight, by = list(Diet = ChickWeight$Diet), FUN = mean, na.rm = TRUE) colnames(Weight_by_Diet)[2] <- "mean weight" # let's add a nicer header for the 2nd column Weight_by_Diet

## Diet mean weight ## 1 1 102.6455 ## 2 2 122.6167 ## 3 3 142.9500 ## 4 4 135.2627

You could also aggregate by time instead

Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = mean, na.rm = TRUE) colnames(Weight_by_Time)[2] <- "mean weight" Weight_by_Time

## Time mean weight ## 1 0 41.06000 ## 2 2 49.22000 ## 3 4 59.95918 ## 4 6 74.30612 ## 5 8 91.24490 ## 6 10 107.83673 ## 7 12 129.24490 ## 8 14 143.81250 ## 9 16 168.08511 ## 10 18 190.19149 ## 11 20 209.71739 ## 12 21 218.68889

Or aggregate using a different function (e.g. SD)

Weight_by_Time <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time), FUN = sd, na.rm = TRUE) colnames(Weight_by_Time)[2] <- "SD time" Weight_by_Time



## Time SD time ## 1 0 1.132272 ## 2 2 3.688316 ## 3 4 4.495179 ## 4 6 9.012038 ## 5 8 16.239780 ## 6 10 23.987277 ## 7 12 34.119600 ## 8 14 38.300412 ## 9 16 46.904079 ## 10 18 57.394757 ## 11 20 66.511708 ## 12 21 71.510273

Or aggregate by 2 variables (Time & Diet)

Weight_by_Time_and_Diet <- aggregate(ChickWeight$weight, by = list(Time = ChickWeight$Time, Diet = ChickWeight$Diet), FUN = mean, na.rm = TRUE) colnames(Weight_by_Time_and_Diet)[3] <- "Mean Weight" some(Weight_by_Time_and_Diet)

## Time Diet Mean Weight ## 8 14 1 123.3889 ## 10 18 1 158.9412 ## 13 0 2 40.7000 ## 17 8 2 91.7000 ## 19 12 2 131.3000 ## 23 20 2 205.6000 ## 27 4 3 62.2000 ## 39 4 4 64.5000 ## 42 10 4 126.0000 ## 43 12 4 151.4000

Or use ddply() in package plyr , which also allows to simultaneously compute N, M, and SD



library(plyr) Weight_by_Time_and_Diet <- ddply(ChickWeight, .(Diet, Time), summarise, N = length(weight), M = round(mean(weight, na.rm = TRUE), 2), SD = round(sd(weight, na.rm = TRUE), 2)) head(Weight_by_Time_and_Diet)

## Diet Time N M SD ## 1 1 0 20 41.40 0.99 ## 2 1 2 20 47.25 4.28 ## 3 1 4 19 56.47 4.13 ## 4 1 6 19 66.79 7.76 ## 5 1 8 19 79.68 13.78 ## 6 1 10 19 93.05 22.54

tail(Weight_by_Time_and_Diet)

## Diet Time N M SD ## 43 4 12 10 151.40 15.28 ## 44 4 14 10 161.80 15.73 ## 45 4 16 10 182.00 25.30 ## 46 4 18 10 202.90 33.56 ## 47 4 20 9 233.89 37.57 ## 48 4 21 9 238.56 43.35

Plotting (with base graphics)Plotting (or data visualisation) is essential to 1) better understand your data 2) summarizing and conveying the results of your analyses

There are several sets of plotting functions in R. We will start with the “Base graphics” system, which offers quick but limited plotting possibilities.

The “ggplot2” system, which we’ll discuss further on, is a bit more complicated to use but provides much more options and possibilities.



When you create a plot in RStudio, it is automatically shown in the “Plots” pan on the lower right.

From there, you could export it to an image (e.g. jpg or png) or a .pdf using the GUI (later, we will see other ways how to do this)

See here (https://www.datacamp.com/courses/data-visualization-in-r) for a hands-on tutorial on plotting and data visualization in R

Scatter plot (2 variables)Scatterplots can easily be made using the plot() function. This is a generic function, meaning that it can adapt to different data types. With it, onecan create scatterplots, dot plots, line plots, etc.

Let’s do a simple scatterplot with the 2 variables “hp” (horsepower) and “mpg” (miles per gallon) from the dataset mtcars

plot(mtcars$hp, mtcars$mpg)

https://www.datacamp.com/courses/data-visualization-in-r



As expected, there seems to be a negative correlation between horsepower and the number of miles you can do with one gallon of gas, i.e. thestronger the motor, the higher the gas consumption and thus the lower the miles per gallon.

Now let’s modify the axis labels and add a title

plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power", ylab = "Miles x Gallon", main = "Horse Power by Miles x Gallon in mtcars")



Notice that you could plot the same by using R’s formula interface: plot(mpg ~ hp, data = mtcars) , in this case the y-axis is defined first.

To see more options of the plot() function type ?graphics::plot and Enter in the console

The par callSets graphics parameters (fonts, colors, axes, titles), and keeps them until the next par call.

Useful for example to put multiple plots in the same figure.



# create a plot array with 1 row and 2 columns par(mfrow = c(1, 2)) plot(mtcars$hp, mtcars$mpg, xlab = "Horse Power", ylab = "Miles x Gallon", main = "plot 1") plot(mtcars$cyl, mtcars$carb, xlab = "cylinders", ylab = "carburetors", main = "plot 2")

But there are many more parameters. Check the current ones with par() See this link (https://www.statmethods.net/advgraphs/parameters.html) for a description.

CAREFUL: you cannot use par to combine plots made with ggplot2 (see below).

Scatter plot (more variables)In order to explore the data, one could also use scatterplots comparing more than 2 variables.

Let’s create a plot array with the first 4 variables in the mtcars dataset:

https://www.statmethods.net/advgraphs/parameters.html



plot(mtcars[, 1:4])

Line plotCould be done with the plot() function, putting the “type” argument to “l”

# let's order mtcars by nunber of cylinders mtcars_sorted <- mtcars[order(mtcars$cyl), ] plot(mtcars_sorted$hp, type = "l", xlab = "cars ordered by cyl", ylab = "horsepower", main = "this graphs doesn't make much sense")



# possible types are 'p' for points, 'l' for lines, 'b' for both, etc. -> see ?plot()

Or do first a scatterplot with plot() and then add the lines with lines()

plot(mtcars_sorted$hp) lines(mtcars_sorted$hp, type = "l")



Sunflower plotSunflowerplots are useful when the same data combination gets repeated many times.

For each repetition, a petal is added: * a black dot without petals = 1 single data point * 2 red petals = 2 identical data points * 3 red petals = 3 identical data points, etc.

sunflowerplot(mtcars$cyl, mtcars$carb, xlab = "number of cylinders", ylab = "number of carburetors", main = "Sunflowerplot of cyl by carb")



Box plotUseful to see the distribution of a numerical variable changes over the levels of a categorical variable, or another numerical variable that has only afew values.

The box of the boxplot includes: 1st quartile, median, and 3rd quartile.

By default when using boxplot() the whiskers extend to the most extreme data point which is no more than 1.5 times the interquartile rangefrom the box. Everything beyond is an outlier (marked with points).

The whiskers can however be changed, e.g. range = 0 causes the whiskers to extend to the data extremes.



Let’s see gas consumption by number of cylinders

boxplot(mtcars$mpg ~ mtcars$cyl, xlab = "cylinders", ylab = "miles per gallon")

# now extend the whiskers to the data extremes (no more outliers) boxplot(mtcars$mpg ~ mtcars$cyl, range = 0, xlab = "cylinders", ylab = "miles per gallon")



Mosaic plotLike a scatterplot between 2 categorical variables, or between numerical values with only a few values.

A line is shown for when a category is absent.

mosaicplot(cyl ~ carb, data = mtcars)



Pie chartsThese look cool but are not recommended. Even the help ( ?pie ) for this function thinks so: “Pie charts are a very bad way of displaying information. The eye is good at judging linear measures and bad at judging relative areas. A bar chartor dot chart is a preferable way of displaying this type of data.”



# first let's get a count of the number of cars per number of carburetors tbl <- table(mtcars$carb) # then let's make a pie chart out of it pie(tbl) title("amount of cars with 1,2,3,4,6, or 8 carburetors")

BarplotNow we show the same data as a bar plot instead. Remember that ideally you’d also add errorbars - this is just a simple example!



tbl <- table(mtcars$carb) # then let's make a bar chart out of it barplot(tbl, xlab = "number of carburetors", ylab = "number of cars") title("amount of cars by carburetors")

This is actually a histogram!

Histogram



Shows the frequencies of occurrences for each level of a given variable. Useful also to quickly see the distribution of the data (is it normal?). Let’s again plot the number of cars that have a specific amount of carburetors, but this time with the function hist()

# ATTENTION: if you do not want to get the counts of cars with 1 or 2 carburetors to be summed together, you need to # change the breaks!! hist(mtcars$carb, breaks = seq(0.5, 8.5, 1), freq = TRUE)

QQ plot



A common assumption necessary for many statistical analyses is that the data is approximately normally distributed. There are several ways of testing this. One of them is the qqPlot (quantile-quantile plot).

For the qqplot we sort our data, plot it against a specially-designed x-axis based on our reference distribution (e.g., the Gaussian “bell curve”), andlook to see whether the points lie approximately on a straight line.

If the data is normally distributed, the points will align on the diagonal of the qqplot

The variable “weight” in the “ChickWeight” dataset does not follow a normal distribution

# load the 'car'' package to use the function qqPlot() par(mfrow = c(1, 2)) hist(ChickWeight$weight) qqPlot(ChickWeight$weight)

## [1] 400 399



But if we select only measures at Time == 16, then the assumption of normality seems more reasonable

index16 = which(ChickWeight$Time == 16) weights = ChickWeight[index16, "weight"] par(mfrow = c(1, 2)) hist(weights, breaks = seq(1, 320, 20)) qqPlot(weights)

## [1] 32 18

And here an example of data that is definitely not normally distributed

library(MASS)



## ## Attaching package: 'MASS'

## The following object is masked from 'package:plotly': ## ## select

par(mfrow = c(1, 2)) hist(Boston$tax) qqPlot(Boston$tax)

## [1] 489 490



Plotting with ggplot2ggplot2 was written by Hadley Wickham, a kiwi statistician, professor, and chief scientist at RStudio (check out his website (http://hadley.nz))

It is “a plotting system for R, based on the grammar of graphics, which tries to take the good parts of base and lattice graphics and none of the badparts. It takes care of many of the fiddly details that make plotting a hassle (like drawing legends) as well as providing a powerful model of graphicsthat makes it easy to produce complex multi-layered graphics.” (http://ggplot2.org (http://ggplot2.org))

Plotting functions in base R can still be useful to quickly explore the data. But if you want anything beyond those simple graphs, it’s best to switch to ggplot2.

“ggplot2 provides a unified interface and set of options, instead of the grab bag of modifiers and special cases required in base graphics. Once youlearn how ggplot2 works, you can use that knowledge for everything from scatter plots and histograms to violin plots and maps.” (R graphicscookbook by Winston Chang, p.7)

“In R’s base graphics functions, each mapping of data properties to visual properties is itw own special case, and changing the mappings mayrequire restructuring your data, issuing completely different graphics commands, or both.” (R graphics cookbook by Winston Chang, p.374)

CAREFUL: you cannot use par to combine plots made with ggplot2 (see below). Use the cowplot or gridExtra packages instead.

DISCLAIMER: I have copied a lot of examples from the R graphics cookbook by Winston Chang!

Bar and line plots using base graphics# Create a 2x3 data frame mydata <- matrix(c(10, 7, 12, 9, 11, 6), nrow = 2, ncol = 3, byrow = TRUE, dimnames = list(c("B1", "B2"), c("A1", "A2", "A3"))) mydata

## A1 A2 A3 ## B1 10 7 12 ## B2 9 11 6

http://hadley.nz/

http://ggplot2.org/



Make a barplot with the data using function barplot() with As along x axis, and bars grouped by Bs

graphics::barplot(mydata, beside = TRUE) # beside=TRUE draws separate bars for each data point, otherwise B1 and B2 overlap with diff colors

To turn the barplot around (Bs along the x axis and bars grouped by As), you first need to restructure the data (e.g. using the function t() fortranspose )

t(mydata)



## B1 B2 ## A1 10 9 ## A2 7 11 ## A3 12 6

Then you can do another barplot:

barplot(t(mydata), beside = TRUE)



So far so good, transposing the data wasn’t that bad, one could say. But things get more complicated if you want to make a lineplot. To do this with base graphics you need to draw the lines for B1 and B2 in separate commands.

plot(mydata[1, ], type = "l") lines(mydata[2, ], type = "l", col = "blue")

And the result is not very convincing:

the lowest point on the blue line (B2) is not visible, because the range of the y-axis was set for the black line (B1)the x-axis is numbered (what are 1.5 and 2.5 doing there?)



Bar and line plots with ggplot2For ggplot2 the data must be in “long” format, and you never need to change it. So first we put mydata in long format using melt() from the reshape2 package:

library(reshape2) mydata_long <- melt(mydata, value.name = "value") colnames(mydata_long)[1:2] <- c("theBs", "theAs") mydata_long

## theBs theAs value ## 1 B1 A1 10 ## 2 B2 A1 9 ## 3 B1 A2 7 ## 4 B2 A2 11 ## 5 B1 A3 12 ## 6 B2 A3 6

Now comes the barplot using ggplot() , and it will look better right away (e.g. notice the axis labels and the legend):

ggplot(mydata_long, aes(x = theAs, y = value, fill = theBs)) + geom_bar(stat = "identity", position = "dodge")



Turning it around and putting the Bs on the x-axis is simple:

ggplot(mydata_long, aes(x = theBs, y = value, fill = theAs)) + geom_bar(stat = "identity", position = "dodge")



If you wanted a line plot instead:

ggplot(mydata_long, aes(x = theBs, y = value, colour = theAs, group = theAs)) + geom_line()



Or this way?

ggplot(mydata_long, aes(x = theAs, y = value, colour = theBs, group = theBs)) + geom_line()



ggplot2’s terminologyDATA: must be data frames in long format, with each variable in a different columnGEOMS: geometric objects, e.g. bars, lines, plotsAESTHETICS: visual properties of geoms, e.g. position of axes, line color, point shapes, etc.SCALES: control how the data is converted in the elements of the plot (the mapping), e.g. greater values of a continuous y-axis appearhigher upGUIDES: e.g. tick marks and axis labels

Many of the elements of ggplots are combined by joining them with the plus sign (+) If you write the ggplot command over several lines, be careful to always put + at the end of the line!



Scatter plots with ggplot2Let’s create some new data for a scatterplot

mydata <- data.frame(xval = 1:4, yval = c(3, 5, 6, 9), group = c("A", "B", "A", "B")) mydata

## xval yval group ## 1 1 3 A ## 2 2 5 B ## 3 3 6 A ## 4 4 9 B

Now we make a simple scatterplot with ggplot() , but before plotting it we save it as “myplot”

myplot <- ggplot(mydata, aes(x = xval, y = yval)) + geom_point() myplot # then plot it by calling its name



You can change or add more elements to the saved plot using + For example let’s plot groups A and B in different colors:

myplot + geom_point(aes(colour = group))



Or you can change the range of the x-axis:

myplot + scale_x_continuous(limits = c(-5, 10))



Change the colors:

myplot + geom_point(aes(colour = group)) + scale_colour_manual(values = c("orange", "forestgreen"))



Scatter plots with plot() vs. ggplot()plot(mtcars$wt, mtcars$mpg)



library(ggplot2) ggplot(mtcars, aes(x = wt, y = mpg)) + geom_point() + theme(panel.background = element_blank()) + # to remove the gray background theme(axis.line = element_line(colour = "black")) # otherwise the axes are nolonger visible



Bar plot with errorbars# let's create some data mydata <- data.frame(group = c("A", "A", "A", "A", "A", "A", "B", "B", "B", "B", "B", "B"), Time = as.factor(c(1, 2, 1, 2, 1, 2, 1, 2, 1, 2, 1, 2)), yval = c(11, 3, 5, 6, 9, 1, 2, 3, 4, 5, 1, 0)) mydata



## group Time yval ## 1 A 1 11 ## 2 A 2 3 ## 3 A 1 5 ## 4 A 2 6 ## 5 A 1 9 ## 6 A 2 1 ## 7 B 1 2 ## 8 B 2 3 ## 9 B 1 4 ## 10 B 2 5 ## 11 B 1 1 ## 12 B 2 0

There are (at least) 2 ways to add errorbars representing standard error of the mean (SE). Here we will discuss both and put the resulting plots next to each other (using par() ).

The 1st way is to compute the SEs yourself (SD/sqrt(N)), and then plot them with geom_errorbar

mydata_summary <- ddply(mydata, c("group", "Time"), summarise, M = mean(yval, na.rm = TRUE), SD = sd(yval, na.rm = TRUE), N = sum(!is.na(yval)), SE = SD/sqrt(N)) mydata_summary # look at the computed parameters

## group Time M SD N SE ## 1 A 1 8.333333 3.055050 3 1.7638342 ## 2 A 2 3.333333 2.516611 3 1.4529663 ## 3 B 1 2.333333 1.527525 3 0.8819171 ## 4 B 2 2.666667 2.516611 3 1.4529663



library(ggplot2) # load ggplot2, if necessary errorbars1 <- ggplot(mydata_summary, aes(x = Time, y = M, fill = group)) + geom_bar(stat = "identity", position = position_dodge(0.5), width = 0.5) + geom_errorbar(aes(ymin = M - SE, ymax = M + SE), position = position_dodge(0.5), width = 0.2) + ggtitle("SEs by hand") # position_dodge defines the spacing between bars (per default it is the same as the bar width, i.e. 0.9) # position='dodge' is the short form of position=position_dodge()

The 2nd (easier) way is to have ggplot do it for you using fun.data = mean_se :

errorbars2 <- ggplot(mydata, aes(x = Time, y = yval, fill = group)) + stat_summary(fun.y = "mean", geom = "bar", position = position_dodge(0.5), width = 0.5) + # the height of the bars represents the mean per Time and group stat_summary(fun.data = mean_se, geom = "errorbar", position = position_dodge(0.5), width = 0.2) + # the errorbars show the SE ggtitle("Automatic SEs")

We use the package cowplot (see here (https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html)) to arrange the 2 plots next toeach other, and label them with A and B. Using this package also makes the plots a bit nicer (white background, etc.)

####### if necessary install and/or load cowplot: install.packages('cowplot') library(cowplot) plot_grid(errorbars1, errorbars2, labels = c("A", "B"))

https://cran.r-project.org/web/packages/cowplot/vignettes/introduction.html



Reassuringly, the 2 graphs look exactly the same.

Box plot with data points (grouped by 1 factor)# ATTENTION: for the x-axis you need a factor, or a vector of characters ggplot(mtcars, aes(as.factor(cyl), mpg)) + geom_boxplot() + geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible labs(x = "N of cylinders")



Box plot with data points (grouped by 2 factors)# plot ggplot(mtcars, aes(x = as.factor(cyl), y = mpg, fill = as.factor(vs))) + geom_boxplot() + geom_point(alpha = 0.5, position = position_jitterdodge(jitter.width = 0.2), aes(group = as.factor(vs)))



#### Violin plot The better boxplot, as it also shows the distribution of the data (a kernel density plot).

# ATTENTION: for the x-axis you need a factor, or a vector of characters ggplot(mtcars, aes(as.factor(cyl), mpg)) + geom_violin() + geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible labs(x = "N of cylinders")



Violin and box plots combinedWe can overlay box plots and violin plots made with ggplot2. For that, we need to specify their positions using te function position_dodge(https://ggplot2.tidyverse.org/reference/position_dodge.html)

ggplot(mtcars, aes(as.factor(cyl), mpg)) + geom_violin(position = position_dodge(width = .04)) + geom_boxplot(width = .05, position = position_dodge(width = .04)) + geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible labs(x = "N of cylinders")

https://ggplot2.tidyverse.org/reference/position_dodge.html



Pirate plotsThe function pirateplot from the yarrr package allows to easily combine box plots, violin plots, and the representation of every single datapoint. There are many built-in color palettes you can choose from

library("yarrr")

pirateplot(formula = mpg ~ cyl, data = mtcars, pal = "southpark", main = "mpg by cyl in mtcars", ylim = c(5, 45))



You can have up to 3 different categorigal IVs for pirate plots

pirateplot(formula = mpg ~ cyl + gear, data = mtcars, pal = "southpark", main = "mpg by cyl and by gears", ylim = c(5, 45))



But, as far as I know, you cannot assign a pirateplot to an object, as you can do with ggplot2. This means that it is difficult to arrange several pirateplots in the same figure (see below).

Arranging several plots in a figureIf you plot with the base graphics package, you can put several plots in 1 figure using the par function But this does not work with graphs madewith ggplot2 , for which one can use the cowplot or gridExtra packages instead.



library(cowplot) plot1 <- ggplot(mtcars, aes(cyl, mpg)) + geom_bar(stat="summary", fun.y = "mean") + labs(x = "N of cylinders") + labs(y = "Average mpg") plot2 <- ggplot(mtcars, aes(as.factor(cyl), mpg)) + geom_violin(position = position_dodge(width = .04)) + geom_boxplot(width = .05, position = position_dodge(width = .04)) + geom_jitter(width = 0.2) + # we can move each data point a bit to the right or left, to make it visible labs(x = "N of cylinders") # let's put plot1 and plot2 together, but in positions 1 and 4 of a 4-plot scenario plot_grid(plot1, NULL, NULL, plot2, labels = c("A", "B", "C", "D"), ncol = 2)

Saving your plots and graphs



You can save your graphs by hand using the Export button on the right-hand Plots tab. Or you define in your script how and where to save a graph (see link (https://www.stat.berkeley.edu/classes/s133/saving.html)).

Here, I save to .pdf, by first defining the name (with the path!), calling pdf() before the plot, and dev.off() after the plot.

Beware that if you save directly to a file, you will not see the plot being drawn in R!

SaveDir <- "YOUR_PATH"

library(cowplot) name <- paste(SaveDir, "/my_amazing_plot.pdf", sep = "") pdf(name, width = 15, height = 8) # alternatively used jpeg() or png() plot_grid(errorbars1, errorbars2, labels = c("A", "B")) dev.off()

Flow controlfor() loopsWhen you need to do the same operation (e.g. importing a subject’s data) many times, you might want to use a for loop.

For loops are part of “flow” commands, and execute according to a counter.

For example in this for loop, we start with the number 1, and then add +1 to it for 5 times. We display number at each iteration

number <- 1 for (counter in 1:5) { number <- number + 1 print(number) }

https://www.stat.berkeley.edu/classes/s133/saving.html



## [1] 2 ## [1] 3 ## [1] 4 ## [1] 5 ## [1] 6

For loops can also be nested. for example, here we first add +1 to a number, and then write the word “Pizza” from once to “number” times next toeach other

number <- 1 for (counter in 1:5) { number <- number + 1 print(number) for (counter2 in 1:number) { print(rep("Pizza", counter2)) } }



## [1] 2 ## [1] "Pizza" ## [1] "Pizza" "Pizza" ## [1] 3 ## [1] "Pizza" ## [1] "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" ## [1] 4 ## [1] "Pizza" ## [1] "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" ## [1] 5 ## [1] "Pizza" ## [1] "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza" ## [1] 6 ## [1] "Pizza" ## [1] "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza" ## [1] "Pizza" "Pizza" "Pizza" "Pizza" "Pizza" "Pizza"

if… statementWith if statements, we can take decisions, and act accordingly. For example, we could ask if a condition is met (e.g. a number is odd or even), andonly if this is TRUE, another statement gets executed

Here we print “Tiramisu” only if the number is odd, otherwise nothing happens

number <- 5 if ((number%%2) == 1) { print("Tiramisu") }



## [1] "Tiramisu"

if…elseSimilar to if(), but this time something happens also when the statement is NOT TRUE

e.g. print “Tiramisu” for the number 2, and “Mousse au chocolat” for all other numbers

number <- 5 if (number == 2) { print("Tiramisu") } else { print("Mousse au chocolat") }

## [1] "Mousse au chocolat"

if…else ifSimilar to if…else, but more specific.

e.g. print “One” for the number 1, “Two” for the number 2, and “and so forth” for all other numbers

number <- 5 if (number == 1) { print("One") } else if (number == 2) { print("Two") } else { print("and so forth") }

## [1] "and so forth"

ifelse()



Sort of shorthand to if…else statement

Tests for each element of a vector (e.g. a column of a dataframe) if a statement is true or not, and acts accordingly to instructions.

Useful for example to changethe names of the levels of a factor (e.g. imagine you have an IV “Time” with levels 1, 2, and 3; you change them tot_1, t_2, t_3).

The syntax is ifelse(test_expression, x, y)

# check if these numbers are bigger than five. if yes write you are so big, otherwise write you are very tiny numbers <- c(3, 9, 4, 8, 1, 20) ifelse(numbers > 5, "you are so biiig", "you are very tiny")

## [1] "you are very tiny" "you are so biiig" "you are very tiny" ## [4] "you are so biiig" "you are very tiny" "you are so biiig"

# in a data frame, replace small numbes (<5) with 'too small', and big numbers (>5) with 'too large' numbers <- as.data.frame(c(3, 9, 4, 8, 1, 20)) colnames(numbers) <- "number" numbers$number <- ifelse(numbers$number > 5, "too small", "too large")

ifelse() statements can be nested

# in a data frame, replace 'to do R' with 'Yuppee', 'not to do R' with 'Baaah', and 'maybe do R' with 'there is hope' data <- as.data.frame(c("to do R", "not to do R", "maybe do R", "to do R", "maybe do R", "to do R")) colnames(data) <- "motivation" data$motivation <- ifelse(data$motivation == "to do R", "Yuppee", ifelse(data$motivation == "not to do R", "Baaah", "there is hope"))

Create new variable names in loopSometimes you want to create a new variable name for each iteration of a for loop.

For example, to save a plot, which will differ for each iteration, in its own variable.

This will later allow you to manipulate and plot in the same figure, etc.

Use the assign() variable



library(ggplot2) for (iteration in c(4, 6, 8)) { # select cars with that many cylinders temp <- as.data.frame(mtcars[mtcars$cyl == iteration, "mpg"]) colnames(temp) <- "mpg" # create plot myplot <- ggplot(temp, aes(y = mpg)) + geom_boxplot() + ggtitle(paste(iteration, "cylinders")) # create a variable name and save the plot in it var_name <- paste("plot_", iteration, sep = "") assign(var_name, myplot) } cowplot::plot_grid(plot_4, plot_6, plot_8, ncol = 3)



Create a functionCreate a command, save it as a function, and execute it when you want.

Create the function “desert”, which takes a number, and writes “Tiramisu” if the input is 2, or “Lasagne” for all other numbers



desert <- function(number) { if (number == 2) { print("Tiramisu") } else { print("Lasagne") } }

Then use the function by feeding it with different numbers and reading the output

desert(1)

## [1] "Lasagne"

desert(2)

## [1] "Tiramisu"

you can also write the function in an .R script, which you save with the same name (desert.R)

Then, in order to use desert() , source it in R, i.e. tell R the directory you saved it in:

# e.g. if I saved the script as 'desert.R' in the folder 'Z:/Myfolder' source("Z:/Myfolder/desert.R")

desert(4/2)

## [1] "Tiramisu"

Take input from userYou can wait for the user to write some text or numbers, and then use that in further commands

for example, these 2 lines of code ask you what you want to eat, and then display your text input



text <- readline(prompt = "Hello! What do you want to eat today?: ") print(text)

The input could also go straight into a function.

Here, answer y for yes, and n for no, to the question “Do you like R?” What is the answer you got?

R_question <- function() { text <- readline(prompt = "Do you like R? (y/n): ") if (text == "y") { return("Yeah you rule!") } else if (text == "n") { return("Baaaaah!!") } } # run the function R_question()

Of course we could write the R_question() function in a script, save it as R_question.R, tell R where it is (using source() ), and then use it also infuture R sessions, without the need of recreating the function first.

Statistics with RFor help on which statistical test to use, and how to implement it in R (or SAS, Stata, or SPSS), have a look at this link(https://stats.idre.ucla.edu/other/mult-pkg/whatstat/) But let’s start with descriptive statistics.

# on average, how many miles with 1 gallon can you ride with the cars in the mtcars dataset? mean(mtcars$mpg)

## [1] 20.09062

# what is the maximum of miles you can do with 1 gallon? max(mtcars$mpg)

https://stats.idre.ucla.edu/other/mult-pkg/whatstat/



## [1] 33.9

# and the minimum? min(mtcars$mpg)

## [1] 10.4

# how many car models with 4, 6, or 8 cylinders? table(mtcars$cyl)

## ## 4 6 8 ## 11 7 14

Descriptives (ddply)With the package ddply you can easily obtain a lot of descriptive statistics grouped in a neat and convenient way. For example, we could want toknow how many cars with different numbers of cylinders exist in the mtcars dataset, what the percentage of cars with 4, 6, or 8 cylinders is, andwhat their minimum, maximum, and average miles per gallon value is.

means <- ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), percentage_cars = round((100/nrow(mtcars) * N_cars)), mpg_min = min(mpg), mpg_max = max(mpg), mpg_mean = round(mean(mpg))) means

## cyl N_cars percentage_cars mpg_min mpg_max mpg_mean ## 1 4 11 34 21.4 33.9 27 ## 2 6 7 22 17.8 21.4 20 ## 3 8 14 44 10.4 19.2 15

Correlation



A correlation measures the strength and the direction of an association between 2 variables. The strength varies between +1 and -1, where 1 indicates a perfect degree of association and 0 no association at all. The direction can be positive(+) or negative (-).

Common measures of correlations are called: Pearson and Spearman.

For the Pearson’s r correlation, both variables should be normally distributed.

Spearman’s rho instead is a non-parametric rank correlation (data is ranked), which does NOT require the data to be normally distributed.

r and rho are also effect sizes (they should therefore be writte in italic).

According to Cohen (https://en.wikipedia.org/wiki/Effect_size#Pearson_r_or_correlation_coefficient), Pearsons’s r correlation coefficients can beroughly classified like this (the same categorization applies to Spearman’s rho):

.10< x <.30 -> small association

.30< x <.5 -> medium associationand >.50 -> large association

The square of r (r2) determines how much variance is shared by the 2 correlated variables

To correlate several variables with each other, use cor() from the stats package

# here we correlate the last 5 columns of the mtcars dataset with each other cor(mtcars[, 4:8], use = "complete.obs", method = "spearman")

## hp drat wt qsec vs ## hp 1.0000000 -0.52012499 0.7746767 -0.66660602 -0.7515934 ## drat -0.5201250 1.00000000 -0.7503904 0.09186863 0.4474575 ## wt 0.7746767 -0.75039041 1.0000000 -0.22540120 -0.5870162 ## qsec -0.6666060 0.09186863 -0.2254012 1.00000000 0.7915715 ## vs -0.7515934 0.44745745 -0.5870162 0.79157148 1.0000000

# with 'complete.obs' missing values are handled by casewise deletion

To see the p value use cor.test() (but only works for 1 coefficient)

https://en.wikipedia.org/wiki/Effect_size#Pearson_r_or_correlation_coefficient



cor.test(mtcars[, 4], mtcars[, 5], use = "complete.obs", method = "spearman")

## ## Spearman's rank correlation rho ## ## data: mtcars[, 4] and mtcars[, 5] ## S = 8293.8, p-value = 0.002278 ## alternative hypothesis: true rho is not equal to 0 ## sample estimates: ## rho ## -0.520125

Even better is function rcorr() from the Hmisc package. It gives correlation coefficients and significance levels together.

library(Hmisc) rcorr(as.matrix(mtcars[, 4:8]), type = "spearman") # Input must be a matrix!

## hp drat wt qsec vs ## hp 1.00 -0.52 0.77 -0.67 -0.75 ## drat -0.52 1.00 -0.75 0.09 0.45 ## wt 0.77 -0.75 1.00 -0.23 -0.59 ## qsec -0.67 0.09 -0.23 1.00 0.79 ## vs -0.75 0.45 -0.59 0.79 1.00 ## ## n= 32 ## ## ## P ## hp drat wt qsec vs ## hp 0.0023 0.0000 0.0000 0.0000 ## drat 0.0023 0.0000 0.6170 0.0102 ## wt 0.0000 0.0000 0.2148 0.0004 ## qsec 0.0000 0.6170 0.2148 0.0000 ## vs 0.0000 0.0102 0.0004 0.0000



Corrplots and heatmapsif you quickly wanted to convey to the reader the strenght of correlations between several variables, you should use plots.

We have discussed before the scatterplot. But a fancier, and sometimes more useful, way of showing correlation strenghts is a corrplot

Use the function corrplot() from the package with the same name. The size and color of the circles indicate the strength and direction of thecorrelation. check out the many options.

# install.packages('corrplot') library(corrplot) mycorrs <- cor(mtcars[, 4:8], use = "complete.obs", method = "spearman") corrplot(mycorrs)



Othertimes so-called heat maps are useful to quickly spot clusters of associations amongst many variables.

For a famous example, check out the paper “Matching Categorical Object Representations in Inferior Temporal Cortex of Man and Monkey(https://www.cell.com/neuron/fulltext/S0896-6273(08)00943-4)”, published in Neuron by Kriegeskorte and colleagues (2008)

https://www.cell.com/neuron/fulltext/S0896-6273(08)00943-4



Now let’s do our own (slightly less fancy) heat map with the mtcars dataset

mycorrs <- cor(mtcars[, 4:8], use = "complete.obs", method = "spearman") heatmap(mycorrs)



Student’s t-testThe t-test allows us to test

if the average of a variable differs significantly from a specific number (One-sample t-test)if two groups of subjects differ from each other (Independent-samples t-test)if two measures, which were taken on the same group of subjects, differ from each other (Paired-samples t-test)

It is important to remember, that with a t-test you can only compare a max of 2 groups, or 2 conditions. Use ANOVA or linear mixed models if youhave more groups and/or conditions.



By default, the function t.test() assumes unequal variances and applies the Welsh degrees of freedom (df) modification. To specify equalvariances, use the var.equal = TRUE option.

Also by default, a 2-tailed test is performed. Specify alternative="less" or alternative="more" to run a 1-tailed test.

One-sample t-testFor testing if the mean of a sample differs significantly from 0 (or any other number).

For example, we could test if the average horse power of all cars in the dataset mtcars differs from 0 (which it does), and from 150 (which itdoesn’t)

t.test(mtcars$hp, mu = 0)

## ## One Sample t-test ## ## data: mtcars$hp ## t = 12.103, df = 31, p-value = 2.794e-13 ## alternative hypothesis: true mean is not equal to 0 ## 95 percent confidence interval: ## 121.9679 171.4071 ## sample estimates: ## mean of x ## 146.6875

t.test(mtcars$hp, mu = 150)



## ## One Sample t-test ## ## data: mtcars$hp ## t = -0.2733, df = 31, p-value = 0.7864 ## alternative hypothesis: true mean is not equal to 150 ## 95 percent confidence interval: ## 121.9679 171.4071 ## sample estimates: ## mean of x ## 146.6875

Independent t-testFor testing the difference between 2 unrelated samples (i.e. different subjects in two groups or conditions - imagine comparing men withwomen, or drug 1 with drug 2)

Attention: The specification of the formula depends depending on whether your data is in long or wide format

In long format we specify a numeric and a binary factor (separated by “~”):

# select only chicken that followed diets 1 or 2 (we can only compare 2 groups with t-test): ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ] # average across chicken by time and diet ChickWeight_temp_long <- aggregate(ChickWeight_temp, by = list(ChickWeight_temp$Time, ChickWeight_temp$Diet), FUN = mean, na.rm = TRUE) # make a bit nicer ChickWeight_temp_long <- ChickWeight_temp_long[, c(1:3)] colnames(ChickWeight_temp_long)[1:2] <- c("Time", "Diet") # now test if weight differs between diet 1 and 2 t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet))



## ## Welch Two Sample t-test ## ## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet) ## t = -0.74786, df = 20.988, p-value = 0.4628 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -63.98631 30.13885 ## sample estimates: ## mean in group 1 mean in group 2 ## 105.6929 122.6167

Assume equal variances (this changes the dfs, and slightly the p value):

t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE)

## ## Two Sample t-test ## ## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet) ## t = -0.74786, df = 22, p-value = 0.4625 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -63.85473 30.00727 ## sample estimates: ## mean in group 1 mean in group 2 ## 105.6929 122.6167

Change to 1-tailed t-test (modifies the p value)

t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative = "less")



## ## Two Sample t-test ## ## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet) ## t = -0.74786, df = 22, p-value = 0.2312 ## alternative hypothesis: true difference in means is less than 0 ## 95 percent confidence interval: ## -Inf 21.93463 ## sample estimates: ## mean in group 1 mean in group 2 ## 105.6929 122.6167

t.test(ChickWeight_temp_long$weight ~ as.factor(ChickWeight_temp_long$Diet), var.equal = TRUE, alternative = "greater")

## ## Two Sample t-test ## ## data: ChickWeight_temp_long$weight by as.factor(ChickWeight_temp_long$Diet) ## t = -0.74786, df = 22, p-value = 0.7688 ## alternative hypothesis: true difference in means is greater than 0 ## 95 percent confidence interval: ## -55.78209 Inf ## sample estimates: ## mean in group 1 mean in group 2 ## 105.6929 122.6167

Now let’s put the same data in wide format, to compare 2 numeric factors (separated by “,”):

library(reshape2) ChickWeight_temp_wide <- dcast(ChickWeight_temp_long, Time ~ Diet, value.var = "weight") colnames(ChickWeight_temp_wide)[2:3] <- c("Diet1", "Diet2") t.test(ChickWeight_temp_wide$Diet1, ChickWeight_temp_wide$Diet2)



## ## Welch Two Sample t-test ## ## data: ChickWeight_temp_wide$Diet1 and ChickWeight_temp_wide$Diet2 ## t = -0.74786, df = 20.988, p-value = 0.4628 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -63.98631 30.13885 ## sample estimates: ## mean of x mean of y ## 105.6929 122.6167

It is important to point out that you get the same t and p values and dfs using the data in long or wide format!

Dependent (paired-sample) t-testFor testing difference between two related samples (i.e. same people tested twice)

We could use the same ChickenWeight data, but this time compare identical chicken across 2 time points.

IMPORTANT: add the option PAIRED

In long format we specify a numeric and a binary factor (separated by “~”):

# select only Time 1 and 2: ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ] t.test(ChickWeight_temp$weight ~ as.factor(ChickWeight_temp$Time), paired = TRUE)

## ## Paired t-test ## ## data: ChickWeight_temp$weight by as.factor(ChickWeight_temp$Time) ## t = -15.907, df = 49, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -9.190877 -7.129123 ## sample estimates: ## mean of the differences ## -8.16



In the wide format, we specify 2 numeric factors (separated by “,”)

library(reshape2) ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ] ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight") colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2") t.test(ChickWeight_temp_wide$time0, ChickWeight_temp_wide$time2, paired = TRUE)

## ## Paired t-test ## ## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2 ## t = -15.907, df = 49, p-value < 2.2e-16 ## alternative hypothesis: true difference in means is not equal to 0 ## 95 percent confidence interval: ## -9.190877 -7.129123 ## sample estimates: ## mean of the differences ## -8.16

Mann-Whitney U & Wilcoxon signed ranks testsOne of the assumptions of the t-test, is that the data are normally distributed. However, even in cases when the data is NOT normally distributed,one can use the t-test as long as the sample size is reasonably large: N at least >10, better >100. Thus, the t-test is quite robust for violations ofthe normality assumption, as long as the sample is not too small.

An alternative to the t-test exists , which does not require normal distribution of the data:

The non-parametric Mann-Whitney U test (also called Mann-Whitney-Wilcoxon, Wilcoxon rank-sum, or Wilcoxon-Mann-Whitney test) can be usedas alternative to the independent t-test.

Itis calculated in 2 different ways, depending on the sample size (<20 or >20). In any case, it involves some ranking of the values of both groups,and does not assume normal distribution of the data.

The Wilcoxon signed ranks test can be used as an alternative to the dependent t-test

Implementation of these nonparametric tests is easy, and the formula almost identical to the t-test one.

Here an example of the Mann-Whitney U test:



# select again only chicken that followed diets 1 or 2 (we can only compare 2 groups with t-test or Mann-Whitney): ChickWeight_temp <- ChickWeight[ChickWeight$Diet == 1 | ChickWeight$Diet == 2, ] wilcox.test(ChickWeight_temp[, 1], ChickWeight_temp[, 2])

## ## Wilcoxon rank sum test with continuity correction ## ## data: ChickWeight_temp[, 1] and ChickWeight_temp[, 2] ## W = 115600, p-value < 2.2e-16 ## alternative hypothesis: true location shift is not equal to 0

And here an example of the Wilcoxon signed ranks test:

# select only Time 1 and 2 (applied to all chicken): ChickWeight_temp <- ChickWeight[ChickWeight$Time <= 2, ] # put Time in wide format library(reshape2) ChickWeight_temp_wide <- dcast(ChickWeight_temp, Chick + Diet ~ Time, value.var = "weight") colnames(ChickWeight_temp_wide)[3:4] <- c("time0", "time2") # perform statistics: wilcox.test(ChickWeight_temp_wide$time0, ChickWeight_temp_wide$time2, paired = TRUE)

## ## Wilcoxon signed rank test with continuity correction ## ## data: ChickWeight_temp_wide$time0 and ChickWeight_temp_wide$time2 ## V = 8, p-value = 1.132e-09 ## alternative hypothesis: true location shift is not equal to 0

Simple regressionA simple linear regression is similar to a correlation: You want to know the link between two variables. In a regression, however, you go further thanin a correlation, as you want to know how much y depends (is predicted by) x.

A regression therefore tries to predict changes in the dependent variable based on 1 or more independent variables (predictors), even for a rangeof the IV that is yet unknown.



For example, the mtcars dataset includes cars that have between 52 and 335 horse power. By fitting a regression model with miles per gallon(mpg) as DV, and horse power (hp) as IV, we can try to predict what the mpg would be for a car with 400 hp.

The general formula for a linear regression is lm(DV ~ IV, yourdata) , where DV is the dependent variable, and IV the independent one. Get theresult using summary() , or anova() .

Let’s test if the horsepower of cars predicts their gas consumption (in mpg):

model1 <- lm(mpg ~ hp, mtcars) summary(model1)

## ## Call: ## lm(formula = mpg ~ hp, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7121 -2.1122 -0.8854 1.5819 8.2360 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 30.09886 1.63392 18.421 < 2e-16 *** ## hp -0.06823 0.01012 -6.742 1.79e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.863 on 30 degrees of freedom ## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892 ## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07

How to read the output:

R-squared informs us about the amount of variance of mpg that is explained by the model (in this case 60%). Since a simple regression onlyincludes 1 IV, R-squared also represents how much variance is explained by hp. If you take the square root of R-squared, you get the Pearsoncorrelation coefficient r: sqrt(0.6024) = 0.77 Thus, there is a strong correlation between mpg and hp.

The Adjusted R-squared indicates how well the model generalizes to the population. This value should be close to R-squared, but is typically a bitsmaller.



The significant F-Statistic in the last line informs us that the model predicts well the DV. Since the model only includes 1 IV, the F-Statistic in thiscase also tells us that hp is a good predictor of mpg.

The most important part of the output is the Coefficients table:

the estimate in the line (Intercept), gives the intercept of the model’s line, i.e. how many miles per gallon you could drive if your car had 0horse power: 30! (our model obviously doesn’t understand much of cars, otherwise it would know that a car with 0 hp does not drive at all)The line starting with “hp” gives: the estimate of how much mpg changes for 1-point change in hp (it is the slope, or gradient, of the line bestrepresenting the data). The t value indicates the magnitude of the IV’s effect, and the p value its statistical significance. Notice that theestimate for hp is negative, because indeed there is a negative relation between mpg and hp: the stronger the car, the fewer mpg of gas youcan drive. The estimate also tells us that for every additional horse power you lose 0.068 mpg. Thus, a car with 400 hp would have 2.8 mpg(the result of 30.09886 - (0.06823 * 400)).

Let’s verify these things by plotting the scatterplot with the fitted regression line. To have the line extend all the way from 0 to 400 hp, we need toinclude xlim(0,400) and the fullrange=TRUE option.

library(ggplot2) ggplot(mtcars, aes(x = hp, y = mpg)) + geom_point() + xlim(0, 400) + stat_smooth(method = "lm", fullrange = TRUE)



In the summary of the regression model, the 3rd line from the bottom provides the degrees of freedom (30).

The statistics can be reported like this: hp significantly predicted mpg, as confirmed by linear regression (t(30) = -6.742, p < .001, R-squared = .6).

Multiple regressionSame as simple regression, only that you want to predict changes in the DV based on several IVs.

A more complicated model, that icludes several predictors, could help explain more variance, and thus achieve a greater R-squared.

You can compare the fit (R-squared) of different regression models using the function anova() .

For example, do we explain more variance by also including the car’s cylinders?



model1 <- lm(mpg ~ hp, mtcars) model2 <- lm(mpg ~ hp + cyl, mtcars) anova(model1, model2)

## Analysis of Variance Table ## ## Model 1: mpg ~ hp ## Model 2: mpg ~ hp + cyl ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 30 447.67 ## 2 29 291.97 1 155.7 15.465 0.0004804 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Yes we do. The comparison of the 2 models showed that model2 fits better to the data. The R-squared of the 2nd model (the more complex onethat includes 2 predictors) is 0.74, and thus significantly bigger than the R-squared of the 1st model (0.60).

c("mod1: ", round(summary(model1)$r.squared, digits = 2))

## [1] "mod1: " "0.6"

c("mod2: ", round(summary(model2)$r.squared, digits = 2))

## [1] "mod2: " "0.74"

By using multiple instead of simple regression, we were able to explain 14% more variance!

Which of the 2 IVs is more important in mod2? To answer that question, we can look at their estimates, SEs, and t-values:

summary(model2)$coefficients



## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 36.9083305 2.19079864 16.846975 1.620660e-16 ## hp -0.0191217 0.01500073 -1.274718 2.125285e-01 ## cyl -2.2646936 0.57588924 -3.932516 4.803752e-04

Only cyl significantly predicts mpg (as indicated by the p value being < .05).Therefore, the number of cylinders of a car has a bigger effect on gasconsumption than horse power.

Another way to look at this is by using the standardized estimates. For that, we use the function lm.beta() from the package QuantPsyc

library(QuantPsyc) lm.beta(model2)

## hp cyl ## -0.2175294 -0.6710802

The standardized beta estimates indicate that gas concumption increases (mpg decreases) by 0.67 standard deviations (SDs), when the numberof cylinders increases by 1 SD. The same change in hp only leads to a 0.21 SD change in mpg.

Similarly, the 95% confidence interval (CI) of the unstandardized beta estimate for cyl does not cross 0, compared to the CI for hp. This clearlyindicates that among the 2 predictors, the number of cylinders is more reliable (the slope is always negative) and plays a bigger role.

confint(model2)

## 2.5 % 97.5 % ## (Intercept) 32.42764417 41.38901679 ## hp -0.04980163 0.01155824 ## cyl -3.44251935 -1.08686785

We can also represent the size of each estimate, and the CI around it, using the function plot_model() of the package sjPlot . I have sorted thehighest effect at the top.

library(sjPlot) plot_model(model2)



One could also add more IVs to the model. However, adding displacement (disp) does not significantly improve model fit

model2 <- lm(mpg ~ hp + cyl, mtcars) model3 <- lm(mpg ~ hp + cyl + disp, mtcars) anova(model2, model3)



## Analysis of Variance Table ## ## Model 1: mpg ~ hp + cyl ## Model 2: mpg ~ hp + cyl + disp ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 29 291.98 ## 2 28 261.37 1 30.605 3.2787 0.08093 . ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Adding weight (wt) does improve the model significantly:

model2 <- lm(mpg ~ hp + cyl, mtcars) model4 <- lm(mpg ~ hp + cyl + wt, mtcars) anova(model2, model4)

## Analysis of Variance Table ## ## Model 1: mpg ~ hp + cyl ## Model 2: mpg ~ hp + cyl + wt ## Res.Df RSS Df Sum of Sq F Pr(>F) ## 1 29 291.98 ## 2 28 176.62 1 115.35 18.287 0.0001995 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Indeed, the R-squared is now 0.84. But notice how adding weight to the regression, makes hp and cyl to insignificant predictors (their p values>.05). This is the case because the IVs included in the model are not independent of each other. The estimate (slope) of each IV assumes that theother IVs stay constant, and in fact correspond to their average values. Thus, when weight and horse power do not change, the number ofcylinders no longer plays a role!!

summary(model4)



## ## Call: ## lm(formula = mpg ~ hp + cyl + wt, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -3.9290 -1.5598 -0.5311 1.1850 5.8986 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 38.75179 1.78686 21.687 < 2e-16 *** ## hp -0.01804 0.01188 -1.519 0.140015 ## cyl -0.94162 0.55092 -1.709 0.098480 . ## wt -3.16697 0.74058 -4.276 0.000199 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 2.512 on 28 degrees of freedom ## Multiple R-squared: 0.8431, Adjusted R-squared: 0.8263 ## F-statistic: 50.17 on 3 and 28 DF, p-value: 2.184e-11

library(sjPlot) plot_model(model4)



ANOVA (one-way, two-way, within, between, mixed, rep measures)The Analysis of Variance (ANOVA) is a special case of a linear regression, or linear model, in which the predictor variable(s) are categorical(instead of continuous)

For example, you could have the between-subjects (meaning different for each subject) factor “treatment” with 3 levels: drug A, drug B, andplacebo. And/or the within-subjects (meaning every subject does all of them) factor Condition, with 3 levels: Attention to faces, to houses, andcontrol condition.

One-way ANOVA



The fact that linear regression and ANOVA are similar, becomes evident by the fact that you can also look at the results of a linear regression usingthe anova() function.

Let’s again regress mpg on hp, using the mtcars data.

summary(lm(mpg ~ hp, mtcars))

## ## Call: ## lm(formula = mpg ~ hp, data = mtcars) ## ## Residuals: ## Min 1Q Median 3Q Max ## -5.7121 -2.1122 -0.8854 1.5819 8.2360 ## ## Coefficients: ## Estimate Std. Error t value Pr(>|t|) ## (Intercept) 30.09886 1.63392 18.421 < 2e-16 *** ## hp -0.06823 0.01012 -6.742 1.79e-07 *** ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1 ## ## Residual standard error: 3.863 on 30 degrees of freedom ## Multiple R-squared: 0.6024, Adjusted R-squared: 0.5892 ## F-statistic: 45.46 on 1 and 30 DF, p-value: 1.788e-07

Notice that in the last line the F-statistics are given. Indeed, you could also look at the results of linear regression using the anova() function:

anova(lm(mpg ~ hp, mtcars))



## Analysis of Variance Table ## ## Response: mpg ## Df Sum Sq Mean Sq F value Pr(>F) ## hp 1 678.37 678.37 45.46 1.788e-07 *** ## Residuals 30 447.67 14.92 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Another way to run an ANOVA and get a results table similar to the one of SPSS, is to use the aov() function

The first argument is the dependent variable. It is followed by the tilde symbol (~) and the independent variable(s). finally you give the formula thedataset to be used. you store the analysis in an object, and then look at the results using summary()

It is also possible to get the means and number of subjs per cell using print(model.tables()) - only makes sense with a categorical predictor

Let’s test again if mpg differs by hp

model2 <- aov(mpg ~ hp, mtcars) summary(model2)

## Df Sum Sq Mean Sq F value Pr(>F) ## hp 1 678.4 678.4 45.46 1.79e-07 *** ## Residuals 30 447.7 14.9 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

And to use a categorical predictor, let’s see if mpg differs by number of cylinders (which it does)

# first convert cyl from numeric to factor mtcars_temp <- mtcars mtcars_temp$cyl <- as.factor(mtcars_temp$cyl) # then run the anova model3 <- aov(mpg ~ cyl, mtcars_temp) summary(model3)



## Df Sum Sq Mean Sq F value Pr(>F) ## cyl 2 824.8 412.4 39.7 4.98e-09 *** ## Residuals 29 301.3 10.4 ## --- ## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

A quick way to get the means and number of cars per cylinder-group is to use model.tables() :

print(model.tables(model3, "means"), digits = 3)

## Tables of means ## Grand mean ## ## 20.09062 ## ## cyl ## 4 6 8 ## 26.7 19.7 15.1 ## rep 11.0 7.0 14.0

Do we get the same means with ddply? Yes we do!

ddply(mtcars, c("cyl"), summarise, N_cars = length(cyl), mpg_mean = round(mean(mpg), digits = 1))

## cyl N_cars mpg_mean ## 1 4 11 26.7 ## 2 6 7 19.7 ## 3 8 14 15.1

To display boxplots of mpg by cylinder-group:

boxplot(mpg ~ cyl, data = mtcars_temp)



repeated measures ANOVASee this introduction (https://www.datacamp.com/courses/repeated-measures-anova) to repeated measures Analysis of Variance (rmANOVA) byDataCamp.

There are different ways of computing an rmANOVA. For example, in his book Discovering Statistics Using R (https://www.amazon.de/Discovering-Statistics-Using-Andy-Field/dp/1446200469) Andy Field lists 4 ways.

Of these, the azANOVA() function from the az package is (as of November 2018), not available for the latest R version (3.5.1).

I thus recommend using aov() , or a linear mixed model (LMM) with the function lmer() from the package lme4

…

https://www.datacamp.com/courses/repeated-measures-anova

https://www.amazon.de/Discovering-Statistics-Using-Andy-Field/dp/1446200469

r for beginners - wordpress.com · 3/25/2019 r for beginners...

Documents