Introduction to R: for Absolute Beginners
Office of Methodological & Data Sciences

Sarah Schwartz (EDUC 455, (435)797-0169, [email protected] or [email protected], http://www.cehs.usu.edu/research/omds)

BNR 278
12:30 pm - 3:20 pm, October 2, 2012

Sections: How to Speak R, Getting to Know Your Data, Fitting Statistical Models




Download & Install 2 pieces of free software

Video walk-through of both installations link: HERE (accept all defaults)

https://www.r-project.org/

• Install first

• “Software Environment”

• The brain

• We won’t work directly with it

https://www.rstudio.com/

• Install second

• “User Interface”

• The go-between for us

• Auto completes & color codes


Helpful Websites

Tutorials by William B. King, PhD, Coastal Carolina University
http://ww2.coastal.edu/kingw/statistics/R-tutorials

RexRepos: R Example Repository
http://www.uni-kiel.de/psychologie/rexrepos

R-bloggers: R news & tutorials, broad coverage
http://www.r-bloggers.com

Quick-R: Accessing the power of R, includes some graphs
http://www.statmethods.net

Psychology: Using R for psychological research
http://personality-project.org/r


Outline

How to Speak R
• Nuts & Bolts
• Using Add-on Packages
• How to Read in YOUR Own Data

Getting to Know Your Data
• Numeric Summaries
• Graphical Summaries

Fitting Statistical Models
• Motor Trend Car Road Tests
• Comparing Group Centers
• Regression Models


RStudio Workspace


Other User Interfaces Exist...

R Commander (Rcmdr) http://www.rcommander.com


Basic Calculations

prompt in the console, command-line

case sensitive ‘anova’ is not the same as ‘ANOVA’

comment lines use the # symbol at least once

1 + 3 #### addition

## [1] 4

16 / 2 #### division

## [1] 8

5 ^ 2 ###### powers

## [1] 25

sqrt(144) # square root

## [1] 12

log(1.3) #### natural logarithm

## [1] 0.2623643
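The log() call above uses base e. As a small sketch (the numbers are our own illustration, not from the workshop), base R also offers other bases, and exp() reverses the natural logarithm:

```r
# log() is the natural logarithm (base e) by default
natural <- log(1.3)

# other bases: dedicated functions or the `base` argument
log_10 <- log10(1000)        # base 10
log_2  <- log2(8)            # base 2
log_b  <- log(81, base = 3)  # any base you choose

# exp() undoes the natural logarithm
back <- exp(natural)
```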


Create & Remove Objects

# ALL OF THESE DO THE SAME THING
x=7
x = 7
x= 7
x   =   7
x = # Press Enter here.
7 # Press Enter again.

# TWO WAYS TO ASSIGN OBJECTS
Aval = 7 # use the equal
B.val = 15 # names: no spaces
Cval <- 10 # use an arrow
ls() # list environment

## [1] "Aval" "B.val" "Cval" "x"

# YOU CAN REMOVE OBJECTS AFTER CREATING THEM
rm(B.val) # remove from environment
ls() # list the environment

## [1] "Aval" "Cval" "x"

Aval # what is assigned to this?

## [1] 7

aval # CAPS MATTER!!!

## Error in eval(expr, envir, enclos): object ’aval’ not found
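If you are not sure whether an object exists under a given spelling, base R's exists() can check before you use it. A minimal sketch, assuming a fresh session where no lowercase 'aval' has been created:

```r
Aval <- 7

# exists() looks for an object by its exact name, given as a character string
found_upper <- exists("Aval")   # the object created above
found_lower <- exists("aval")   # different capitalization = different name
```

Here found_upper is TRUE and found_lower is FALSE, which is the same capitalization issue behind the "object not found" error.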


A double-equal tests for equivalence:

5 == 6 # are these equal?

## [1] FALSE

3 < 10 # 'less than'

## [1] TRUE

1 < 2 | 2 == 3 # '|' means 'or'

## [1] TRUE

Aval < Cval # can test objects

## [1] TRUE

# Create a vector with "combine"
vec1 = c(1, 2, 7, 3, 2, -3)

# Are there ANY TWOs?
2 %in% vec1

## [1] TRUE

# test EACH VALUE to see if it is TWO
2 == vec1

## [1] FALSE TRUE FALSE FALSE
## [5] TRUE FALSE

# COUNT the number of TWOs
sum(2 == vec1)

## [1] 2
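A few more base R helpers build on the same logical tests. This is only a sketch that reuses the vec1 values from the slide; the object names on the left are our own:

```r
vec1 <- c(1, 2, 7, 3, 2, -3)

pos_twos  <- which(vec1 == 2)          # WHERE are the TWOs? (positions)
any_neg   <- any(vec1 < 0)             # is at least one value negative?
all_small <- all(vec1 < 10)            # are all values below ten?
n_between <- sum(vec1 > 1 & vec1 < 5)  # '&' means 'and': count values between 1 and 5
```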


Some Possible CLASSES of R Objects

Individual VALUES:

numeric number values

logical either ‘TRUE’ (codes to 1) or ‘FALSE’ (codes to 0)

factor categorical levels, nominal or ordinal

character text (called ‘string’ in SPSS)

Data OBJECTS:

vector a 1-dimensional listing of single elements

matrix a 2-dimensional array of elements (rows & columns)

data.frame like a matrix with more formatting (nice labels); each column can have its own class
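To make the three data objects concrete, here is a small sketch with made-up values (the names v, m, and d are ours, not from the workshop):

```r
v <- c(1.5, 2, 7)                       # vector: one dimension
m <- matrix(1:6, nrow = 2, ncol = 3)    # matrix: 2 rows x 3 columns, one class
d <- data.frame(id    = 1:3,            # data.frame: columns can differ in class
                score = c(4.5, 3.2, 5.0),
                group = factor(c("a", "b", "a")))

is.matrix(m)                    # check what kind of object this is
is.data.frame(d)
col_classes <- sapply(d, class) # class of each column in the data.frame
```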


x = 1:5

class(x)

## [1] "integer"

x

## [1] 1 2 3 4 5

y = x / 3

class(y)

## [1] "numeric"

y

## [1] 0.3333333
## [2] 0.6666667
## [3] 1.0000000
## [4] 1.3333333
## [5] 1.6666667

z = x > 4

class(z)

## [1] "logical"

z

## [1] FALSE FALSE
## [3] FALSE FALSE
## [5] TRUE

c = factor(c("m","m" ,"f","f","m"))

class(c)

## [1] "factor"

c

## [1] m m f f m
## 2 Levels: f ...


Finding a Function

If you’re not sure of a function’s name, use ‘apropos’ to search for it:

apropos("round")

## [1] "round"
## [2] "round.Date"
## [3] "round.POSIXt"

Then you can search for the name of the function in the HELP tab of RStudio (or use Google).

apropos("mean")

## [1] ".colMeans"
## [2] ".rowMeans"
## [3] "colMeans"
## [4] "kmeans"
## [5] "mean"
## [6] "mean.Date"
## [7] "mean.default"
## [8] "mean.difftime"
## [9] "mean.POSIXct"
## [10] "mean.POSIXlt"
## [11] "rowMeans"
## [12] "weighted.mean"


You can use the Help tab in RStudio to find out about a function.

# Ask for the function's arguments
args(round)

## function (x, digits = 0)
## NULL

round(2.4)

## [1] 2

ceiling(2.4)

## [1] 3

floor(2.4)

## [1] 2

round(2.7)

## [1] 3

ceiling(2.7)

## [1] 3

floor(2.7)

## [1] 2
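The examples above all use the default digits = 0. As a sketch with our own numbers, here is what changing that argument does, plus one surprise worth knowing about:

```r
two_dec  <- round(2.4567, digits = 2)     # keep two decimals
hundreds <- round(1234.567, digits = -2)  # negative digits round to the left of the decimal
two_sig  <- signif(1234.567, digits = 2)  # keep two significant digits instead

# a value exactly halfway goes to the EVEN neighbor, so this gives 2, not 3
half <- round(2.5)
```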


Missing Values

data = c(1, 0, 2, 5, NA)
is.na(data)

## [1] FALSE FALSE FALSE
## [4] FALSE TRUE

anyNA(data)

## [1] TRUE

Different functions have different default ways to handle missing values. Use the HELP to determine what the default is and how to change it.

1 + 0 + 2 + 5

## [1] 8

mean(data)

## [1] NA

mean(data, na.rm = TRUE)

## [1] 2

sd(data)

## [1] NA

sd(data, na.rm = TRUE)

## [1] 2.160247
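When a function has no na.rm option, one approach is to count and drop the missing values yourself. A minimal sketch using the same data vector (the object names on the left are ours):

```r
data <- c(1, 0, 2, 5, NA)

n_missing <- sum(is.na(data))     # how many values are missing?
complete  <- data[!is.na(data)]   # keep only the observed values ('!' means 'not')
same_mean <- mean(complete)       # matches mean(data, na.rm = TRUE)
```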


R Base vs. External Packages

When you download R, you are only getting the base functions. This is a relatively small collection of functions, but it keeps R running fast.

# included in R base:
summary(data) # basic summary statistics

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##    0.00    0.75    1.50    2.00    2.75    5.00       1

table(data) # tabulates categoricals

## data
## 0 1 2 5
## 1 1 1 1

Packages are collections of R functions, data, and compiled code in a well-defined format. The directory where packages are stored is called the library.

By only downloading and installing the packages you need, on a project-by-project basis, R uses less storage space on your hard drive and less active memory.


Thousands of packages are available for download and installation. Many are vetted and distributed by CRAN, others are available on GitHub, or you can create & share packages on an individual level.

Install Download to your computer’s hard drive ONLY ONCE

Load Activate the package’s library EVERY session

# Code for installing all the
# packages in this document
install.packages(c("psych", "xlsx", "haven", "lattice", "MASS", "ggplot2", "popbio", "beeswarm"))

NOTE: when you download your first package, select a CRAN mirror (a server that hosts a copy of the package repository)
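Since installing only needs to happen once, it can help to check what is already on your computer first. The helper below is a hypothetical convenience function of our own (missing_packages is not part of R); it only uses base R's installed.packages() and setdiff():

```r
# hypothetical helper: which of these packages are NOT yet installed?
missing_packages <- function(pkgs) {
  setdiff(pkgs, rownames(installed.packages()))
}

wanted <- c("psych", "xlsx", "haven", "lattice", "MASS",
            "ggplot2", "popbio", "beeswarm")
to_install <- missing_packages(wanted)

# uncomment to download whatever is still missing
# if (length(to_install) > 0) install.packages(to_install)
```

Note that the package names are passed as ONE character vector built with c().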


The ‘psych’ Package

This package has been developed at Northwestern University since 2005 to include functions most useful for personality, psychometric, and psychological research. The package is also meant to supplement a text on psychometric theory, a draft of which is available at http://personality-project.org/r/book.

# 'LOAD' or 'activate' the package
library(psych)

This package has a nice feature for reading in data from your clipboard:

1. Highlight the data in Excel, including the first row with variable names

2. ‘Copy’ the selection, moving the information to the clipboard

3. Run the code below to store it in R as an object named bfi

# International Personality Item Pool
bfi = read.clipboard.tab()


The data are 25 personality self-report items taken from the International Personality Item Pool (http://ipip.ori.org), included as part of the Synthetic Aperture Personality Assessment (SAPA) web-based personality assessment project (http://SAPA-project.org).

5 Items x 5 Factors

• Agreeableness

• Conscientiousness

• Extraversion

• Neuroticism

• Openness

Response Scale

1. Very Inaccurate

2. Moderately Inaccurate

3. Slightly Inaccurate

4. Slightly Accurate

5. Moderately Accurate

6. Very Accurate

Demographic

• gender

• education

• age


Investigate the Form of Your Data

class(bfi) # you probably want a data.frame

## [1] "data.frame"

dim(bfi) # rows (subjects) & columns (variables)

## [1] 2800 28

names(bfi) # columns should have variable names

## [1] "A1" "A2" "A3" "A4" "A5" "C1" "C2"
## [8] "C3" "C4" "C5" "E1" "E2" "E3" "E4"
## [15] "E5" "N1" "N2" "N3" "N4" "N5" "O1"
## [22] "O2" "O3" "O4" "O5" "gender" "education" "age"

table(complete.cases(bfi)) # are the cases complete? (no missing values)

##
## FALSE  TRUE
##   564  2236
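Knowing that some cases are incomplete, a natural next question is which variables are missing values. The sketch below uses a tiny made-up data frame (toy is our own stand-in, not the bfi data) with base R only:

```r
toy <- data.frame(A1  = c(2, NA, 5, 4),
                  A2  = c(4, 4, NA, 4),
                  age = c(16, 18, 17, 17))

n_missing    <- colSums(is.na(toy))  # missing values per variable
toy_complete <- na.omit(toy)         # keep only rows with no missing values
nrow(toy_complete)                   # how many complete cases remain?
```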


Declare Categorical Variables - GENDER

# look at the raw form: 4 ways to designate a variable
bfi[, 26] # designate column number...
bfi[, c("gender")] # ...or column name...
bfi["gender"] # ...same values, kept as a one-column data.frame...
bfi$gender # ...this is the most common

class(bfi$gender) # the variable's "class"

## [1] "integer"

head(bfi$gender) # look at top cases

## [1] 1 2 2 2 1 2

summary(bfi$gender) # how does it get summarized?

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
##   1.000   1.000   2.000   1.672   2.000   2.000

table(bfi$gender) # what does "table" do?

##
##    1    2
##  919 1881


Declare Categorical Variables - GENDER

# define it as categorical: FACTOR is "nominal"
bfi$gender = factor(bfi$gender, labels = c("male", "female"))

# now it's ready to go
class(bfi$gender) # did the "class" change?

## [1] "factor"

head(bfi$gender) # does it look different?

## [1] male female female female male female
## Levels: male female

summary(bfi$gender) # is the summary the same?

##   male female
##    919   1881

levels(bfi$gender) # this lists the LABELS

## [1] "male" "female"


Declare Categorical Variables - EDUCATION

table(bfi$education) # look at the raw form

##
##    1    2    3    4    5
##  224  292 1249  394  418

# define as categorical: ORDERED is "ordinal"
bfi$education = ordered(bfi$education,
                        labels = c("<HS", "HS", "HS+ ", "degree", "grad+"))
# now it's ready to go
head(bfi$education, n = 15)

## [1] <NA> <NA> <NA> <NA> <NA> HS+ <NA> HS <HS <NA> <HS <NA> <NA> <NA> <HS
## Levels: <HS < HS < HS+ < degree < grad+

summary(bfi$education)

##    <HS     HS   HS+  degree  grad+   NA's
##    224    292   1249    394    418    223

levels(bfi$education)

## [1] "<HS" "HS" "HS+ " "degree" "grad+"
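What you gain from ORDERED over a plain factor is that "less than" and "greater than" make sense. A small sketch with made-up codes (edu_codes and edu are our own names); we also spell out levels = 1:5 as an assumption, so the labels still line up when a code happens to be absent from the data:

```r
edu_codes <- c(3, 1, 5, 2, 3, NA)
edu <- ordered(edu_codes,
               levels = 1:5,
               labels = c("<HS", "HS", "HS+", "degree", "grad+"))

at_least_some_college <- edu >= "HS+"   # order comparisons work for ordered factors
table(edu, useNA = "ifany")             # counts per level, including missing
codes_back <- as.integer(edu)           # recover the underlying numeric codes
```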


bfi[1:3, ] # specify rows (subjects) in FRONT of the comma

##       A1 A2 A3 A4 A5 C1 C2 C3 C4 C5 E1 E2 E3 E4 E5 N1 N2 N3 N4 N5 O1 O2 O3 O4 O5 gender
## 61617  2  4  3  4  4  2  3  3  4  4  3  3  3  4  4  3  4  2  2  3  3  6  3  4  3   male
## 61618  2  4  5  2  5  5  4  4  3  4  1  1  6  4  3  3  3  3  5  5  4  2  4  3  3 female
## 61620  5  4  5  4  4  4  5  4  2  5  2  4  4  4  5  4  5  4  2  3  4  2  5  5  2 female
##       education age
## 61617      <NA>  16
## 61618      <NA>  18
## 61620      <NA>  17

bfi[1:4, 1:7] # specify columns (variables) AFTER the comma

##       A1 A2 A3 A4 A5 C1 C2
## 61617  2  4  3  4  4  2  3
## 61618  2  4  5  2  5  5  4
## 61620  5  4  5  4  4  4  5
## 61621  4  4  6  5  5  4  4

# ...or list the names of the variables (after comma)
bfi[1:3, c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]

##       A1 A2 A3 A4 A5 gender education age
## 61617  2  4  3  4  4   male      <NA>  16
## 61618  2  4  5  2  5 female      <NA>  18
## 61620  5  4  5  4  4 female      <NA>  17


Saving a Reduced Dataset

# suppose I'm only interested in subjects under the age of 35
table(bfi$age < 35)

##
## FALSE  TRUE
##   738  2062

# AND I only want to keep a few variables (for demo)
bfiA = bfi[bfi$age < 35,
           c("A1", "A2", "A3","A4", "A5", "gender", "education", "age")]

dim(bfiA) # rows (subjects) & columns (variables) kept

## [1] 2062 8

headTail(bfiA) # see a few lines from top and bottom

##        A1  A2  A3  A4  A5 gender education age
## 61617   2   4   3   4   4   male      <NA>  16
## 61618   2   4   5   2   5 female      <NA>  18
## 61620   5   4   5   4   4 female      <NA>  17
## 61621   4   4   6   5   5 female      <NA>  17
## ...   ... ... ... ... ...   <NA>      <NA> ...
## 67551   6   1   3   3   3   male       HS+  19
## 67552   2   4   4   3   5   male    degree  27
## 67556   2   3   5   2   5 female    degree  29
## 67559   5   2   2   4   4   male    degree  31
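One thing to watch with this style of row selection: it works cleanly here because age has no missing values. If the filtering variable contains NA, a logical test returns NA for that row and you get a row of NAs back. The sketch below shows this on a made-up data frame (toy is ours) along with two base R alternatives:

```r
toy <- data.frame(id  = 1:5,
                  age = c(16, 40, NA, 29, 35))

by_logical <- toy[toy$age < 35, ]         # the NA age comes back as a row of NAs
by_which   <- toy[which(toy$age < 35), ]  # which() keeps only the TRUE positions
by_subset  <- subset(toy, age < 35)       # subset() also drops the NA comparison
```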


How to Read in YOUR Own Data

Before you can load your data, you need to tell R where to look.

# get the working directory
getwd()

## [1] "C:/Users/A00315273/Box Sync/Office of Research Services/OMDS/OMDS Workshops/OMDS intro to R"

Notice: you need to use forward slashes (/) instead of backslashes (\)

# change the working directory to YOUR COMPUTER!!!
setwd("C:/Users/A00315273/OMDSworkshop")

If the data is stored in a TEXT file (for example, comma delimited)...

# these functions are part of BASE R
myData = read.table("data.txt", header = TRUE) # whitespace-delimited text
myData = read.csv("data.csv", header = TRUE) # comma-delimited text
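To try the reading functions without touching your own files, you can write a small file to a temporary folder and read it straight back. This is only a sketch: the toy data and the file location are our own choices, not part of the workshop materials.

```r
# build a path inside R's temporary folder (forward slashes handled for you)
path <- file.path(tempdir(), "data.csv")

toy <- data.frame(id = 1:3, score = c(4.5, NA, 5.0))
write.csv(toy, path, row.names = FALSE)   # row.names = FALSE avoids an extra column

myData <- read.csv(path, header = TRUE)   # first row holds the variable names
str(myData)                               # check classes after reading
```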


Best Practices: Dataset in Excel

Often, you may enter your data into Excel.

Make sure the FIRST ROW contains the names of variables.

Names, Values, & Fields

• FIRST variable is unit identification

• NEVER use white SPACES

• AVOID symbols or punctuation: ? [ } * $ %

• USE . or _ to push words together

• KEEP it short, but meaningful

• ALWAYS use numbers over text

• LEAVE missing cells blank (not .)
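The naming rules above matter because R repairs names it does not like. Base R's make.names() shows what a name would be turned into; the messy names below are made-up examples:

```r
messy <- c("Subject ID", "score%", "2nd.visit", "age")

# spaces and symbols become dots; names starting with a digit get an "X" prefix
clean <- make.names(messy)
clean
```

Starting with short names built from letters, numbers, dots, or underscores avoids these automatic changes.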


Read in Data from Excel Files

Bad Example

Much Better!


Read in Data from Excel Files

# there's a package for that!
# "Read, write, format Excel 2007 (xlsx) files"
library(xlsx)

# read.xlsx tries to guess variable classes
# read.xlsx2 is faster on bigger datasets

myData = read.xlsx("data.xlsx",
                   sheetIndex = 1, # or use sheetName, instead
                   header = TRUE) # TRUE if 1st row = names

NOTE: If you are having problems with Excel datasets, try saving the data as a “.csv” file (comma delimited) and use the read.csv function in base R.


Read in Data from SPSS, SAS, & Stata Files

# New package this summer...Hadley Wickham is my HERO!
library(haven)

# Currently haven can read and write:
# logical, integer, numeric, character and factors

# SPSS: Supports both sav & por files
myData = read_spss("data.sav")
myData = read_sav("data.sav")
myData = read_por("data.por")

# SAS: Supports both sas7bdat & sas7bcat files
myData = read_sas("data.sas7bdat")

# Stata
myData = read_stata("data.dta")
myData = read_dta("data.dta")

# NOTE all labeled variables are a new class: "labelled"
# ... use as_factor() to treat the variable as categorical
# ... use zap_labels() to treat the variable as continuous



Mean, Standard Deviation, Etc.

# descriptives on all variables
describe(bfiA)

##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 2053  2.52 1.42      2    2.36 1.48   1   6     5  0.73    -0.44 0.03
## A2            2 2040  4.75 1.20      5    4.92 1.48   1   6     5 -1.07     0.86 0.03
## A3            3 2048  4.57 1.31      5    4.75 1.48   1   6     5 -0.97     0.39 0.03
## A4            4 2048  4.59 1.54      5    4.81 1.48   1   6     5 -0.91    -0.29 0.03
## A5            5 2050  4.50 1.26      5    4.64 1.48   1   6     5 -0.79     0.07 0.03
## gender*       6 2062  1.66 0.47      2    1.70 0.00   1   2     1 -0.68    -1.54 0.01
## education*    7 1853  3.09 1.06      3    3.11 0.00   1   5     4 -0.04    -0.03 0.02
## age           8 2062 23.16 5.22     22   22.98 5.93   3  34    31  0.25    -0.59 0.12


Mean, Standard Deviation, Etc.

# split by a grouping variable
describeBy(bfiA, bfiA$gender)

## group: male
##            vars   n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 699  2.81 1.43      3    2.71 1.48   1   6     5  0.48    -0.75 0.05
## A2            2 691  4.46 1.30      5    4.61 1.48   1   6     5 -0.88     0.25 0.05
## A3            3 695  4.38 1.30      5    4.52 1.48   1   6     5 -0.78     0.01 0.05
## A4            4 697  4.31 1.51      5    4.45 1.48   1   6     5 -0.64    -0.62 0.06
## A5            5 695  4.35 1.33      5    4.49 1.48   1   6     5 -0.74    -0.13 0.05
## gender*       6 699  1.00 0.00      1    1.00 0.00   1   1     0   NaN      NaN 0.00
## education*    7 626  3.11 1.15      3    3.14 1.48   1   5     4 -0.04    -0.40 0.05
## age           8 699 22.83 5.04     22   22.63 4.45   3  34    31  0.27    -0.29 0.19
## -------------------------------------------------------------------
## group: female
##            vars    n  mean   sd median trimmed  mad min max range  skew kurtosis   se
## A1            1 1354  2.37 1.39      2    2.17 1.48   1   6     5  0.88    -0.16 0.04
## A2            2 1349  4.90 1.12      5    5.07 1.48   1   6     5 -1.16     1.22 0.03
## A3            3 1353  4.67 1.31      5    4.86 1.48   1   6     5 -1.10     0.68 0.04
## A4            4 1351  4.74 1.53      5    4.99 1.48   1   6     5 -1.08     0.04 0.04
## A5            5 1355  4.58 1.22      5    4.71 1.48   1   6     5 -0.80     0.14 0.03
## gender*       6 1363  2.00 0.00      2    2.00 0.00   2   2     0   NaN      NaN 0.00
## education*    7 1227  3.08 1.02      3    3.09 0.00   1   5     4 -0.04     0.21 0.03
## age           8 1363 23.32 5.31     23   23.17 5.93   9  34    25  0.23    -0.73 0.14
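describeBy() comes from the psych package. If you only need one statistic per group, base R can do it too. As a rough sketch (not the workshop's own code), here is the idea on the built-in mtcars data, the Motor Trend Car Road Tests dataset named in the outline:

```r
# mean miles-per-gallon for each number of cylinders, using a formula
by_cyl <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)
by_cyl

# same idea with tapply(): values, grouping variable, function
sd_by_cyl <- tapply(mtcars$mpg, mtcars$cyl, sd)
```

The formula mpg ~ cyl reads as "mpg split by cyl", the same pattern used for boxplots later on.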


Cross Tabulations & χ² Test for Independence

# cross-tabulate two categorical variables
# If a variable is included on the left side of the formula,
# it is assumed to be a vector of frequencies
edXgender = xtabs(~ education + gender, data = bfiA)
edXgender

##          gender
## education male female
##    <HS      71    109
##    HS       70    121
##    HS+     303    691
##    degree   84    169
##    grad+    98    137

# chi-squared test for independence
chisq.test(edXgender)

##
##  Pearson's Chi-squared test
##
## data:  edXgender
## X-squared = 14.746, df = 4, p-value = 0.005258
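A significant test says the two variables are associated, but not where. One way to look further is to store the test result and inspect its pieces. The sketch below retypes the counts from the table above into a matrix so it runs without the bfi data; expected and stdres are components of the object that chisq.test() returns:

```r
edXgender <- matrix(c( 71, 109,
                       70, 121,
                      303, 691,
                       84, 169,
                       98, 137),
                    ncol = 2, byrow = TRUE,
                    dimnames = list(education = c("<HS", "HS", "HS+", "degree", "grad+"),
                                    gender    = c("male", "female")))

test <- chisq.test(edXgender)   # store the result instead of just printing it

test$statistic          # the X-squared value
test$expected           # counts expected if education and gender were independent
round(test$stdres, 2)   # standardized residuals: large values flag the cells that differ most
```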


Correlation Matrix

How strong is the association between the 5 Agreeableness items?

# reduce the dataset for ease of demonstration
bfiAonly = bfi[, c("A1", "A2", "A3", "A4", "A5")]

# GET CORRELATION VALUES (cor() does not report p-values)
cor(bfiAonly, use = "pairwise.complete.obs")

##            A1         A2         A3         A4         A5
## A1  1.0000000 -0.3401932 -0.2652471 -0.1464245 -0.1814383
## A2 -0.3401932  1.0000000  0.4850980  0.3350872  0.3900836
## A3 -0.2652471  0.4850980  1.0000000  0.3604283  0.5041411
## A4 -0.1464245  0.3350872  0.3604283  1.0000000  0.3075373
## A5 -0.1814383  0.3900836  0.5041411  0.3075373  1.0000000

round(cor(bfiAonly, use = "pairwise.complete.obs"), 3)

##        A1     A2     A3     A4     A5
## A1  1.000 -0.340 -0.265 -0.146 -0.181
## A2 -0.340  1.000  0.485  0.335  0.390
## A3 -0.265  0.485  1.000  0.360  0.504
## A4 -0.146  0.335  0.360  1.000  0.308
## A5 -0.181  0.390  0.504  0.308  1.000
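For a single pair of variables, base R's cor.test() gives the correlation together with a p-value and confidence interval. A minimal sketch on the built-in mtcars data (our choice of example, not the bfi items):

```r
# Pearson correlation between fuel economy and weight, with a test
res <- cor.test(mtcars$mpg, mtcars$wt)
res$estimate   # the correlation
res$p.value    # its p-value
res$conf.int   # 95% confidence interval

# rank-based (Spearman) version; exact = FALSE because the data contain ties
res_s <- cor.test(mtcars$mpg, mtcars$wt, method = "spearman", exact = FALSE)
```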


Correlation Matrix with p-values

corr.test(bfiAonly,
          adjust = "none",
          method = "spearman")

## Call:corr.test(x = bfiAonly, method = "spearman", adjust = "none")
## Correlation matrix
##       A1    A2    A3    A4    A5
## A1  1.00 -0.37 -0.30 -0.16 -0.22
## A2 -0.37  1.00  0.50  0.34  0.40
## A3 -0.30  0.50  1.00  0.36  0.53
## A4 -0.16  0.34  0.36  1.00  0.31
## A5 -0.22  0.40  0.53  0.31  1.00
## Sample Size
##      A1   A2   A3   A4   A5
## A1 2784 2757 2759 2767 2769
## A2 2757 2773 2751 2758 2757
## A3 2759 2751 2774 2759 2758
## A4 2767 2758 2759 2781 2765
## A5 2769 2757 2758 2765 2784
## Probability values (Entries above the diagonal are adjusted for multiple tests.)
##    A1 A2 A3 A4 A5
## A1  0  0  0  0  0
## A2  0  0  0  0  0
## A3  0  0  0  0  0
## A4  0  0  0  0  0
## A5  0  0  0  0  0
##
## To see confidence intervals of the correlations, print with the short=FALSE option


Correlation Matrix: Visualize

A picture can be worth a thousand words

cor.plot(cor(bfiAonly, use = "pairwise.complete.obs", method = "spearman"))

[Figure: "Correlation plot"; heat map of items A1 to A5 against A1 to A5, with a color scale running from −1 to 1]


psych’s All-in-One Plot

A picture can be worth a thousand words

# plots pairs of variables
pairs.panels(bfiAonly)

[Figure: scatterplot matrix of items A1 to A5 (each scored 1 to 6) with the pairwise correlations printed in the panels: A1 with A2 to A5: −0.34, −0.27, −0.15, −0.18; A2 with A3 to A5: 0.49, 0.34, 0.39; A3 with A4 and A5: 0.36, 0.50; A4 with A5: 0.31]


Histogram: Defaults vs. Options

# all defaults
hist(bfi$A1)

[Figure: "Histogram of bfi$A1"; x-axis: bfi$A1 (1 to 6); y-axis: Frequency]

# better with some options
hist(bfi$A1,
     breaks = 0.5:6.5,
     main = "This is Much Better",
     xlab = "Item A-1",
     col = "gray")

[Figure: "This is Much Better"; x-axis: Item A-1 (1 to 6); y-axis: Frequency]
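Why breaks = 0.5:6.5? Each bin then runs from half a point below to half a point above a whole number, so every response value (1 to 6) gets its own bar centered on that number. The sketch below checks this on a few made-up responses; plot = FALSE asks hist() to return the bin information without drawing:

```r
x <- c(1, 1, 2, 3, 3, 3, 6)     # made-up item responses on a 1-6 scale

h <- hist(x, breaks = 0.5:6.5, plot = FALSE)
h$mids     # bar centers: one per response value
h$counts   # how many responses fall in each bar
```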


Histogram: Use More Code!

[Figure: histogram titled "Ready for Publication"; x-axis: ''Am indifferent to the feelings of others'', Agreeableness Item #1 (q.1146), with bars labeled Very, Mod, Slight (Inaccurate) and Slight, Mod, Very (Accurate); y-axis: Frequency (0 to 1000)]


Density Plot: Continuous Distribution

# one way to put two plots on the same page
par(mfrow=c(1, 2)) # 1 row & 2 columns
hist(bfi$age) # rough distribution
plot(density(bfi$age, na.rm = TRUE)) # smoothed out

[Figure, left panel: "Histogram of bfi$age"; x-axis: bfi$age (0 to 80); y-axis: Frequency. Right panel: "density.default(x = bfi$age, na.rm = TRUE)"; x-axis: N = 2800, Bandwidth = 2.047; y-axis: Density]


Density Plot: AGE

[Figure: "Compare to the Normal Curve"; x-axis: Age (0 to 80); y-axis: Proportion; legend "Curves": density, normal]


Bar Plot: Categorical Distribution

par(mfrow=c(1, 2)) # 1 row & 2 columns

# one variable at a time (must give it counts!)
barplot(table(bfi$gender))
barplot(table(bfi$education))

[Figure, left panel: bar plot of counts by gender (male, female). Right panel: bar plot of counts by education level (<HS to grad+)]


Bar Plot: Compare 2 Categorical Distributions

[Figure: bar plot titled "Synthetic Aperture Personality Assessment (SAPA)"; x-axis: Highest Level of Education (<HS, HS, HS+, degree, grad+); y-axis: Frequency; legend: male, female]


Boxplots: AGE & EDUCATION

par(mfrow=c(1, 2)) # 1 row & 2 columns

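# all together
boxplot(bfiA$age)

# split by education groups
boxplot(bfi$age ~ bfi$education)

[Figure, left panel: boxplot of age for all subjects in bfiA. Right panel: boxplots of age (0 to 80) by education level (<HS to grad+)]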


Boxplots: Use More Options

# reset to one plot per page
par(mfrow=c(1, 1))

# make it look better
boxplot(age ~ education, data = bfi,
        col = heat.colors(5),
        main = "Build a Better Boxplots",
        xlab = "Highest Education Obtained",
        ylab = "Age (years)")

[Figure: "Build a Better Boxplots"; x-axis: Highest Education Obtained (<HS, HS, HS+, degree, grad+); y-axis: Age (years), 0 to 80]


Boxplots: AGE & EDUCATION

[Figure: "Compare the Genders": boxplots of Age (years) by Highest Education Obtained (<HS, HS, HS+, degree, grad+), shown separately for male and female respondents.]


Scatterplots: Display Associations

Jitter the education level so dots don’t cover each other so much.

# put 3 plots in one row/page
par(mfrow = c(1, 3))

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 0.25),
     main = "factor = 0.25")

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 1),
     main = "factor = 1")

plot(bfi$age,
     jitter(as.numeric(bfi$education), factor = 2),
     main = "factor = 2")

[Figure: three scatterplots of jittered education level (1 to 5) against bfi$age (0 to 80), one each for factor = 0.25, factor = 1, and factor = 2.]


Scatterplots: AGE & EDUCATION

[Figure: "Jitter the Ordinal Variable": scatterplot of Education (<HS, HS, HS+, degree, grad+) against Age (years), 0 to 80.]
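The slide shows this figure without its code. A minimal sketch of how the level names could replace the numeric codes on the y-axis follows. It assumes simulated data (the `demo` data frame is hypothetical and stands in for `bfi`).

```r
# Simulated stand-in for the bfi data (hypothetical values, for illustration only)
set.seed(2)
edu_levels <- c("<HS", "HS", "HS+", "degree", "grad+")
demo <- data.frame(
  age       = round(runif(300, min = 10, max = 80)),
  education = factor(sample(edu_levels, 300, replace = TRUE),
                     levels = edu_levels)
)

# Convert the factor to its integer codes (1 to 5), then add a little noise
edu_jittered <- jitter(as.numeric(demo$education), factor = 1)

# Suppress the default y-axis so the level names can be drawn instead
plot(demo$age, edu_jittered,
     yaxt = "n",
     main = "Jitter the Ordinal Variable",
     xlab = "Age (years)",
     ylab = "Education")
axis(side = 2, at = 1:5, labels = edu_levels, las = 1)
```

The jitter only changes where the points are drawn. The underlying education codes are untouched.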


Bubble Plot: Helpful with Overplotting

If you can dream of a type of plot, you can create it!

# aggregate the data
bfiAag = aggregate(bfiA,
                   by = list(bfiA$A1, bfiA$A2),
                   length)

# circle's area ~ number of points
symbols(bfiAag$Group.1,
        bfiAag$Group.2,
        circles = sqrt(bfiAag$A1/pi)/50,
        inches = FALSE,
        main = "Bubble Plot",
        xlab = "item A1",
        ylab = "item A2")

[Figure: "Bubble Plot" of item A2 against item A1 (both scored 1 to 6), with circle area proportional to the number of responses.]


Outline

How to Speak R
• Nuts & Bolts
• Using Add-on Packages
• How to Read in YOUR Own Data

Getting to Know Your Data
• Numeric Summaries
• Graphical Summaries

Fitting Statistical Models
• Motor Trend Car Road Tests
• Comparing Group Centers
• Regression Models


Motor Trend Car Road Tests

The data was extracted from the 1974 Motor Trend US magazine, and comprises fuel consumption and 10 aspects of automobile design and performance for 32 automobiles (1973-74 models).

mpg Miles/(US) gallon

cyl Number of cylinders

disp Displacement (cu.in.)

hp Gross horsepower

drat Rear axle ratio

wt Weight (lb/1000)

qsec 1/4 mile time

vs Engine shape (0 = V-shaped, 1 = straight)

am Transmission (0 = automatic, 1 = manual)

gear Number of forward gears

carb Number of carburetors


Load car Package & the mtcars Data

# Load a New Package:
library(car) # "Companion to Applied Regression" (a textbook)

# mtcars ships with base R (the datasets package); this makes it active in the environment
data(mtcars)

# check out the data
dim(mtcars)

## [1] 32 11

names(mtcars)

## [1] "mpg" "cyl" "disp" "hp" "drat" "wt" "qsec" "vs" "am" "gear" "carb"

# set the categorical variables
mtcars$vs = factor(mtcars$vs, labels = c("v", "s"))
mtcars$am = factor(mtcars$am, labels = c("automatic", "manual"))
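`factor()` assigns the labels in the sorted order of the original values, so the first label goes with 0 and the second with 1. A minimal sketch of a quick check, using a separate variable (`am_labelled`, a name chosen here for illustration) so the original column stays untouched:

```r
data(mtcars)

# Label a copy of the 0/1 transmission codes
am_labelled <- factor(mtcars$am, labels = c("automatic", "manual"))

# Cross-tabulate the original codes against the new labels;
# all counts should fall on the diagonal
check <- table(code = mtcars$am, label = am_labelled)
print(check)
```

A check like this is worth running whenever the labels are typed by hand, because a swapped pair of labels produces no error.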


headTail(mtcars)

##                 mpg cyl disp  hp drat   wt  qsec   vs        am gear carb
## Mazda RX4        21   6  160 110  3.9 2.62 16.46    v    manual    4    4
## Mazda RX4 Wag    21   6  160 110  3.9 2.88 17.02    v    manual    4    4
## Datsun 710     22.8   4  108  93 3.85 2.32 18.61    s    manual    4    1
## Hornet 4 Drive 21.4   6  258 110 3.08 3.21 19.44    s automatic    3    1
## ...             ... ...  ... ...  ...  ...   ... <NA>      <NA>  ...  ...
## Ford Pantera L 15.8   8  351 264 4.22 3.17  14.5    v    manual    5    4
## Ferrari Dino   19.7   6  145 175 3.62 2.77  15.5    v    manual    5    6
## Maserati Bora    15   8  301 335 3.54 3.57  14.6    v    manual    5    8
## Volvo 142E     21.4   4  121 109 4.11 2.78  18.6    s    manual    4    2

summary(mtcars)

##       mpg             cyl             disp             hp             drat
##  Min.   :10.40   Min.   :4.000   Min.   : 71.1   Min.   : 52.0   Min.   :2.760
##  1st Qu.:15.43   1st Qu.:4.000   1st Qu.:120.8   1st Qu.: 96.5   1st Qu.:3.080
##  Median :19.20   Median :6.000   Median :196.3   Median :123.0   Median :3.695
##  Mean   :20.09   Mean   :6.188   Mean   :230.7   Mean   :146.7   Mean   :3.597
##  3rd Qu.:22.80   3rd Qu.:8.000   3rd Qu.:326.0   3rd Qu.:180.0   3rd Qu.:3.920
##  Max.   :33.90   Max.   :8.000   Max.   :472.0   Max.   :335.0   Max.   :4.930
##        wt             qsec        vs          am          gear            carb
##  Min.   :1.513   Min.   :14.50   v:18   automatic:19   Min.   :3.000   Min.   :1.000
##  1st Qu.:2.581   1st Qu.:16.89   s:14   manual   :13   1st Qu.:3.000   1st Qu.:2.000
##  Median :3.325   Median :17.71                         Median :4.000   Median :2.000
##  Mean   :3.217   Mean   :17.85                         Mean   :3.688   Mean   :2.812
##  3rd Qu.:3.610   3rd Qu.:18.90                         3rd Qu.:4.000   3rd Qu.:4.000
##  Max.   :5.424   Max.   :22.90                         Max.   :5.000   Max.   :8.000


Test Central Differences in 2 Independent Groups

# find the means
describeBy(mtcars$mpg, mtcars$am)

## group: automatic
##   vars  n  mean   sd median trimmed  mad  min  max range skew kurtosis   se
## 1    1 19 17.15 3.83   17.3   17.12 3.11 10.4 24.4    14 0.01     -0.8 0.88
## -------------------------------------------------------------------
## group: manual
##   vars  n  mean   sd median trimmed  mad min  max range skew kurtosis   se
## 1    1 13 24.39 6.17   22.8   24.38 6.67  15 33.9  18.9 0.05    -1.46 1.71

# view the two groups side-by-side
boxplot(mpg ~ am, data = mtcars, horizontal = TRUE)

[Figure: horizontal boxplots of mpg (10 to 30) for automatic and manual transmissions.]


Test Central Differences in 2 Independent Groups

PARAMETRIC t-test for means, assumes normality

t.test(mpg ~ am, data = mtcars)

##
##  Welch Two Sample t-test
##
## data:  mpg by am
## t = -3.7671, df = 18.332, p-value = 0.001374
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -11.280194  -3.209684
## sample estimates:
## mean in group automatic    mean in group manual
##                17.14737                24.39231

NON-PARAMETRIC Mann-Whitney U Test, based on ranks

wilcox.test(mpg ~ am, data = mtcars)

##
##  Wilcoxon rank sum test with continuity correction
##
## data:  mpg by am
## W = 42, p-value = 0.001871
## alternative hypothesis: true location shift is not equal to 0
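Both tests return a list-like object, so individual pieces of the printed output can be pulled out by name. A minimal sketch follows. The object names `tt` and `wx` are chosen here for illustration, and `exact = FALSE` is added as an assumption to avoid the warning R gives when the data contain ties.

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

# Welch t-test (unequal variances) is the default for t.test()
tt <- t.test(mpg ~ am, data = mtcars)

# Rank-based test; exact = FALSE requests the normal approximation
wx <- wilcox.test(mpg ~ am, data = mtcars, exact = FALSE)

tt$estimate   # the two group means
tt$conf.int   # 95% CI for the difference (automatic minus manual)
tt$p.value    # p-value from the t-test
wx$p.value    # p-value from the rank sum test
```

Use `names(tt)` to list everything stored in the result.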


More than Two Groups?

# plot to investigate
boxplot(drat ~ cyl,
        data = mtcars,
        main = "Between vs. Within",
        xlab = "Number of Cylinders",
        ylab = "Rear Axle Ratio",
        col = "light gray")

grid()

# we can use another package
# (stripchart() itself is base R graphics; the beeswarm package adds beeswarm() as an alternative)
library(beeswarm)

stripchart(drat ~ cyl,
           data = mtcars,
           vertical = TRUE,
           method = 'jitter',
           jitter = 0.2,
           cex = 1,
           pch = 16,
           col = c("red", "blue", "dark green"),
           add = TRUE)

[Figure: "Between vs. Within": boxplots of Rear Axle Ratio (3.0 to 5.0) by Number of Cylinders (4, 6, 8), with the jittered data points overlaid.]


ANOVA

# run the ANOVA
# (cyl is stored as a number, so it enters as a single 1-df term; see Df in the output)
anova1 = aov(drat ~ cyl, data = mtcars)
summary(anova1)

##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342   28.81 8.24e-06 ***
## Residuals   30  4.521   0.151
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

# to get type III sums of squares
Anova(anova1, type = "III")

## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df F value    Pr(>F)
## (Intercept) 57.217  1 379.714 < 2.2e-16 ***
## cyl          4.342  1  28.814 8.245e-06 ***
## Residuals    4.521 30
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
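In the output above, cyl has Df = 1, which means R treated it as a number (a straight-line trend across 4, 6, and 8) and not as three separate groups. A minimal sketch of the group-comparison version follows. It assumes a new factor column, `cyl_f` (a name chosen here for illustration). Its results are not shown on the slides, so no output is reproduced.

```r
data(mtcars)

# Treat cylinder count as a grouping factor with 3 levels (4, 6, 8)
mtcars$cyl_f <- factor(mtcars$cyl)

anova_groups <- aov(drat ~ cyl_f, data = mtcars)
summary(anova_groups)

# Degrees of freedom for the group term: number of groups minus 1
df_groups <- summary(anova_groups)[[1]][["Df"]][1]
df_groups
```

With three groups the term uses 2 degrees of freedom, which matches the df reported by a Kruskal-Wallis test on the same grouping.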


ANCOVA

# add a continuous covariate
anova2 = aov(drat ~ cyl + wt, data = mtcars)
summary(anova2)

##             Df Sum Sq Mean Sq F value   Pr(>F)
## cyl          1  4.342   4.342  32.284 3.83e-06 ***
## wt           1  0.620   0.620   4.613   0.0402 *
## Residuals   29  3.900   0.134
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Anova(anova2, type = "III")

## Anova Table (Type III tests)
##
## Response: drat
##             Sum Sq Df  F value  Pr(>F)
## (Intercept) 56.578  1 420.6933 < 2e-16 ***
## cyl          0.464  1   3.4493 0.07346 .
## wt           0.620  1   4.6129 0.04022 *
## Residuals    3.900 29
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
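The two tables above disagree about cyl (4.342 vs. 0.464) because `summary()` on an `aov` fit reports sequential (Type I) sums of squares, where each term is adjusted only for the terms listed before it. A minimal sketch of this order dependence, using base R only (the object names are chosen here for illustration):

```r
data(mtcars)

# Same two predictors, entered in opposite orders
fit_cyl_first <- aov(drat ~ cyl + wt, data = mtcars)
fit_wt_first  <- aov(drat ~ wt + cyl, data = mtcars)

# Sequential (Type I) sums of squares depend on the order of entry
ss_cyl_first <- summary(fit_cyl_first)[[1]][["Sum Sq"]]
ss_wt_first  <- summary(fit_wt_first)[[1]][["Sum Sq"]]

ss_cyl_first   # order: cyl, wt, Residuals
ss_wt_first    # order: wt, cyl, Residuals
```

When cyl is entered last, its sequential sum of squares equals the Type III value in the table above, because in both cases it is adjusted for wt.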


Kruskal Wallis Test

# non-parametric version: uses ranks instead of means
kruskal.test(drat ~ cyl, data = mtcars)

##
##  Kruskal-Wallis rank sum test
##
## data:  drat by cyl
## Kruskal-Wallis chi-squared = 14.395, df = 2, p-value = 0.0007486


Simple Linear Regression: Fit Model

# Simple Linear Regression
linreg = lm(mpg ~ wt, data = mtcars)
slr = summary(linreg)
slr

##
## Call:
## lm(formula = mpg ~ wt, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -4.5432 -2.3647 -0.1252  1.4096  6.8727
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept)  37.2851     1.8776  19.858  < 2e-16 ***
## wt           -5.3445     0.5591  -9.559 1.29e-10 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.046 on 30 degrees of freedom
## Multiple R-squared:  0.7528, Adjusted R-squared:  0.7446
## F-statistic: 91.38 on 1 and 30 DF,  p-value: 1.294e-10

summary(linreg)$r.squared

## [1] 0.7528328

summary(linreg)$adj.r.squared

## [1] 0.7445939
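Once the model is fit, the coefficients can be used to predict mpg for new weights. A minimal sketch, assuming two hypothetical cars (the `new_cars` data frame is made up for illustration):

```r
data(mtcars)
linreg <- lm(mpg ~ wt, data = mtcars)

coef(linreg)   # intercept and slope

# Hypothetical cars weighing 2,500 and 3,500 lb (wt is recorded in 1000 lb)
new_cars <- data.frame(wt = c(2.5, 3.5))
predicted <- predict(linreg, newdata = new_cars)
predicted

# The same predictions "by hand": intercept + slope * weight
by_hand <- coef(linreg)[["(Intercept)"]] + coef(linreg)[["wt"]] * new_cars$wt
by_hand
```

The column name in `new_cars` must match the predictor name used in the model formula, or `predict()` will not find it.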


Simple Linear Regression: Visualize the Fit

# Plot of relationship and least squares line
plot(mtcars$wt, mtcars$mpg)
abline(linreg, col = "red")
text(x = 2,
     y = 12,
     labels = bquote(~R^2 == .(round(slr$r.squared, 3))),
     col = "red")

text(x = 4.75,
     y = 30,
     labels = bquote(~adj-R^2 == .(round(slr$adj.r.squared, 3))),
     col = "blue")

title(main = "Linear Regression")
grid()

[Figure: "Linear Regression": scatterplot of mtcars$mpg against mtcars$wt with the fitted line, annotated with R^2 = 0.753 and adj-R^2 = 0.745.]


Introducing ggplot2

# a VERY COOL plotting package for next semester's workshop...
library(ggplot2)

ggplot(mtcars, aes(x = wt, y = mpg)) +
  geom_point() +
  stat_smooth(method = "lm", col = "red") +
  facet_grid(. ~ am) +
  theme_bw()

[Figure: scatterplots of mpg against wt with fitted regression lines, in separate panels for automatic and manual transmissions.]


Multiple Linear Regression: Fit the Model

# add several variables to the model
linreg2 = lm(mpg ~ wt + cyl + hp, data = mtcars)
summary(linreg2)

##
## Call:
## lm(formula = mpg ~ wt + cyl + hp, data = mtcars)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -3.9290 -1.5598 -0.5311  1.1850  5.8986
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 38.75179    1.78686  21.687  < 2e-16 ***
## wt          -3.16697    0.74058  -4.276 0.000199 ***
## cyl         -0.94162    0.55092  -1.709 0.098480 .
## hp          -0.01804    0.01188  -1.519 0.140015
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.512 on 28 degrees of freedom
## Multiple R-squared:  0.8431, Adjusted R-squared:  0.8263
## F-statistic: 50.17 on 3 and 28 DF,  p-value: 2.184e-11


Multiple Linear Regression: Residual Diagnostics

[Figure: "Distribution of Studentized Residuals": histogram of sresid (-2 to 3) on a Density scale (0.0 to 0.5).]
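The slide shows this figure without its code. A minimal sketch of one way to produce a similar plot follows. It assumes base R's `rstudent()` for the studentized residuals and a standard normal reference curve. The slide's own code may have used a different function.

```r
data(mtcars)
linreg2 <- lm(mpg ~ wt + cyl + hp, data = mtcars)

# Studentized residuals (assumption: base R's rstudent())
sresid <- rstudent(linreg2)

# freq = FALSE puts the histogram on a density scale
hist(sresid, freq = FALSE,
     main = "Distribution of Studentized Residuals")

# Overlay a standard normal curve for comparison
xfit <- seq(min(sresid), max(sresid), length.out = 40)
lines(xfit, dnorm(xfit), col = "blue")
```

If the histogram is far from the curve (strong skew, or values well beyond 3 in absolute value), the normality assumption deserves a closer look.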


Logistic Regression: Fit the Model

# run the logistic regression (outcome has 2 levels)
logreg = glm(am ~ mpg,
             data = mtcars,
             family = binomial(link = "logit"))

summary(logreg)

##
## Call:
## glm(formula = am ~ mpg, family = binomial(link = "logit"), data = mtcars)
##
## Deviance Residuals:
##     Min       1Q   Median       3Q      Max
## -1.5701  -0.7531  -0.4245   0.5866   2.0617
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept)  -6.6035     2.3514  -2.808  0.00498 **
## mpg           0.3070     0.1148   2.673  0.00751 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for binomial family taken to be 1)
##
##     Null deviance: 43.230  on 31  degrees of freedom
## Residual deviance: 29.675  on 30  degrees of freedom
## AIC: 33.675
##
## Number of Fisher Scoring iterations: 5
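The coefficients above are on the log-odds scale, which is hard to read directly. A minimal sketch of converting them to an odds ratio and to predicted probabilities follows (the `new_cars` data frame is hypothetical). With a factor outcome, R models the probability of the second level, here "manual".

```r
data(mtcars)
mtcars$am <- factor(mtcars$am, labels = c("automatic", "manual"))

logreg <- glm(am ~ mpg,
              data = mtcars,
              family = binomial(link = "logit"))

# Odds ratio: multiplicative change in the odds of "manual" per 1 mpg increase
odds_ratio <- exp(coef(logreg)[["mpg"]])
odds_ratio

# Predicted probability of a manual transmission at 15, 20, and 25 mpg
new_cars <- data.frame(mpg = c(15, 20, 25))
prob_manual <- predict(logreg, newdata = new_cars, type = "response")
prob_manual
```

`type = "response"` returns probabilities. Without it, `predict()` returns values on the log-odds scale.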


Logistic Regression: Visualize the Fit

[Figure: "Motor Trend Car Road Tests": Transmission (Automatic vs. Manual, 0.0 to 1.0) plotted against Miles/(US) gallon (10 to 30), with frequency histograms (0 to 10) for each group.]
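The slide shows this figure without its code. A minimal base R sketch that draws the observed 0/1 outcomes and the fitted probability curve is given below. It does not reproduce the group histograms in the slide's figure, and it uses the raw 0/1 coding of `am` so the points can be plotted directly.

```r
data(mtcars)   # am is 0 (automatic) / 1 (manual) in the raw data

logreg <- glm(am ~ mpg,
              data = mtcars,
              family = binomial(link = "logit"))

# Observed outcomes: each car sits at 0 or 1
plot(mtcars$mpg, mtcars$am,
     main = "Motor Trend Car Road Tests",
     xlab = "Miles/(US) gallon",
     ylab = "Probability of manual transmission")

# Fitted probabilities over a fine grid of mpg values
mpg_grid <- seq(min(mtcars$mpg), max(mtcars$mpg), length.out = 100)
fitted_prob <- predict(logreg,
                       newdata = data.frame(mpg = mpg_grid),
                       type = "response")
lines(mpg_grid, fitted_prob, col = "red")
```

The curve rises with mpg because the mpg coefficient is positive.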


Other Generalized Regression Models

# Can do other distributions and links
poisreg = glm(carb ~ hp,
              data = mtcars,
              family = poisson(link = "log"))

summary(poisreg)

##
## Call:
## glm(formula = carb ~ hp, family = poisson(link = "log"), data = mtcars)
##
## Deviance Residuals:
##      Min        1Q    Median        3Q       Max
## -0.86441  -0.55608  -0.07877   0.21395   1.49103
##
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)
## (Intercept) 0.148971   0.265018   0.562    0.574
## hp          0.005517   0.001387   3.977 6.97e-05 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## (Dispersion parameter for poisson family taken to be 1)
##
##     Null deviance: 27.043  on 31  degrees of freedom
## Residual deviance: 12.279  on 30  degrees of freedom
## AIC: 105.64
##
## Number of Fisher Scoring iterations: 4
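With a log link, exponentiating a coefficient gives a multiplicative effect on the expected count. A minimal sketch of that conversion follows, together with a rough dispersion check (residual deviance divided by residual degrees of freedom). The 50 hp step is an arbitrary choice for illustration.

```r
data(mtcars)

poisreg <- glm(carb ~ hp,
               data = mtcars,
               family = poisson(link = "log"))

# Multiplicative change in the expected number of carburetors
rate_ratio_1hp  <- exp(coef(poisreg)[["hp"]])        # per 1 hp
rate_ratio_50hp <- exp(50 * coef(poisreg)[["hp"]])   # per 50 hp
rate_ratio_1hp
rate_ratio_50hp

# Rough dispersion check: values near 1 are in line with the Poisson assumption
dispersion <- deviance(poisreg) / df.residual(poisreg)
dispersion
```

A ratio far from 1 suggests the Poisson variance assumption may not hold, in which case the standard errors should be read with caution.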