An Introduction to Statistical Computing in R

K2I Data Science Boot Camp - Day 1 AM Session

May 15, 2017

Statistical Computing in R May 15, 2017 1 / 55


AM Session Outline

Intro to R Basics

Plotting In R

Data Manipulation


R Basics

Here we will give a quick overview of the R language and the RStudio IDE.

Our emphasis will be to explore the most used features of R, especially those used in later courses.

This won’t cover all the details, but it will cover the most important parts.


Working with RStudio

Before beginning with R let’s orient ourselves with RStudio.


Our initial view of RStudio is:


Go to: File -> New File -> R Script. This gives:


Try It Out

Type the following into the console:

?lm

??linear

plot(1:20, 1:20)


There are several useful shortcut keys in RStudio. A few popular ones:

Ctrl+Enter - When pressed in Editor, sends current line to console.

Ctrl+1, Ctrl+2 - switch between editor and console

Ctrl+Shift+Enter - run entire script in console

tab completion - this is perhaps the most used feature

For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.


It’s important to know our working directory.

Given a file name, R will assume it is located in your current working directory.

R will also save output to the working directory by default.

It is important to set your working directory to the correct location or specify full path names.


Try out the following in the console window:

getwd()

list.files()

To change your working directory go to: Session -> Set Working Directory -> Choose Directory

Alternatively,

setwd("/path/to/directory")
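As a small sketch of the working-directory behaviour described above, the snippet below switches to a temporary directory, writes a file using a bare file name, and then finds it again by full path. The file name `wd_demo.txt` is a made-up example, and `tempdir()` is used only so that nothing in your own folders is touched.

```r
# setwd() returns the previous working directory, so we can restore it later
old.wd <- setwd(tempdir())

# a bare file name is resolved relative to the working directory
writeLines("hello", "wd_demo.txt")
found.in.wd <- file.exists(file.path(getwd(), "wd_demo.txt"))

# a full path works no matter what the working directory is
full.path <- file.path(tempdir(), "wd_demo.txt")
setwd(old.wd)
found.by.full.path <- file.exists(full.path)
```

Building paths with `file.path()` rather than pasting strings avoids having to think about the path separator on each operating system.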


Reading, Writing, Saving, and Loading

Here we’ll look at bringing data into R and getting it out

We’ll also see how to save R objects and environments


Reading In Data

read.table

read.csv

read.fwf

Check out the options for each, e.g. ?read.table


Syntax

?read.table

?read.csv

read.table("/path/to/your/file.ext",

header=TRUE,

sep=",",

stringsAsFactors = FALSE)


Most Common Options

sep tells how fields/variables are separated. Common values are:

"," (comma)

" " (single space)

"\t" (tab escape character)

stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.

header tells whether first line of file has variable names

na.strings tells how missing values are encoded in the file.
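To see the four options working together, here is a minimal sketch that writes a tiny file to a temporary location and reads it back. The data and the `"missing"` code for absent values are invented for illustration. Note also that, as far as I know, the default for `stringsAsFactors` changed from `TRUE` to `FALSE` in R 4.0.0, so setting it explicitly keeps the behaviour the same across versions.

```r
# write a tiny comma-separated file to a temporary location (made-up data)
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("name,score",
             "ann,3.5",
             "bob,missing",
             "cat,4.0"),
           tmp.file)

demo <- read.table(tmp.file,
                   header = TRUE,             # first line holds variable names
                   sep = ",",                 # fields are comma separated
                   na.strings = "missing",    # how this file encodes missing values
                   stringsAsFactors = FALSE)  # keep text columns as character

str(demo)
```

Because `"missing"` is declared as the missing-value code, the `score` column is read as numeric with an `NA` in the second row rather than as text.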


Standard Procedure

Open file in text editor

Check items relevant to options. Header? Separator type?

For big files, Linux tools are helpful: head -n10 BigFile.txt > OpenMe
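The same peek can be done from inside R. The sketch below assumes a semicolon-separated file; the file itself is generated on the spot as a stand-in for a real large file, and its column names are hypothetical.

```r
# build a stand-in "big" file in a temporary location (hypothetical data)
big.file <- tempfile(fileext = ".txt")
writeLines(c("id;value", paste(1:1000, rnorm(1000), sep = ";")), big.file)

# read only the first 5 lines as plain text, without parsing the whole file
first.lines <- readLines(big.file, n = 5)
print(first.lines)

# the peek shows a header row and ';' as the separator, so read a small sample
peek <- read.table(big.file, header = TRUE, sep = ";", nrows = 10)
```

The `nrows` argument of `read.table` limits how many data rows are parsed, which is handy for checking options before committing to a full read.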


Try it Out

Let’s read the ReadMeInX.txt files into R.

Try it on your own before looking at the answer on the next slides.

Example workflow:

1 Set your working directory to the directory containing the files.

2 Examine the files in a text editor to check for common options (header, separator, etc.)


# read.table's default separator is ok for this one

set0 <- read.table("ReadMeIn0.txt",

header=TRUE)

# specify a new separator

set1 <- read.table("ReadMeIn1.txt",

header=TRUE,

sep=',')

# Or use read.csv

set1 <- read.csv("ReadMeIn1.txt",

header=TRUE)


# another change of separator

set2 <- read.table("ReadMeIn2.txt",

header=TRUE,

sep=';')

# check for missing

set3 <- read.table("ReadMeIn3.txt",

header=FALSE,

sep=',',

na.strings = '')


Writing Data

write.table

write.csv


Syntax and Common Options

?write.csv

write.csv(myRObject,

file="/path/to/save/spot/file.csv",

row.names=FALSE)

Options largely the same as their read counterparts

row.names = FALSE is helpful to avoid having 1,2,3,... written out as a variable/column
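A quick round trip makes the effect of `row.names` visible. This is only a sketch with an invented two-row data frame (`demo.df`), written to temporary files.

```r
demo.df <- data.frame(x = c(1.5, 2.5), y = c("a", "b"))

with.rn <- tempfile(fileext = ".csv")
without.rn <- tempfile(fileext = ".csv")

write.csv(demo.df, file = with.rn)                       # default: row.names = TRUE
write.csv(demo.df, file = without.rn, row.names = FALSE)

names(read.csv(with.rn))      # an extra, unnamed first column comes back as "X"
names(read.csv(without.rn))   # just "x" and "y"
```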


Try It Out

Write out one of the files you imported. Try varying options like sep and quote.


Saving Objects

saveRDS/readRDS are used to save (a compressed version of) individual R objects

# save our data set

saveRDS(set1,file="TstObj.rds")

# get it back

newtst <- readRDS("TstObj.rds")

# can save any R object. Try a vector

my.vector <- c(1,8,-100)

saveRDS(my.vector, file="JustAVector.rds")


Saving Environment

We can save all variables in the current R workspace with save.image

We can load in a saved workspace with load

R will ask whether you want to save your workspace when you exit

# Save all our work

save.image("AllMyWork.RData")

# Reload it

load("AllMyWork.RData")

# name given to default save

load(".RData")


The Basics of R

Let’s do a whirlwind tour of R: its syntax and data structures

This won’t cover all the details, but it will cover the most important parts


Basic R Data Types

# numeric types: integer, double

348

# character

"my string"

# logical

TRUE

FALSE

# arithmetic as you'd expect

43 + 1 * 2^4

# so too logical operators/comparison

TRUE | FALSE

1 + 7 != 7

# Other logical operators:

# &, |, !

# <,>,<=,>=, ==, !=
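One detail worth checking for yourself: a number typed without a suffix is stored as a double, not an integer. The short sketch below uses `typeof()` to inspect each basic type and spells out the operator precedence in the arithmetic example.

```r
typeof(348)          # "double": plain numbers are doubles by default
typeof(348L)         # "integer": the L suffix requests an integer
typeof("my string")  # "character"
typeof(TRUE)         # "logical"

# ^ binds tighter than *, which binds tighter than +
43 + 1 * 2^4         # 43 + (1 * 16) = 59
```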


Data Types Cont.

# variable assignment is done with the <- operator

my.number <- 483

# the '.' above does nothing. we could have done:

# mynumber <- 483

# instead

# it's an Rism to use .'s in variable names.

# typeof() tells us the type

typeof(my.number)

## [1] "double"

# we can convert between types

my.int <- as.integer(my.number)

typeof(my.int)

## [1] "integer"

# we can test for types

is.logical(my.int)

## [1] FALSE


R Data Structures - Vectors

# the vector is the most important data structure

# create it with c()

my.vec <- c(1,2,67,-98)

# get some properties

str(my.vec)

## num [1:4] 1 2 67 -98

length(my.vec)

## [1] 4

# access elements with []

my.vec[3]

## [1] 67

my.vec[c(3,4)]

## [1] 67 -98

# can do assignment too

my.vec[5] <- 41.2


Vectors - Cont.

# other ways to create vectors

x <- 1:6

y <- seq(7,12,by=1)

# Operations get recycled through whole vector

x + 1

## [1] 2 3 4 5 6 7

x > 3

## [1] FALSE FALSE FALSE TRUE TRUE TRUE

# Can do component wise operations between vectors

x * y

## [1] 7 16 27 40 55 72

x / y

## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000

y %/% x

## [1] 7 4 3 2 2 2


Try It Out

# Try to guess what the following lines will do

# Will it run at all? If so, what will it give?

# Think about it and run to confirm

7 -> w

w <- z <- 44

1 + TRUE

0 | 15 & 3

my.vec[2:4]

my.vec[-2]

my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]

my.vec[

sum(

c(TRUE,FALSE,FALSE,TRUE,TRUE)

)

] <- TRUE

my.vec[3] <- "I'm a string"

as.numeric(my.vec)

x[x>3]

x + c(1,2)
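Several of the lines above come down to three rules: type coercion, recycling, and the different kinds of index. The sketch below illustrates each rule on fresh, made-up vectors (`v`, `a`) rather than on the exercise objects, so it is a hint and not an answer key.

```r
# coercion: a vector holds one type, so mixed values are promoted
# in the order logical -> numeric -> character
v <- c(1, 2, 3)
v[2] <- TRUE           # TRUE becomes 1; v stays numeric
v.chr <- c(v, "text")  # everything becomes character
typeof(v)              # "double"
typeof(v.chr)          # "character"

# recycling: the shorter vector is repeated to match the longer one
a <- 1:6
a + c(10, 20)          # 11 22 13 24 15 26

# negative indices drop elements; logical indices keep the TRUE positions
a[-2]                  # 1 3 4 5 6
a[a > 3]               # 4 5 6
```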


Matrices

# matrices are 2d vectors.

# create using matrix()

my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)

# rnorm(20) draws 20 random samples from a N(0,1) distribution

my.matrix

## [,1] [,2] [,3] [,4] [,5]

## [1,] 0.5351131 1.08710882 0.5670939 0.2800755 -0.8050743

## [2,] -1.9263838 0.86267009 0.7318280 0.4177110 -0.9576529

## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516 0.3747967

## [4,] -2.6190953 -0.04829205 1.3157181 1.2562005 0.1131199

# note matrices are filled by column

# Get details

dim(my.matrix)

## [1] 4 5

nrow(my.matrix)

## [1] 4

ncol(my.matrix)

## [1] 5


Matrices - Cont.

# Indexing is similar to vectors but with 2 dimensions

# get second row

my.matrix[2,]

## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529

# get first and fourth columns of row three

my.matrix[3,c(1,4)]

## [1] -1.293177 1.978752

# transposing done with t()
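A small deterministic matrix makes column-wise filling and transposition easier to see than random numbers do. The sketch below uses invented matrices `m.col` and `m.row`; the `byrow` and `drop` arguments are standard base R options not covered on the slide.

```r
# fill by column (the default) versus by row
m.col <- matrix(1:6, nrow = 2, ncol = 3)
m.row <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m.col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# t() swaps rows and columns
dim(t(m.col))    # 3 2

# drop = FALSE keeps a single row as a 1 x 3 matrix instead of a vector
m.col[2, , drop = FALSE]
```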


Lists

# lists are similar to vectors but can contain different types

# create with list

my.list <- list("just a string",

44,

my.matrix,

c(TRUE,TRUE,FALSE))

# access items via double brackets [[]]

my.list[[4]]

## [1] TRUE TRUE FALSE

# access multiple items

my.list[1:2]

## [[1]]

## [1] "just a string"

##

## [[2]]

## [1] 44

# list items can be named too

named.list <- list(Item1="my string",

Item2=my.list)

# access of named item is via dollar sign operator

# [[]] also works

c(named.list$Item1,named.list[[1]])

## [1] "my string" "my string"


Putting it together

Let’s practice with R data types by doing PCA on the iris data.

data("iris")

head(iris)

str(iris)

Note iris is a data.frame; under the hood this is simply a list of equal-length columns.


PCA outline

Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)

Center and scale the matrix (Hint: ?scale)

Compute the correlation matrix

$$R = \frac{1}{n-1} X^{T} X$$

Here $X$ is our (centered and scaled) data matrix, $n$ is the number of rows/observations in our data, and $X^{T}$ is the transpose of $X$.

(Hint: t(X) returns the transpose of X, and A %*% B performs matrix multiplication on the matrices A and B)


PCA outline cont.

Obtain the two leading eigenvectors of the correlation matrix R. Denote these as $v_1$, $v_2$. (Hint: ?eigen)

Compute the first and second principal components via

$$z_1 = X v_1$$

$$z_2 = X v_2$$

Produce a scatter plot of $z_1$ vs $z_2$ (Hint: ?plot)

Take a few moments to try it yourself before looking at the answers on the next slides.


PCA from scratch

data("iris")

# get numeric portions of list and make a matrix

X <- as.matrix(iris[1:4])

# center and scale

X <- scale(X,center = TRUE,scale=TRUE)

# get the number of rows

n <- nrow(X)

# compute correlation matrix

R <- (1/(n-1))*t(X)%*%X

# perform eigen decomposition

Reig <- eigen(R)

# get eigen vectors

Reig.vecs <- Reig$vectors

# create principal components

pc1 <- X%*%Reig.vecs[,1]

pc2 <- X%*%Reig.vecs[,2]
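A useful sanity check on the hand-built matrix is to compare it with R's built-in `cor()`. The sketch below recomputes `X`, `n` and `R` so that it runs on its own; the comparison step is an addition, not part of the original solution.

```r
X <- scale(as.matrix(iris[1:4]))   # center and scale each column
n <- nrow(X)
R <- (1 / (n - 1)) * t(X) %*% X

# after scaling, X'X / (n - 1) should match the correlation matrix
all.equal(R, cor(iris[1:4]), check.attributes = FALSE)

# every variable correlates perfectly with itself
diag(R)
```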


PCA from scratch cont.

# compare to R's PCA function

their.pcs <-prcomp(iris[1:4],center = TRUE,scale. = TRUE)

head(their.pcs$x[,1:2])

## PC1 PC2

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053

# our result

head(cbind(pc1,pc2))

## [,1] [,2]

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053
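The two results match exactly here, but that is not guaranteed in general: an eigenvector is only defined up to sign, so on another machine or data set a column of scores may come out multiplied by -1. The sketch below is a hedged way to compare the two approaches that ignores sign; it also checks that the eigenvalues equal the squared standard deviations reported by `prcomp`. The name `our.scores` is introduced here for illustration.

```r
X <- scale(as.matrix(iris[1:4]))
Reig <- eigen(cor(iris[1:4]))
their.pcs <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)

# eigenvalues of the correlation matrix are the variances of the components
all.equal(Reig$values, their.pcs$sdev^2)

# scores agree up to a possible sign flip per component
our.scores <- X %*% Reig$vectors
all.equal(abs(unname(our.scores)), abs(unname(their.pcs$x)))
```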


PCA from scratch cont.

plot(pc1,pc2,col=iris$Species)

[Figure: scatter plot of pc1 (x-axis) versus pc2 (y-axis), with points colored by species]


Factors

# Factors are like vectors, but with predefined allowed values called levels

# Factors are used to represent categorical variables in R

# create a factor

factor1 <- factor(c('Good','Bad','Ugly'))

# find its levels

levels(factor1)

## [1] "Bad" "Good" "Ugly"

# below gives warning, but not error

factor1[4] <- 17

## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated

# see what happened

factor1

## [1] Good Bad Ugly <NA>

## Levels: Bad Good Ugly

factor1[4] <- 'Bad'

# get the breakdown

table(factor1)

## factor1

## Bad Good Ugly

## 2 1 1


Note that in one of our previous examples R filled in the improper factor value with NA

NA is R’s way of specifying missing data

Note that missing data is handled differently from ordinary values, as we will see as we go along.
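A few one-liners show what "handled differently" means in practice. This is a minimal sketch on an invented vector `v`.

```r
v <- c(1, 2, NA, 4)

is.na(v)                  # FALSE FALSE  TRUE FALSE
v == NA                   # all NA: comparing with NA never gives TRUE or FALSE
mean(v)                   # NA: most summaries propagate missing values
mean(v, na.rm = TRUE)     # drop NAs first, then average the rest
sum(is.na(v))             # count the missing values: 1
```

The practical upshot is to test for missing values with `is.na()` rather than with `==`.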


Questions

What will the following lines of code do?

my.matrix[3:4,1:2] <- c(4,5)

my.matrix[4,5] <- 'string'

mf.strings <- c('F','F','M','F')

factor2 <- as.factor(mf.strings)

c(factor1, factor2)

factor1 == 'Ugly'

my.list[[3]][2,]

sum(c(1,2,3,NA))

sum(c(1,2,3,NA),na.rm = TRUE)
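One caution about `c(factor1, factor2)`: to my knowledge its result depends on the R version (R 4.1.0 and later return a factor with the combined levels, while earlier versions returned the underlying integer codes). The sketch below shows a version-independent way to combine two factors, using invented factors `f1` and `f2`, and why integer codes can appear at all.

```r
f1 <- factor(c("Good", "Bad", "Ugly"))
f2 <- factor(c("F", "F", "M"))

# version-independent way to combine: go through character, then re-factor
combined <- factor(c(as.character(f1), as.character(f2)))
levels(combined)

# a factor stores integer codes that index into its levels
as.integer(f1)    # 2 1 3, since the levels sort as "Bad" "Good" "Ugly"
```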


Data Frames

The data.frame is how R represents data sets. They are simply lists, with a few additional restrictions.

# create your own

my.df <- data.frame(

age = c(45,27,19,59,71,13,5),

gender = factor(c('M','M','M','F','M','F','F'))

)

str(my.df)

## 'data.frame': 7 obs. of 2 variables:

## $ age : num 45 27 19 59 71 13 5

## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1


Data Frames - Cont.

Individual variables can be accessed via $ operator

my.df$age

## [1] 45 27 19 59 71 13 5

summary(my.df$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 5.00 16.00 27.00 34.14 52.00 71.00

table(my.df$gender)

##

## F M

## 3 4

# data frames are really just lists

my.df[[2]]

## [1] M M M F M F F

## Levels: F M
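The claim that data frames are lists can be checked directly. The sketch below rebuilds `my.df` so it runs on its own, then applies ordinary list tools to it; the main restriction it demonstrates is that all columns share one length.

```r
my.df <- data.frame(
  age = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c('M', 'M', 'M', 'F', 'M', 'F', 'F'))
)

is.list(my.df)          # TRUE
length(my.df)           # 2: the number of columns, as for a list
names(my.df)            # "age" "gender"

# the key restriction: every column has the same length
sapply(my.df, length)   # 7 7

# list-style and data-frame-style access reach the same column
identical(my.df[["age"]], my.df$age)
```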


Data Frames - Cont.

# data.frames can be subsetted like matrices

my.df[1:3,c("age")]

## [1] 45 27 19

# logical subsetting is especially useful for data.frames

# get ages over 40

age.logic <- my.df$age > 40

# take a subset of these rows

my.df[age.logic,]

## age gender

## 1 45 M

## 4 59 F

## 5 71 M

# create a new variable age.sq

my.df$age.sq <- my.df$age^2
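Logical subsetting extends naturally to several conditions at once. The sketch below rebuilds `my.df`, combines two conditions with `&`, and shows two base R conveniences not on the slide: `drop = FALSE` and `subset()`. The result names (`older.men`, `age.vec`, `age.df`) are made up for the example.

```r
my.df <- data.frame(
  age = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c('M', 'M', 'M', 'F', 'M', 'F', 'F'))
)

# combine conditions with & (and) or | (or)
older.men <- my.df[my.df$age > 40 & my.df$gender == 'M', ]

# selecting one column returns a vector by default;
# drop = FALSE keeps the result as a data frame
age.vec <- my.df[1:3, "age"]
age.df  <- my.df[1:3, "age", drop = FALSE]

# subset() is a base R shorthand for the same kind of row filtering
older.men2 <- subset(my.df, age > 40 & gender == 'M')
```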


Try It Out

Let’s use R’s internal iris data set to practice with data frames

my.iris <- iris

my.iris

1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.

2 Use subsetting and R’s mean function to find the average Length.Sum of setosa species


my.iris$Length.Sum = my.iris$Sepal.Length +

my.iris$Petal.Length

my.iris$Width.Sum = my.iris$Sepal.Width +

my.iris$Petal.Width

setosa.inds <- my.iris$Species == 'setosa'

mean(my.iris[setosa.inds,]$Length.Sum)

## [1] 6.468
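Subsetting answers the question for one species; if you want the same mean for every species at once, base R's `tapply()` is one option. This is an alternative sketch, not the slide's method, and `species.means` is a name introduced here.

```r
my.iris <- iris
my.iris$Length.Sum <- my.iris$Sepal.Length + my.iris$Petal.Length

# mean of Length.Sum within each level of Species
species.means <- tapply(my.iris$Length.Sum, my.iris$Species, mean)
species.means

# pull out a single group by name
species.means[["setosa"]]
```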


Control Structures

R has all the typical control structures:

if-else statements

for loops

while loops


Syntax

if(logical_expression){
  execute_code
} else{
  execute_other_code
}

for(value in sequence){
  work_with_value
}

while(expression_is_true){
  execute_code
}
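To make the templates concrete, here is one small runnable instance of each structure. The variable names and values are invented purely for illustration.

```r
# if-else: choose a label depending on a condition
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for: visit each value of a sequence in turn
total <- 0
for (value in 1:5) {
  total <- total + value     # 1 + 2 + 3 + 4 + 5
}

# while: repeat for as long as the condition holds
count <- 0
while (2^count < 100) {
  count <- count + 1         # stops at the first power of 2 that is >= 100
}
```

Note that `else` must sit on the same line as the closing brace of the `if` block when typed at the top level of the console; otherwise R treats the `if` statement as finished.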


Functions

Defining functions in R is easy

# use the function keyword with assignment <-
my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # the value of the last expression gets returned
  # (invisibly here, because the last expression is an assignment)
  return.me <- sum / length(input.vector)
}
my.mean(1:10)


Functions cont.

my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # returns 1 now
  return.me <- sum / length(input.vector)
  1
}
my.mean(1:10)

## [1] 1
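The rule behind both versions is that a function returns the value of the last expression it evaluates. The sketch below contrasts an explicit `return()` with an assignment as the last expression; the function names `mean.explicit` and `mean.invisible` are hypothetical.

```r
# explicit return(): the clearest way to say what comes back
mean.explicit <- function(input.vector) {
  total <- sum(input.vector)
  return(total / length(input.vector))
}

# last expression is an assignment: the value is returned, but invisibly
mean.invisible <- function(input.vector) {
  result <- sum(input.vector) / length(input.vector)
}

mean.explicit(1:10)         # prints 5.5
mean.invisible(1:10)        # prints nothing at the console
y <- mean.invisible(1:10)   # the value is still there
y                           # 5.5
```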


Try It Out

Create a function my.summary which inputs a vector, x, calculates the mean, standard deviation, max, and min of x, and returns these in a list

Try out R’s internal functions mean, sd, max, min


my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}
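A short usage sketch shows how the returned list can be picked apart. The function is repeated so the block runs on its own, and the input vector is invented.

```r
my.summary <- function(x) {
  list(mean = mean(x), sd = sd(x), max = max(x), min = min(x))
}

res <- my.summary(c(2, 4, 4, 6))

res$mean       # 4
res$max        # 6
res[["min"]]   # 2
str(res)       # compact overview of all four entries
```

Returning a named list means callers can ask for `res$sd` by name instead of remembering the position of each statistic.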


Try It Out cont.

Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum.

Hint: Use is.numeric to test each variable before applying my.summary


for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}
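The same task can also be written without an explicit loop. The sketch below is an alternative using `sapply()`, applied to the built-in `iris` columns directly; it is offered as a comparison, not as the intended answer, and the names `numeric.cols` and `col.maxima` are introduced here.

```r
# which columns are numeric? (TRUE for the four measurements, FALSE for Species)
numeric.cols <- sapply(iris, is.numeric)

# apply max to each numeric column; the result is a named vector
col.maxima <- sapply(iris[numeric.cols], max)
col.maxima
```

Because the result keeps the column names, it is easier to tell which maximum belongs to which variable than with bare `print()` calls inside a loop.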
