An Introduction to Statistical Computing in R

K2I Data Science Boot Camp - Day 1 AM Session

May 15, 2017

Statistical Computing in R May 15, 2017 1 / 55

AM Session Outline

Intro to R Basics

Plotting In R

Data Manipulation

R Basics

Here we will give a quick overview of the R language and the RStudio IDE.

Our emphasis will be to explore the most used features of R, especially those used in later courses.

This won’t cover all the details, but it will cover the most important parts.

Working with RStudio

Before beginning with R let’s orient ourselves with RStudio.

Our initial view of RStudio is:

Go to: File -> New File -> R Script. This gives:

Try It Out

Type the following into the console:

?lm

??linear

plot(1:20, 1:20)

There are several useful shortcut keys in RStudio. A few popular ones:

Ctrl+Enter - When pressed in Editor, sends current line to console.

Ctrl+1, Ctrl+2 - switch between editor and console

Ctrl+Shift+Enter - run entire script in console

tab completion - this is perhaps the most used feature

For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.

It’s important to know our working directory.

Given a file name, R will assume it is located in your current working directory.

R will also save output to the working directory by default.

It is important to set your working directory to the correct location or specify full path names.

Try out the following in the console window:

getwd()

list.files()

To change your working directory go to: Session -> Set Working Directory -> Choose Directory

Alternatively,

setwd("/path/to/directory")
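
As a small illustration of how relative file names depend on the working directory, the sketch below switches to a temporary directory, writes a file there, and then switches back. The file name example.txt and the variable names are made up for this illustration.

```r
# remember where we started so we can go back
old.wd <- getwd()

# tempdir() is a scratch directory that R creates for each session
scratch <- tempdir()
setwd(scratch)

# a bare file name is resolved relative to the working directory
writeLines("hello", "example.txt")
found.relative <- file.exists("example.txt")

# restore the original working directory
setwd(old.wd)

# a full path works no matter what the working directory is
found.full <- file.exists(file.path(scratch, "example.txt"))
```

Saving the old directory first and restoring it afterwards keeps the rest of a script from silently reading and writing in the wrong place.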

Reading, Writing, Saving, and Loading

Here we’ll look at bringing data into R and getting it out

We’ll also see how to save R objects and environments

Reading In Data

read.table

read.csv

read.fwf

Check out the options for each, e.g. ?read.table

Syntax

?read.table

?read.csv

read.table("/path/to/your/file.ext",

header=TRUE,

sep=",",

stringsAsFactors = FALSE)

Most Common Options

sep tells how fields/variables are separated. Common values are:

"," (comma)

" " (single space)

"\t" (tab escape character)

stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.

header tells whether the first line of the file has variable names

na.strings tells how missing values are encoded in the file.
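
To see these options working together, here is a minimal sketch that writes a tiny semicolon-separated file to a temporary location and reads it back. The file contents, the column names, and the word "missing" used as the missing-value code are all invented for this example.

```r
# write a small semicolon-separated file with a header line
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("id;name;score",
             "1;Ann;3.5",
             "2;Bob;missing",
             "3;Cy;4.0"),
           tmp.file)

# header: first line holds variable names
# sep: fields are separated by semicolons
# na.strings: the word "missing" marks a missing value
# stringsAsFactors: keep the names as plain character strings
demo <- read.table(tmp.file,
                   header = TRUE,
                   sep = ";",
                   na.strings = "missing",
                   stringsAsFactors = FALSE)

str(demo)
```

Because "missing" is declared in na.strings, the score column is read as numeric with one NA rather than as text.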

Standard Procedure

Open file in text editor

Check items relevant to options. Header? Separator type?

For big files, Linux tools are helpful: head -n10 BigFile.txt > OpenMe

Try it Out

Let’s read the ReadMeInX.txt files into R.

Try it on your own before looking at the answer on the next slides.

Example workflow:

1 Set your working directory to the directory containing the files.

2 Examine the files in a text editor to check for common options (header, separator, etc.)

# read.table's default separator is ok for this one

set0 <- read.table("ReadMeIn0.txt",

header=TRUE)

# specify new separator

set1 <- read.table("ReadMeIn1.txt",

header=TRUE,

sep=',')

# Or use read.csv

set1 <- read.csv("ReadMeIn1.txt",

header=TRUE)

# another change of separator

set2 <- read.table("ReadMeIn2.txt",

header=TRUE,

sep=';')

# check for missing

set3 <- read.table("ReadMeIn3.txt",

header=FALSE,

sep=',',

na.strings = '')

Writing Data

write.table

write.csv

Syntax and Common Options

?write.csv

write.csv(myRObject,

file="/path/to/save/spot/file.csv",

row.names=FALSE)

Options largely the same as their read counterparts

row.names = FALSE is helpful to avoid having 1,2,3,... as a variable/column
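
The effect of row.names is easiest to see in a round trip. This sketch uses a made-up three-row data frame and temporary files, writes it both ways, and reads each file back.

```r
small.df <- data.frame(x = c(1, 2, 3), y = c("a", "b", "c"))

with.rows <- tempfile(fileext = ".csv")
no.rows <- tempfile(fileext = ".csv")

# default: row names are written as an extra, unnamed first column
write.csv(small.df, file = with.rows)

# row.names = FALSE: only the real variables are written
write.csv(small.df, file = no.rows, row.names = FALSE)

back.with.rows <- read.csv(with.rows)
back.no.rows <- read.csv(no.rows)

ncol(back.with.rows)   # one more column than we started with
names(back.no.rows)    # just x and y
```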

Try It Out

Write out one of the files you imported. Try varying options like sep and quote (use write.table to change sep; write.csv ignores it with a warning).

Saving Objects

saveRDS/readRDS are used to save and restore (a compressed version of) individual R objects

# save our data set

saveRDS(set1,file="TstObj.rds")

# get it back

newtst <- readRDS("TstObj.rds")

# can save any R object. Try a vector

my.vector <- c(1,8,-100)

saveRDS(my.vector, file="JustAVector.rds")
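
A quick way to convince yourself that nothing is lost is to compare the restored object with the original. The sketch below does this for a small made-up list, using a temporary file so it does not clutter the working directory.

```r
rds.path <- tempfile(fileext = ".rds")

# any R object can be saved, here a list mixing numbers and text
original <- list(numbers = c(1, 8, -100), label = "demo")
saveRDS(original, file = rds.path)

# readRDS returns the object, so we choose the name it gets
restored <- readRDS(rds.path)

same.object <- identical(original, restored)
same.object
```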

Saving Environment

We can save all variables in the current R workspace with save.image

We can load in a saved workspace with load

R will ask you to save your workspace when you exit

# Save all our work

save.image("AllMyWork.RData")

# Reload it

load("AllMyWork.RData")

# name given to default save

load(".RData")

The Basics of R

Let’s do a whirlwind tour of R: its syntax and data structures

This won’t cover all the details, but it will cover the most important parts

Basic R Data Types

# numeric types: integer, double

348

# character

"my string"

# logical

TRUE

FALSE

# arithmetic as you'd expect

43 + 1 * 2^4

# so too logical operators/comparison

TRUE | FALSE

1 + 7 != 7

# Other logical operators:

# &, |, !

# <,>,<=,>=, ==, !=
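
The short sketch below checks two details of the lines above: which numeric type a plain number gets, and the order in which the operators are applied. The variable names are only for illustration.

```r
# a plain number is stored as a double
type.plain <- typeof(348)

# the L suffix asks for an integer instead
type.suffix <- typeof(348L)

# ^ binds tightest, then *, then +: 43 + (1 * 16)
arith <- 43 + 1 * 2^4

# arithmetic is done before comparison: (1 + 7) != 7
cmp <- 1 + 7 != 7
```

Here type.plain is "double", type.suffix is "integer", arith is 59, and cmp is TRUE.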

Data Types Cont.

# variable assignment is done with the <- operator

my.number <- 483

# the '.' above does nothing. we could have done:

# mynumber <- 483

# instead

# it's an R-ism to use .'s in variable names.

# typeof() tells us the type

typeof(my.number)

## [1] "double"

# we can convert between types

my.int <- as.integer(my.number)

typeof(my.int)

## [1] "integer"

# we can test for types

is.logical(my.int)

## [1] FALSE

R Data Structures - Vectors

# the vector is the most important data structure

# create it with c()

my.vec <- c(1,2,67,-98)

# get some properties

str(my.vec)

## num [1:4] 1 2 67 -98

length(my.vec)

## [1] 4

# access elements with []

my.vec[3]

## [1] 67

my.vec[c(3,4)]

## [1] 67 -98

# can do assignment too

my.vec[5] <- 41.2
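
Assigning past the end of a vector makes it grow. The sketch below, using a copy named v so that my.vec is left alone, shows what happens when the new index leaves a gap.

```r
v <- c(1, 2, 67, -98)

# assigning to position 5 extends the vector by one
v[5] <- 41.2
len.after.first <- length(v)

# assigning to position 8 fills the skipped positions 6 and 7 with NA
v[8] <- 0
len.after.second <- length(v)
gap <- v[6:7]
```

After these lines len.after.first is 5, len.after.second is 8, and gap holds two NA values.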

Vectors - Cont.

# other ways to create vectors

x <- 1:6

y <- seq(7,12,by=1)

# Operations get recycled through whole vector

x + 1

## [1] 2 3 4 5 6 7

x > 3

## [1] FALSE FALSE FALSE TRUE TRUE TRUE

# Can do component wise operations between vectors

x * y

## [1] 7 16 27 40 55 72

x / y

## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000

y %/% x

## [1] 7 4 3 2 2 2
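
The %/% operator above is integer division. Its companion %% gives the remainder, and together they reconstruct the original values. A small check, with x and y defined as on this slide:

```r
x <- 1:6
y <- seq(7, 12, by = 1)

quotient <- y %/% x    # 7 4 3 2 2 2
remainder <- y %% x    # 0 0 0 2 1 0

# quotient * divisor + remainder gives back y, element by element
rebuilt <- quotient * x + remainder
all(rebuilt == y)
```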

Try It Out

# Try to guess what the following lines will do

# Will it run at all? If so, what will it give?

# Think about it and run to confirm

7 -> w

w <- z <- 44

1 + TRUE

0 | 15 & 3

my.vec[2:4]

my.vec[-2]

my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]

my.vec[

sum(

c(TRUE,FALSE,FALSE,TRUE,TRUE)

)

] <- TRUE

my.vec[3] <- "I'm a string"

as.numeric(my.vec)

x[x>3]

x + c(1,2)

Matrices

# matrices are 2d vectors.

# create using matrix()

my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)

# rnorm(20) draws 20 random samples from a N(0,1) distribution

my.matrix

## [,1] [,2] [,3] [,4] [,5]

## [1,] 0.5351131 1.08710882 0.5670939 0.2800755 -0.8050743

## [2,] -1.9263838 0.86267009 0.7318280 0.4177110 -0.9576529

## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516 0.3747967

## [4,] -2.6190953 -0.04829205 1.3157181 1.2562005 0.1131199

# note matrices are filled by column

# Get details

dim(my.matrix)

## [1] 4 5

nrow(my.matrix)

## [1] 4

ncol(my.matrix)

## [1] 5
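
Column-wise filling is hard to see with random numbers, so here is a sketch with the integers 1 to 6. The byrow argument of matrix() switches to row-wise filling.

```r
# default: values run down the first column, then the second, ...
by.col <- matrix(1:6, nrow = 2)
by.col
##      [,1] [,2] [,3]
## [1,]    1    3    5
## [2,]    2    4    6

# byrow = TRUE: values run across the first row, then the second
by.row <- matrix(1:6, nrow = 2, byrow = TRUE)
by.row
##      [,1] [,2] [,3]
## [1,]    1    2    3
## [2,]    4    5    6
```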

Matrices - Cont.

# Indexing is similar to vectors but with 2 dimensions

# get second row

my.matrix[2,]

## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529

# get first and fourth columns of row three

my.matrix[3,c(1,4)]

## [1] -1.293177 1.978752

# transposing done with t()

Lists

# lists are similar to vectors but can contain different types

# create with list

my.list <- list("just a string",

44,

my.matrix,

c(TRUE,TRUE,FALSE))

# access items via double brackets [[]]

my.list[[4]]

## [1] TRUE TRUE FALSE

# access multiple items

my.list[1:2]

## [[1]]

## [1] "just a string"

##

## [[2]]

## [1] 44

# list items can be named too

named.list <- list(Item1="my string",

Item2=my.list)

# access of named item is via dollar sign operator

# [[]] also works

c(named.list$Item1,named.list[[1]])

## [1] "my string" "my string"
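
The difference between single and double brackets is a common stumbling block, so here is a small sketch with a made-up two-item list: single brackets return a shorter list, double brackets return the item itself.

```r
demo.list <- list(name = "just a string", value = 44)

# single brackets: still a list, just with fewer items
one.bracket <- demo.list[1]
class(one.bracket)    # "list"

# double brackets: the item stored inside the list
two.bracket <- demo.list[[1]]
class(two.bracket)    # "character"

# so arithmetic needs [[ ]] or $, not [ ]
plus.one <- demo.list[["value"]] + 1
```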

Putting it together

Let’s practice with R data types by doing PCA on the iris data.

data("iris")

head(iris)

str(iris)

Note iris has the data.frame data type; a data.frame is simply a list of columns.

PCA outline

Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)

Center and scale the matrix (Hint: ?scale)

Compute the correlation matrix

$$R = \frac{1}{n-1} X^{T} X$$

Here $X$ is our (centered and scaled) data matrix, $n$ is the number of rows/observations in our data, and $X^{T}$ is the transpose of $X$.

(Hint: t(X) computes the transpose and A%*%B performs matrix multiplication on the matrices A and B)

PCA outline cont.

Obtain the two leading eigenvectors of the correlation matrix $R$. Denote these as $v_1$ and $v_2$. (Hint: ?eigen)

Compute the first and second principal components via

$$z_1 = X v_1, \qquad z_2 = X v_2$$

Produce a scatter plot of $z_1$ vs $z_2$ (Hint: ?plot)

Take a few moments to try it yourself before looking at the answers on the next slides.

PCA from scratch

data("iris")

# get numeric portions of list and make a matrix

X <- as.matrix(iris[1:4])

# center and scale

X <- scale(X,center = TRUE,scale=TRUE)

# get the number of rows

n <- nrow(X)

# compute correlation matrix

R <- (1/(n-1))*t(X)%*%X

# perform eigen decomposition

Reig <- eigen(R)

# get eigen vectors

Reig.vecs <- Reig$vectors

# create principal components

pc1 <- X%*%Reig.vecs[,1]

pc2 <- X%*%Reig.vecs[,2]
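
One way to sanity-check the matrix R computed above is to compare it with R's built-in cor function. The sketch below repeats the construction and adds that check. It also uses the fact that the eigenvalues of a correlation matrix sum to the number of variables, here 4. The names matches.cor and eig.sum are made up for this example.

```r
X <- scale(as.matrix(iris[1:4]), center = TRUE, scale = TRUE)
n <- nrow(X)
R <- (1/(n-1)) * t(X) %*% X

# the hand-built matrix should agree with cor() up to rounding error
matches.cor <- isTRUE(all.equal(R, cor(iris[1:4]),
                                check.attributes = FALSE))

# the eigenvalues add up to the trace of R, which is 4 here
eig.sum <- sum(eigen(R)$values)
```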

PCA from scratch cont.

# compare to R's PCA function

their.pcs <- prcomp(iris[1:4],center = TRUE,scale. = TRUE)

head(their.pcs$x[,1:2])

## PC1 PC2

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053

# our result

head(cbind(pc1,pc2))

## [,1] [,2]

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053
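
Rather than comparing the first few rows by eye, the two results can be compared numerically. Eigenvectors are only determined up to sign, so on another machine a column could come out with every sign flipped. The sketch below therefore compares absolute values. It rebuilds both results so it can be run on its own, and the name same.up.to.sign is made up.

```r
# our version, as on the previous slides
X <- scale(as.matrix(iris[1:4]))
R <- crossprod(X) / (nrow(X) - 1)    # same as t(X) %*% X / (n - 1)
ours <- X %*% eigen(R)$vectors[, 1:2]

# R's version
theirs <- prcomp(iris[1:4], center = TRUE, scale. = TRUE)$x[, 1:2]

# compare all 150 rows, ignoring column names and possible sign flips
same.up.to.sign <- isTRUE(all.equal(abs(ours), abs(theirs),
                                    check.attributes = FALSE))
same.up.to.sign
```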

PCA from scratch cont.

plot(pc1,pc2,col=iris$Species)

[Figure: scatter plot of pc1 (x-axis) against pc2 (y-axis), with points colored by iris species]

Factors

# Factors are like vectors, but with predefined allowed values called levels

# Factors are used to represent categorical variables in R

# create a factor

factor1 <- factor(c('Good','Bad','Ugly'))

# find its levels

levels(factor1)

## [1] "Bad" "Good" "Ugly"

# below gives warning, but not error

factor1[4] <- 17

## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated

# see what happened

factor1

## [1] Good Bad Ugly <NA>

## Levels: Bad Good Ugly

factor1[4] <- 'Bad'

# get the breakdown

table(factor1)

## factor1

## Bad Good Ugly

## 2 1 1
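
By default the levels are sorted alphabetically, as seen above. The sketch below, with a made-up factor called ratings, shows how to choose the level order yourself and how to add a new level so that assigning it no longer produces NA.

```r
# set the level order explicitly instead of alphabetically
ratings <- factor(c("Good", "Bad", "Ugly"),
                  levels = c("Bad", "Ugly", "Good"))
levels(ratings)

# extend the allowed values before assigning a new one
levels(ratings) <- c(levels(ratings), "Great")
ratings[4] <- "Great"

# internally a factor stores the position of each value in levels()
codes <- as.integer(ratings)    # 3 1 2 4
```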

Note that in one of our previous examples, R filled in the improper factor value with NA

NA is R’s way of specifying missing data

Note that missing data is handled differently than ordinary values, as we will see as we go along.
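
A first taste of how NA behaves differently, using a made-up vector: comparisons and summaries involving NA give NA back unless we ask for the missing values to be dropped.

```r
ages <- c(4, NA, 6)

# is.na() is the way to test for missing values
missing.flags <- is.na(ages)     # FALSE TRUE FALSE

# a comparison with NA is itself NA, not TRUE or FALSE
over.five <- ages > 5            # FALSE NA TRUE

# summaries are NA by default ...
plain.mean <- mean(ages)         # NA

# ... unless missing values are removed first
clean.mean <- mean(ages, na.rm = TRUE)   # 5
```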

Questions

What will the following lines of code do?

my.matrix[3:4,1:2] <- c(4,5)

my.matrix[4,5] <- 'string'

mf.strings <- c('F','F','M','F')

factor2 <- as.factor(mf.strings)

c(factor1, factor2)

factor1 == 'Ugly'

my.list[[3]][2,]

sum(c(1,2,3,NA))

sum(c(1,2,3,NA),na.rm = TRUE)

Data Frames

The data.frame is how R represents data sets. Data frames are simply lists, with a few additional restrictions (for example, every column must have the same length).

# create your own

my.df <- data.frame(

age = c(45,27,19,59,71,13,5),

gender = factor(c('M','M','M','F','M','F','F'))

)

str(my.df)

## 'data.frame': 7 obs. of 2 variables:

## $ age : num 45 27 19 59 71 13 5

## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1

Data Frames - Cont.

Individual variables can be accessed via $ operator

my.df$age

## [1] 45 27 19 59 71 13 5

summary(my.df$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 5.00 16.00 27.00 34.14 52.00 71.00

table(my.df$gender)

##

## F M

## 3 4

# data frames are really just lists

my.df[[2]]

## [1] M M M F M F F

## Levels: F M

Data Frames - Cont.

# data.frames can be subsetted like matrices

my.df[1:3,c("age")]

## [1] 45 27 19

# logical subsetting is especially useful for data.frames

# get ages over 40

age.logic <- my.df$age > 40

# take a subset of these rows

my.df[age.logic,]

## age gender

## 1 45 M

## 4 59 F

## 5 71 M

# create a new variable age.sq

my.df$age.sq <- my.df$age^2
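
Logical conditions can be combined with & and | before subsetting. The sketch below rebuilds my.df so it runs on its own, picks out the men over 40, and shows the base R function subset as an equivalent shortcut. The names older.men and via.subset are made up.

```r
my.df <- data.frame(
  age = c(45, 27, 19, 59, 71, 13, 5),
  gender = factor(c('M', 'M', 'M', 'F', 'M', 'F', 'F'))
)

# both conditions must hold for a row to be kept
older.men <- my.df[my.df$age > 40 & my.df$gender == "M", ]
older.men
##   age gender
## 1  45      M
## 5  71      M

# subset() lets us refer to the columns by name directly
via.subset <- subset(my.df, age > 40 & gender == "M")
```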

Try It Out

Let’s use R’s internal iris data set to practice with data frames

my.iris <- iris

my.iris

1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.

2 Use subsetting and R’s mean function to find the average Length.Sum of the setosa species

my.iris$Length.Sum <- my.iris$Sepal.Length +
  my.iris$Petal.Length
my.iris$Width.Sum <- my.iris$Sepal.Width +
  my.iris$Petal.Width

setosa.inds <- my.iris$Species == 'setosa'

mean(my.iris[setosa.inds,]$Length.Sum)

## [1] 6.468
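
The same average can be obtained for every species at once with tapply, which applies a function to groups defined by a factor. This is offered as an alternative sketch, not as the intended solution to the exercise. The name group.means is made up.

```r
my.iris <- iris
my.iris$Length.Sum <- my.iris$Sepal.Length + my.iris$Petal.Length

# mean of Length.Sum within each level of Species
group.means <- tapply(my.iris$Length.Sum, my.iris$Species, mean)

# the setosa entry matches the subsetting answer, 6.468
setosa.mean <- unname(group.means["setosa"])
```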

Control Structures

R has all the typical control structures:

if-else statements

for loops

while loops

Syntax

if(logical_expression){
  execute_code
} else{
  execute_other_code
}

for(value in sequence){
  work_with_value
}

while(expression_is_true){
  execute_code
}
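
To make the templates above concrete, here is a small sketch with made-up values: a for loop with an if-else inside that counts even and odd numbers, and a while loop that counts how many times 100 can be halved before it drops to 1 or below.

```r
values <- c(3, 8, 15, 4)
n.even <- 0
n.odd <- 0

# visit each value and branch on whether it is even
for (v in values) {
  if (v %% 2 == 0) {
    n.even <- n.even + 1
  } else {
    n.odd <- n.odd + 1
  }
}

# keep halving until the condition is no longer true
x <- 100
steps <- 0
while (x > 1) {
  x <- x / 2
  steps <- steps + 1
}
```

Afterwards n.even and n.odd are both 2, and steps is 7.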

Functions

Defining functions in R is easy

# use the function keyword with assignment <-
my.mean <- function(input.vector){
  sum <- 0
  for(val in input.vector) {
    sum <- sum + val
  }
  # the value of the last expression gets returned
  # (invisibly here, because the last expression is an assignment)
  return.me <- sum / length(input.vector)
}

my.mean(1:10)

Functions cont.

my.mean <- function(input.vector){
  sum <- 0
  for(val in input.vector) {
    sum <- sum + val
  }
  # returns 1 now
  return.me <- sum / length(input.vector)
  1
}

my.mean(1:10)

## [1] 1
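
Since only the last expression is returned, a common pattern is to end the function with the value itself, and to use return() when leaving early. The variant below is a sketch of that pattern, not the version from the slides. The empty-input check and the name total are additions for illustration.

```r
my.mean <- function(input.vector) {
  # return() exits the function immediately
  if (length(input.vector) == 0) {
    return(NA_real_)
  }
  total <- 0
  for (val in input.vector) {
    total <- total + val
  }
  # last expression: this is what the caller receives
  total / length(input.vector)
}

full <- my.mean(1:10)          # 5.5
empty <- my.mean(numeric(0))   # NA
```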

Try It Out

Create a function my.summary which takes a vector x as input, calculates the mean, standard deviation, max, and min of x, and returns these in a list

Try out R’s internal functions mean, sd, max, min

my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}
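
A quick usage sketch, with the function repeated so the example runs on its own and a made-up input vector. Because the list items are named, each result can be pulled out with the $ operator.

```r
my.summary <- function(x) {
  list(mean = mean(x), sd = sd(x), max = max(x), min = min(x))
}

s <- my.summary(c(1, 2, 3, 4, 5))

s$mean   # 3
s$max    # 5
s$min    # 1

# sd() is the sample standard deviation, here sqrt(2.5)
s$sd
```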

Try It Out cont.

Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum.

Hint: Use is.numeric to test each variable before applying my.summary

for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}
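
Loops like this are often written more compactly with R's apply-style functions. The sketch below is an alternative, not the solution from the slides: it keeps the numeric columns with Filter and takes the maximum of each with sapply. It works on the original iris data rather than my.iris, so the two added sum columns are not included.

```r
# keep only the columns for which is.numeric() is TRUE
numeric.cols <- Filter(is.numeric, iris)

# apply max to each remaining column, giving a named vector
col.max <- sapply(numeric.cols, max)
col.max
```

The result has one entry per numeric column, named after the column, so the Species factor is skipped automatically.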
