
An Introduction to Statistical Computing in R

K2I Data Science Boot Camp - Day 1 AM Session

May 15, 2017


AM Session Outline

Intro to R Basics

Plotting In R

Data Manipulation

R Basics

Here we will give a quick overview of the R language and the RStudio IDE.

Our emphasis will be to explore the most used features of R, especially those used in later courses.

This won’t cover all the details, but will cover the most important parts.

Working with RStudio

Before beginning with R let’s orient ourselves with RStudio.

Our initial view of RStudio is:

Go to: File -> New File -> R Script. This gives:

Try It Out

Type the following into the console

??linear

plot(1:20, 1:20)

There are several useful shortcut keys in RStudio. A few popular ones:

Ctrl+Enter - When pressed in Editor, sends current line to console.

Ctrl+1, Ctrl+2 - switch between editor and console

Ctrl+Shift+Enter - run entire script in console

tab completion - this is perhaps the most used feature

For vim/emacs users, Tools -> Global Options -> Code -> Keybindings will give you your preferred bindings.

It’s important to know our working directory.

Given a file name, R will assume it is located in your current working directory.

R will also save output to the working directory by default.

It is important to set your working directory to the correct location or specify full path names.

Try out the following in the console window:

getwd()

list.files()

To change your working directory go to: Session -> Set Working Directory -> Choose Directory

Alternatively,

setwd("/path/to/directory")

Reading, Writing, Saving, and Loading

Here we’ll look at bringing data into R and getting it out

We’ll also see how to save R objects and environments

Reading In Data

read.table

read.csv

read.fwf

Check out the options for each with ?read.table

Syntax

?read.table

?read.csv

read.table("/path/to/your/file.ext",

header=TRUE,

sep=",",

stringsAsFactors = FALSE)

Most Common Options

sep tells how fields/variables are separated. Common values are:

"," (comma)

" " (single space)

"\t" (tab escape character)

stringsAsFactors tells whether to treat non-numeric values as factor/categorical variables.

header tells whether first line of file has variable names

na.strings tells how missing values are encoded in the file.
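The options above can be tried out without any course files. Below is a minimal sketch that writes a small made-up file to a temporary location and reads it back; the file contents and column names are invented for illustration.

```r
# Write a small semicolon-separated file with a header and one empty field
# (contents are made up for illustration)
tmp <- tempfile(fileext = ".txt")
writeLines(c("id;score;group",
             "1;3.5;a",
             "2;;b",
             "3;4.0;a"), tmp)

dat <- read.table(tmp,
                  header = TRUE,            # first line holds variable names
                  sep = ";",                # fields are separated by semicolons
                  na.strings = "",          # empty fields are missing values
                  stringsAsFactors = FALSE) # keep text columns as character

str(dat)
is.na(dat$score)   # the empty field in row 2 was read as NA
```

Changing sep to "," here would make R read each line as a single field, which is a quick way to see why the separator has to match the file.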

Standard Procedure

Open file in text editor

Check items relevant to options. Header? Separator type?

For big files, Linux tools are helpful: head -n10 BigFile.txt > OpenMe

Try it Out

Let’s read the ReadMeInX.txt files into R.

Try it on your own before looking at the answer on the next slides.

Example workflow:

1 Set your working directory to the directory containing the files.

2 Examine the files in a text editor to check for common options (header, separator, etc.)

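# read.table's default separator is fine for this one
set0 <- read.table("ReadMeIn0.txt",
                   header=TRUE)

# specify a new separator
set1 <- read.table("ReadMeIn1.txt",
                   header=TRUE,
                   sep=',')

# Or use read.csv
set1 <- read.csv("ReadMeIn1.txt",
                 header=TRUE)

# another change of separator
# (file name assumed from the ReadMeInX.txt pattern)
set2 <- read.table("ReadMeIn2.txt",
                   header=TRUE,
                   sep=';')

# check for missing values
# (file name assumed from the ReadMeInX.txt pattern)
set3 <- read.table("ReadMeIn3.txt",
                   header=FALSE,
                   sep=',',
                   na.strings = '')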

Writing Data

write.table

write.csv

Syntax and Common Options

?write.csv

write.csv(myRObject,

file="/path/to/save/spot/file.csv",

row.names=FALSE)

Options largely the same as their read counterparts

row.names = FALSE is helpful to avoid having 1,2,3,... written out as a variable/column
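To see the effect of row.names, here is a small sketch that writes the same made-up data frame twice to temporary files and compares what comes back. The data frame and file paths are hypothetical.

```r
# Small made-up data frame for illustration
df <- data.frame(x = c(10, 20), y = c("a", "b"))

with.names <- tempfile(fileext = ".csv")
without.names <- tempfile(fileext = ".csv")

write.csv(df, file = with.names)                        # default: row.names = TRUE
write.csv(df, file = without.names, row.names = FALSE)

# Reading back: the default adds an extra leading column holding 1,2,...
names(read.csv(with.names))
names(read.csv(without.names))
```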

Try It Out

Write out one of the files you imported. Try varying options like sep and quote.

Saving Objects

saveRDS/readRDS are used to save and restore (a compressed version of) individual R objects

# save our data set

saveRDS(set1,file="TstObj.rds")

# get it back

newtst <- readRDS("TstObj.rds")

# can save any R object. Try a vector

my.vector <- c(1,8,-100)

saveRDS(my.vector, file="JustAVector.rds")
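A quick way to convince yourself that nothing is lost is a round trip. The sketch below uses a temporary file (an assumption, so it runs anywhere) and checks that the restored object is identical to the original.

```r
my.vector <- c(1, 8, -100)

rds.file <- tempfile(fileext = ".rds")   # temporary path, for illustration
saveRDS(my.vector, file = rds.file)

# readRDS returns the object, so we choose the name it is assigned to
restored <- readRDS(rds.file)
identical(my.vector, restored)
```

Because readRDS returns a value rather than restoring a name, it cannot silently overwrite a variable already in your workspace.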

Saving Environment

We can save all variables in the current R workspace with save.image

We can load in a saved workspace with load

R will ask you to save your work when you exit

# Save all our work

save.image("AllMyWork.RData")

# Reload it

load("AllMyWork.RData")

# name given to default save

load(".RData")

The Basics of R

Let’s do a whirlwind tour of R: its syntax and data structures

This won’t cover all the details, but will cover the most important parts

Basic R Data Types

# numeric types: integer, double

# character

"my string"

# logical

# arithmetic as you'd expect

43 + 1 * 2^4

# so too logical operators/comparison

TRUE | FALSE

1 + 7 != 7

# Other logical operators:

# &, |, !

# <,>,<=,>=, ==, !=

Data Types Cont.

# variable assignment is done with the <- operator

my.number <- 483

# the '.' above does nothing. we could have done:

# mynumber <- 483

# instead

# it's an Rism to use .'s in variable names.

# typeof() tells us the type

typeof(my.number)

## [1] "double"

# we can convert between types

my.int <- as.integer(my.number)

typeof(my.int)

## [1] "integer"

# we can test for types

is.logical(my.int)

## [1] FALSE
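A few more type checks and conversions, as a small sketch (the values are arbitrary):

```r
typeof(1L)          # the L suffix makes an integer
typeof(1)           # plain numbers are doubles
typeof(1L + 1.5)    # mixing integer and double gives a double

as.integer(3.9)     # conversion truncates toward zero, it does not round
as.integer(-3.9)
as.numeric("12.5")  # strings that look like numbers can be converted
is.numeric(1L)      # integers and doubles both count as numeric
```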

R Data Structures - Vectors

# the vector is the most important data structure

# create it with c()

my.vec <- c(1,2,67,-98)

# get some properties

str(my.vec)

## num [1:4] 1 2 67 -98

length(my.vec)

## [1] 4

# access elements with []

my.vec[3]

## [1] 67

my.vec[c(3,4)]

## [1] 67 -98

# can do assignment too

my.vec[5] <- 41.2

Vectors - Cont.

# other ways to create vectors

x <- 1:6

y <- seq(7,12,by=1)

# Operations get recycled through the whole vector

x + 1

## [1] 2 3 4 5 6 7

x > 3

## [1] FALSE FALSE FALSE TRUE TRUE TRUE

# Can do component-wise operations between vectors

x * y

## [1] 7 16 27 40 55 72

x / y

## [1] 0.1428571 0.2500000 0.3333333 0.4000000 0.4545455 0.5000000

y %/% x

## [1] 7 4 3 2 2 2
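Recycling also applies when the two vectors have different lengths. A minimal sketch, with arbitrary values chosen for illustration:

```r
x <- 1:6

# The shorter vector is repeated to match the longer one,
# so this is the same as x * c(10, 100, 10, 100, 10, 100)
scaled <- x * c(10, 100)
scaled

# If the longer length is not a multiple of the shorter one, R still
# recycles but issues a warning (suppressed here so the script runs quietly)
uneven <- suppressWarnings(x + 1:4)
uneven
```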

Try It Out

# Try to guess what the following lines will do

# Will it run at all? If so, what will it give?

# Think about it and run to confirm

7 -> w

w <- z <- 44

1 + TRUE

0 | 15 & 3

my.vec[2:4]

my.vec[-2]

my.vec[c(TRUE,FALSE,FALSE,TRUE,FALSE)]

my.vec[

c(TRUE,FALSE,FALSE,TRUE,TRUE)

] <- TRUE

my.vec[3] <- "I'm a string"

as.numeric(my.vec)

x[x>3]

x + c(1,2)

Matrices

# matrices are 2d vectors.

# create using matrix()

my.matrix <- matrix(rnorm(20),nrow=4,ncol=5)

# rnorm(20) draws 20 random samples from a N(0,1) distribution

my.matrix

## [,1] [,2] [,3] [,4] [,5]

## [1,] 0.5351131 1.08710882 0.5670939 0.2800755 -0.8050743

## [2,] -1.9263838 0.86267009 0.7318280 0.4177110 -0.9576529

## [3,] -1.2931770 -1.03381286 -0.9035750 1.9787516 0.3747967

## [4,] -2.6190953 -0.04829205 1.3157181 1.2562005 0.1131199

# note matrices are filled by column

# Get details

dim(my.matrix)

## [1] 4 5

nrow(my.matrix)

## [1] 4

ncol(my.matrix)

## [1] 5

Matrices - Cont.

# Indexing is similar to vectors but with 2 dimensions

# get second row

my.matrix[2,]

## [1] -1.9263838 0.8626701 0.7318280 0.4177110 -0.9576529

# get first and fourth columns of row three

my.matrix[3,c(1,4)]

## [1] -1.293177 1.978752

# transposing done with t()
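Since my.matrix is random, a small matrix with known entries makes the fill order and the transpose easier to follow. A sketch with made-up values:

```r
# A small matrix with known entries (filled by column, the default)
m <- matrix(1:6, nrow = 2, ncol = 3)
m
m[1, ]        # first row: 1 3 5

# t() swaps rows and columns
tm <- t(m)
dim(tm)       # 3 2

# byrow = TRUE fills across rows instead
m.byrow <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE)
m.byrow[1, ]  # first row: 1 2 3
```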

Lists

# lists are similar to vectors but can contain items of different types

# create with list()

my.list <- list("just a string",
                44,
                my.matrix,
                c(TRUE,TRUE,FALSE))

# access items via double brackets [[]]

my.list[[4]]

## [1] TRUE TRUE FALSE

# access multiple items

my.list[1:2]

## [[1]]

## [1] "just a string"

## [[2]]

## [1] 44

# list items can be named too

named.list <- list(Item1="my string",

Item2=my.list)

# access of named item is via dollar sign operator

# [[]] also works

c(named.list$Item1,named.list[[1]])

## [1] "my string" "my string"
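The difference between single and double brackets is a common stumbling block. A minimal sketch, using a made-up list:

```r
demo.list <- list(name = "just a string", values = c(1, 2, 3))

class(demo.list["values"])     # "list": single brackets return a sub-list
class(demo.list[["values"]])   # "numeric": double brackets return the item itself

# so arithmetic works on the extracted item, not on the sub-list
sum(demo.list[["values"]])     # 6
demo.list$values[2]            # 2
```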

Putting it together

Let’s practice with R data types by doing PCA on the iris data.

data("iris")

head(iris)

str(iris)

Note that iris is a data.frame. A data.frame is essentially a list whose items are the columns of the data set.

PCA outline

Save the numeric columns of iris as a matrix. (Hint: ?as.matrix)

Center and scale the matrix (Hint: ?scale)

Compute the correlation matrix

$$R = \frac{1}{n-1} X^T X$$

Here $X$ is our (centered and scaled) data matrix, $n$ is the number of rows/observations in our data, and $X^T$ is the transpose of $X$.

(Hint: t(X) is the transpose operator and A%*%B performs matrix multiplication on the matrices A and B)

PCA outline cont.

Obtain the two leading eigenvectors of the correlation matrix $R$. Denote these as $v_1$, $v_2$. (Hint: ?eigen)

Compute the first and second principal components via

$z_1 = X v_1$

$z_2 = X v_2$

Produce a scatter plot of $z_1$ vs $z_2$ (Hint: ?plot)

Take a few moments to try it yourself before looking at the answers on the next slides.

PCA from scratch

data("iris")

# get numeric portions of list and make a matrix

X <- as.matrix(iris[1:4])

# center and scale

X <- scale(X,center = TRUE,scale=TRUE)

# get the number of rows

n <- nrow(X)

# compute correlation matrix

R <- (1/(n-1))*t(X)%*%X

# perform eigen decomposition

Reig <- eigen(R)

# get eigen vectors

Reig.vecs <- Reig$vectors

# create principal components

pc1 <- X%*%Reig.vecs[,1]

pc2 <- X%*%Reig.vecs[,2]
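Some sanity checks on the intermediate results can catch mistakes early. The sketch below recomputes the same quantities and compares them with built-in functions; these checks are an addition for illustration, not part of the original workflow.

```r
data("iris")
X <- scale(as.matrix(iris[1:4]), center = TRUE, scale = TRUE)
n <- nrow(X)
R <- (1/(n-1)) * t(X) %*% X

# R should agree with the built-in correlation of the raw columns
all.equal(R, cor(iris[1:4]), check.attributes = FALSE)

Reig <- eigen(R)

# the eigenvectors are orthonormal
all.equal(crossprod(Reig$vectors), diag(4))

# the eigenvalues of a correlation matrix sum to the number of variables
sum(Reig$values)
```

Eigenvectors are only defined up to sign, so principal components computed this way can come out with flipped signs relative to another implementation and still be correct.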

PCA from scratch cont.

# compare to R's PCA function

their.pcs <- prcomp(iris[1:4],center = TRUE,scale. = TRUE)

head(their.pcs$x[,1:2])

## PC1 PC2

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053

# our result

head(cbind(pc1,pc2))

## [,1] [,2]

## [1,] -2.257141 -0.4784238

## [2,] -2.074013 0.6718827

## [3,] -2.356335 0.3407664

## [4,] -2.291707 0.5953999

## [5,] -2.381863 -0.6446757

## [6,] -2.068701 -1.4842053

PCA from scratch cont.

plot(pc1,pc2,col=iris$Species)

[Figure: scatter plot of pc1 against pc2, points colored by species]

Factors

# Factors are like vectors, but with predefined allowed values called levels

# Factors are used to represent categorical variables in R

# create a factor

factor1 <- factor(c('Good','Bad','Ugly'))

# find its levels

levels(factor1)

## [1] "Bad" "Good" "Ugly"

# below gives warning, but not error

factor1[4] <- 17

## Warning in `[<-.factor`(`*tmp*`, 4, value = 17): invalid factor level, NA generated

# see what happened

factor1

## [1] Good Bad Ugly <NA>

## Levels: Bad Good Ugly

factor1[4] <- 'Bad'

# get the breakdown

table(factor1)

## factor1

## Bad Good Ugly

## 2 1 1

Note that in one of our previous examples R filled in the improper factor value with NA

NA is R’s way of specifying missing data

Note that missing data is handled differently than ordinary values, as we will see as we go along.
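A short sketch of how NA behaves differently from ordinary values (the vector is made up for illustration):

```r
v <- c(4, NA, 10)

NA == NA               # NA, not TRUE: comparisons with NA are themselves missing
is.na(v)               # the reliable way to test for missing values

mean(v)                # NA: missing values propagate through most computations
mean(v, na.rm = TRUE)  # 7: drop missing values first

observed <- v[!is.na(v)]   # keep only the observed values
observed
```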

Questions

What will the following lines of code do?

my.matrix[3:4,1:2] <- c(4,5)

my.matrix[4,5] <- 'string'

mf.strings <- c('F','F','M','F')

factor2 <- as.factor(mf.strings)

c(factor1, factor2)

factor1 == 'Ugly'

my.list[[3]][2,]

sum(c(1,2,3,NA))

sum(c(1,2,3,NA),na.rm = TRUE)

Data Frames

The data.frame is how R represents data sets. They are simply lists, with a few additional restrictions.

# create your own

my.df <- data.frame(
  age = c(45,27,19,59,71,13,5),
  gender = factor(c('M','M','M','F','M','F','F'))
)

str(my.df)

## 'data.frame': 7 obs. of 2 variables:

## $ age : num 45 27 19 59 71 13 5

## $ gender: Factor w/ 2 levels "F","M": 2 2 2 1 2 1 1

Data Frames - Cont.

Individual variables can be accessed via $ operator

my.df$age

## [1] 45 27 19 59 71 13 5

summary(my.df$age)

## Min. 1st Qu. Median Mean 3rd Qu. Max.

## 5.00 16.00 27.00 34.14 52.00 71.00

table(my.df$gender)

## F M

## 3 4

# data frames are really just lists

my.df[[2]]

## [1] M M M F M F F

## Levels: F M
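Because a data frame is a list of columns, the usual list tools work on it. A minimal sketch on a small made-up data frame:

```r
df <- data.frame(age = c(45, 27, 19),
                 gender = factor(c("M", "M", "F")))

is.list(df)         # TRUE
length(df)          # 2: the number of columns, as for a list with 2 items
names(df)

# apply a function to each column in turn
col.classes <- sapply(df, class)
col.classes

# one of the extra restrictions: every column has the same length
nrow(df)
```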

Data Frames - Cont.

# data.frames can be subsetted like matrices

my.df[1:3,c("age")]

## [1] 45 27 19

# logical subsetting is especially useful for data.frames

# get ages over 40

age.logic <- my.df$age > 40

# take a subset of these rows

my.df[age.logic,]

## age gender

## 1 45 M

## 4 59 F

## 5 71 M

# create a new variable age.sq

my.df$age.sq <- my.df$age^2

Try It Out

Let’s use R’s internal iris data set to practice with data frames

my.iris <- iris

my.iris

1 Create two new variables Length.Sum and Width.Sum which are the sum of Sepal and Petal length/width respectively.

2 Use subsetting and R’s mean function to find the average Length.Sum of the setosa species

my.iris$Length.Sum = my.iris$Sepal.Length +

my.iris$Petal.Length

my.iris$Width.Sum = my.iris$Sepal.Width +

my.iris$Petal.Width

setosa.inds <- my.iris$Species == 'setosa'

mean(my.iris[setosa.inds,]$Length.Sum)

## [1] 6.468

Control Structures

R has all the typical control structures:

if-else statements

for loops

while loops

Syntax

if(logical_expression){
  execute_code
} else{
  execute_other_code
}

for(value in sequence){
  work_with_value
}

while(expression_is_true){
  execute_code
}
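The templates above can be filled in as follows. This is a small runnable sketch; the variable names and values are arbitrary.

```r
# if-else: classify a number as even or odd
x <- 7
if (x %% 2 == 0) {
  parity <- "even"
} else {
  parity <- "odd"
}

# for loop: add up the numbers 1 through 5
total <- 0
for (value in 1:5) {
  total <- total + value
}

# while loop: count how many times 100 can be halved before dropping below 1
halvings <- 0
remaining <- 100
while (remaining >= 1) {
  remaining <- remaining / 2
  halvings <- halvings + 1
}

c(parity = parity, total = total, halvings = halvings)
```

When typing an if-else at the console, keep else on the same line as the closing brace of the if block; otherwise R treats the if as finished before it sees the else.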

Functions

Defining functions in R is easy

# use function key word with assignment <-

my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # the value of the last expression gets returned
  # (invisibly here, because it is an assignment)
  return.me <- sum / length(input.vector)
}

my.mean(1:10)

Functions cont.

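my.mean <- function(input.vector){
  sum = 0
  for(val in input.vector) {
    sum = sum + val
  }
  # returns 1 now, because 1 is the last expression
  # (the final line is reconstructed to match the output shown)
  return.me <- sum / length(input.vector)
  1
}

my.mean(1:10)

## [1] 1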
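To avoid surprises about what a function returns, the value can be stated explicitly with return(). The sketch below is a hypothetical variant, my.mean2, which also adds an optional na.rm argument with a default value; neither the name nor the argument comes from the slides.

```r
# my.mean2 is a hypothetical variant of my.mean with an explicit return()
my.mean2 <- function(input.vector, na.rm = FALSE) {
  if (na.rm) {
    # drop missing values before averaging
    input.vector <- input.vector[!is.na(input.vector)]
  }
  total <- 0
  for (val in input.vector) {
    total <- total + val
  }
  return(total / length(input.vector))
}

my.mean2(1:10)                        # 5.5
my.mean2(c(1, NA, 3), na.rm = TRUE)   # 2
```

Using total rather than sum as the accumulator name avoids shadowing R's built-in sum function inside the function body.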

Try It Out

Create a function my.summary which takes a vector, x, as input, calculates the mean, standard deviation, max, and min of x, and returns these in a list

Try out R’s internal functions mean, sd, max, min

my.summary <- function(x) {
  list(
    mean = mean(x),
    sd = sd(x),
    max = max(x),
    min = min(x)
  )
}

Try It Out cont.

Loop through the variables in my.iris, evaluating my.summary on each (provided the variable is numeric) and printing the maximum.

Hint: Use is.numeric to test each variable before applying my.summary

for(var in my.iris) {
  if(is.numeric(var)){
    tmp <- my.summary(var)
    print(tmp$max)
  }
}
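The same result can also be obtained without an explicit loop. This is a sketch of one alternative using sapply, which applies a function to every column of a data frame; it starts from a fresh copy of iris so it runs on its own.

```r
my.iris <- iris

# one TRUE/FALSE per column: is this column numeric?
numeric.cols <- sapply(my.iris, is.numeric)

# apply max to each numeric column; the result is a named vector
col.max <- sapply(my.iris[numeric.cols], max)
col.max
```

The named vector keeps track of which maximum belongs to which variable, which the printed output of the loop does not.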
