Introduction to R

Emily Kalah Gade

University of Washington

Credit to Kristin Siebel for development of much of this PowerPoint



Overview

I. What is R?
II. The R Environment
III. Reading in Data
IV. Viewing and Manipulating Data
V. Data Analysis


What is R?

• Full programming environment
• Language: entirely command-driven
• Object-oriented


Why Use R?

• Free!
• Extremely flexible
• Many additional packages available
• Excellent graphics

Disadvantages

• Steep learning curve
• Difficult data entry


Download R + Packages

Download R:

http://cran.r-project.org

Available for Linux, MacOS, and Windows

Packages

• Collection of functions for specific tasks (1000s of them)
• Come with reference manual and vignettes / sample code
• Search for packages relevant to your area of interest:
  • Google Scholar for papers introducing new packages
  • R-bloggers


Hints to Remember

• R is case-sensitive: X is not the same as x
• Assignment operator: = or <-
• Objects need to be assigned a name, otherwise they get dumped to the main window, not saved to the environment.
• Use a text editor, not MS Word! Using a basic editor such as TextPad, or even R's built-in editor, keeps extraneous symbols out of your code and keeps quotation marks non-directional.
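A minimal sketch of the first three hints. The object names (x, X, y) are made up for illustration.

```r
# R is case-sensitive: x and X are two different objects.
x <- 5
X <- 10

# Both assignment operators store a named object in the environment.
y = x + X        # same effect as y <- x + X

# A result without a name is printed to the console but not stored.
x * 2

# ls() lists the objects that currently exist in the environment.
ls()
```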


The R Environment

A traditional stats program like SPSS or Stata only contains one rectangular dataset at a time. All analysis is done on the current dataset.

In contrast, the R environment is like a sandbox.

It can contain a large number of different objects.


Rectangular Dataset (Excel, SPSS, Stata, SAS)

          Variable 1   Variable 2   Variable 3
Case 1
Case 2
Case 3
Case 4
Case 5


R Environment (Object-Oriented): Objects have both Type and Mode

Objects in the environment: Function 1, Function 2, Results, Vector 1, Vector 2, Matrix, Data Frame, String, Numeric Value


The R Environment

R is also function-driven.

The functions act on objects and return objects.

Functions themselves are objects, too!

Input Arguments (Objects) → function works its black-box magic! → Output (Objects)
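A small sketch of the idea that functions take objects in, return objects, and are objects themselves. The function standardize() and the vector scores are hypothetical names, not part of the original slides.

```r
# A hypothetical function: takes a numeric vector (an object) and
# returns a new numeric vector (another object).
standardize <- function(x) {
  (x - mean(x)) / sd(x)
}

scores <- c(2, 4, 6, 8)        # input object
z <- standardize(scores)       # output object, saved under the name z

# The function is itself an object that can be inspected or passed around.
class(standardize)             # "function"
sapply(list(scores, z), mean)  # mean() is handed to sapply() as an argument
```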


Help Function

help(function name)

help.search("search term")

Try: help(lm), ?lm, and help.search("linear regression")

Sometimes one help file will contain information for several functions.

Usage: Shows syntax for command and required arguments (input) and any default values for arguments.


Creating Objects

Object       Create Function
vector       c(), vector()
factor       factor()
matrix       matrix()
data frame   data.frame()
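Each of the creation functions in the table can be tried directly. This is a sketch with made-up values. The object names are illustrative only.

```r
ages  <- c(23, 35, 47, 51)                 # numeric vector built with c()
flags <- vector("logical", length = 4)     # logical vector of four FALSE values
party <- factor(c("D", "R", "R", "I"))     # categorical variable with 3 levels
m     <- matrix(1:6, nrow = 2, ncol = 3)   # filled column by column
df    <- data.frame(age = ages, party = party)

str(df)                                    # compact summary of the data frame
```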


Common Mode Types

Mode        Possible Values
Logical     TRUE or FALSE or NA
Integer     Whole numbers
Numeric     Real numbers
Character   Single character or String (in double quotes)
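A quick way to check what kind of value you have. This sketch uses class() rather than mode(), because mode() reports "numeric" for both integers and real numbers.

```r
class(TRUE)        # "logical"
class(3L)          # "integer" (the L suffix requests an integer)
class(3.14)        # "numeric"
class("hello")     # "character"

mode(3L)           # "numeric": mode() does not separate integers from reals

# NA marks a missing value; is.na() finds it.
is.na(c(1, NA, 3))
```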


Common Object Types

Object       Modes                        More than one mode?
vector       Logical, Char, or Numeric    No
factor       Logical, Char, or Numeric    No
matrix       Logical, Char, or Numeric    No
data frame   Logical, Char, and Numeric   Yes
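The "more than one mode?" column can be seen in action with a short sketch. The values and names below are made up.

```r
# A vector holds a single mode, so mixed values are coerced to a common one.
mixed <- c(1, "a", TRUE)
class(mixed)                 # "character"

# A data frame can hold a different mode in each column.
df <- data.frame(n = c(1.5, 2.5),
                 s = c("a", "b"),
                 l = c(TRUE, FALSE))
sapply(df, class)            # one class per column
```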


Reading in Data

read.table(filename, ...)

> sts = read.csv("C:/temp/statex77.csv")

Use CSV (comma-separated values) format. Almost every stats program will export to this format.
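The file C:/temp/statex77.csv is not distributed with these slides. The sketch below assumes it is a CSV export of R's built-in state.x77 dataset (the column names used later match, but the slides do not say so) and recreates it in a temporary file so the example runs anywhere.

```r
# Assumption: the slide's CSV is an export of the built-in state.x77 matrix.
csv_path <- tempfile(fileext = ".csv")
write.csv(state.x77, csv_path)

# read.csv() is read.table() with defaults suited to comma-separated files.
sts  <- read.csv(csv_path)
sts2 <- read.table(csv_path, header = TRUE, sep = ",")

# The state names were row names, so they come back as an unnamed
# first column, which R labels "X".
colnames(sts)
```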


Viewing Data

What does the dataset look like?

> str(sts)
> attributes(sts)
> colnames(sts)

You can also assign row/col names with these functions.

> dim(sts)
> nrow(sts)
> ncol(sts)
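A sketch of these functions, including their use for assigning names. It assumes the dataset is an export of the built-in state.x77 data, rebuilt here in a temporary file.

```r
# Assumption: sts is an export of the built-in state.x77 dataset.
csv_path <- tempfile(fileext = ".csv")
write.csv(state.x77, csv_path)
sts <- read.csv(csv_path)

dim(sts)                    # rows, then columns
c(nrow(sts), ncol(sts))     # the same two numbers
colnames(sts)

# Used on the left-hand side, the same functions assign names.
rownames(sts) <- as.character(sts$X)
head(rownames(sts), 3)
```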


Viewing Data: Indexing

datasetname[rownum, columnnum]

> sts[1,4]

displays value at row 1, column 4

> sts[2:5, 6]

displays rows 2-5, column 6


Viewing Data: Indexing

> sts[,2]

displays all rows, column 2

> sts[4,]

displays row 4, all columns

> head(sts)

shows the first 6 rows of the data frame (use head(sts, 10) for the first 10)
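The indexing forms above, collected into one runnable sketch. It assumes sts is an export of the built-in state.x77 dataset. Selecting columns by name is an addition not shown on the slides.

```r
# Assumption: sts is an export of the built-in state.x77 dataset.
csv_path <- tempfile(fileext = ".csv")
write.csv(state.x77, csv_path)
sts <- read.csv(csv_path)

sts[1, 4]                     # row 1, column 4
sts[2:5, 6]                   # rows 2-5, column 6
sts[, 2]                      # all rows, column 2
sts[4, ]                      # row 4, all columns
sts[1:3, c("X", "Income")]    # rows 1-3, columns picked by name

head(sts)                     # first 6 rows (the default)
head(sts, 10)                 # first 10 rows
```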


Viewing Data

• You can also access columns (variables) using the ‘$’ symbol if the data frame has column names:

> sts$X[30:35]


Manipulating Data Frames

• Now we can give that first column (variable) a better name than “X”.

> colnames(sts) = c("state", colnames(sts)[2:ncol(sts)])
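A sketch of column access with $ and of the rename, assuming sts is an export of the built-in state.x77 dataset. Assigning to a single element of colnames() is a shorter alternative to rebuilding the whole vector of names.

```r
# Assumption: sts is an export of the built-in state.x77 dataset.
csv_path <- tempfile(fileext = ".csv")
write.csv(state.x77, csv_path)
sts <- read.csv(csv_path)

sts$X[30:35]                  # values 30 to 35 of the column named X

# Rename only the first column; the other names are left untouched.
colnames(sts)[1] <- "state"
colnames(sts)
```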


Manipulating Data Frames

> str(sts)

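R has the unfortunate habit of trying to turn vectors of character strings into factors (categorical data). Since R 4.0.0 this no longer happens by default when reading data, but the conversion below is harmless either way.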

> sts$state = as.character(sts$state)
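To see the conversion on any R version, this sketch forces the old behaviour with stringsAsFactors = TRUE. It assumes the data are an export of the built-in state.x77 dataset and works on the column while it is still named X.

```r
# Assumption: the data are an export of the built-in state.x77 dataset.
csv_path <- tempfile(fileext = ".csv")
write.csv(state.x77, csv_path)

# Force the pre-4.0.0 default so the string column is read as a factor.
sts <- read.csv(csv_path, stringsAsFactors = TRUE)
class(sts$X)                  # "factor"

# Convert the factor back to plain character strings.
sts$X <- as.character(sts$X)
class(sts$X)                  # "character"
```

Passing stringsAsFactors = FALSE to read.csv() avoids the conversion in the first place.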


Manipulating Data: Operators

Arithmetic: + - * / ^

Comparison

< less than

> greater than

<= less than or equal to

>= greater than or equal to

== is equal to

!= is not equal to

Logical

! not

& and

| or

xor() exclusive or
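The operators work element by element on vectors. A short sketch with a made-up vector x:

```r
x <- c(1, 5, 10)

x^2 + 1              # arithmetic is applied to every element
x != 5               # TRUE FALSE TRUE

x > 2 & x < 10       # FALSE TRUE FALSE
x < 2 | x == 10      # TRUE FALSE TRUE
!(x >= 5)            # TRUE FALSE FALSE
xor(x > 2, x > 7)    # FALSE TRUE FALSE
```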


Viewing Data: Using Operators

Viewing subsets of data using column names and operators:

> sts[sts$state == "Washington",]

> sts[sts$Illiteracy >= 1.0,]

> sts$state[sts$Area > 100000]

> sts$state[sts$Life.Exp > 70]
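Conditions can also be combined with the logical operators. This sketch rebuilds sts from the built-in state.x77 dataset (an assumption about the slide's data). The subset() call is an alternative not shown on the slides.

```r
# Assumption: sts matches the built-in state.x77 dataset, with the
# state names stored in a column called state.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

# Names of the states larger than 100,000 square miles.
big <- sts$state[sts$Area > 100000]
big

# Two conditions at once, keeping only a few columns.
sts[sts$Illiteracy >= 1.0 & sts$Life.Exp > 70,
    c("state", "Illiteracy", "Life.Exp")]

# subset() expresses the same kind of filter without repeating sts$.
subset(sts, Area > 100000, select = c(state, Area))
```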


Analyzing Data

What do the variables look like?

> table(sts$Illiteracy)

> hist(sts$Area)

> mean(sts$Life.Exp)

> sd(sts$Life.Exp)

> cor(sts$Illiteracy, sts$HS.Grad)

> mean(sts$Income[sts$Illiteracy >= 1.0])
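A sketch that extends the last command to compare both groups at once. It assumes sts matches the built-in state.x77 dataset. tapply() and summary() are additions not shown on the slides.

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

summary(sts$Life.Exp)                  # min, quartiles, mean, max
table(sts$Illiteracy >= 1.0)           # count of states in each group

# Mean income in each illiteracy group.
group_means <- tapply(sts$Income, sts$Illiteracy >= 1.0, mean)
group_means

cor(sts$Illiteracy, sts$HS.Grad)       # negative in these data
```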


Manipulating Data

Transforming variables:

> Pop.Density = sts$Population/sts$Area

This creates a new vector called Pop.Density of length 50 (our number of cases).


Manipulating Data

We can use Pop.Density without “adding” it to our dataframe.

But if you like the rectangular dataset concept, you can column bind it to the existing dataframe:

> sts = cbind(sts, Pop.Density)
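A runnable sketch of the transformation, assuming sts matches the built-in state.x77 dataset. Assigning with $ is an alternative to cbind() that adds the column in one step.

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

# Division is element by element, so the result has one value per state.
Pop.Density <- sts$Population / sts$Area
length(Pop.Density)

# The free-standing vector can be used directly...
summary(Pop.Density)

# ...or stored as a new column of the data frame.
sts$Pop.Density <- Pop.Density
colnames(sts)
```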


Data Analysis

Hypothesis Testing

t.test(), prop.test()

Regression

lm(), glm()
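A sketch of the two hypothesis tests, assuming sts matches the built-in state.x77 dataset. The particular comparisons (income by illiteracy group, share of states with life expectancy above 70) are chosen only for illustration.

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

# Two-sample t-test: does mean income differ between illiteracy groups?
high_illit <- sts$Illiteracy >= 1.0
tt <- t.test(sts$Income[high_illit], sts$Income[!high_illit])
tt$p.value

# One-sample proportion test: is the share of states with life
# expectancy above 70 different from one half?
n_above <- sum(sts$Life.Exp > 70)
pt <- prop.test(n_above, nrow(sts), p = 0.5)
pt$estimate
```

Like regression output, each test result is an object whose pieces (p.value, estimate, conf.int) can be pulled out with $.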


Data Analysis: OLS Regression

> m1 = lm(Income ~ Illiteracy + log(Pop.Density) + HS.Grad + Murder, data=sts)

The output of the regression is also an object. We’ve named it m1.

> summary(m1)
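Because m1 is an object, its parts can be extracted with helper functions. A self-contained sketch, assuming sts matches the built-in state.x77 dataset:

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)
sts$Pop.Density <- sts$Population / sts$Area

m1 <- lm(Income ~ Illiteracy + log(Pop.Density) + HS.Grad + Murder,
         data = sts)

coef(m1)                   # intercept plus one slope per predictor
summary(m1)$r.squared      # share of variance explained
head(residuals(m1))        # first few residuals
```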


Saving Data

You can use write.csv() or write.table() to save your dataset.

When you quit R, it will ask if you want to save the workspace. This includes all the objects you have created, but it does not include the code you’ve written. You can also use save.image() to save the workspace.

You should always save your code in a *.r file.
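A sketch of both ways of saving, writing to R's temporary directory so nothing on disk is overwritten. The file names are hypothetical, and sts is assumed to match the built-in state.x77 dataset.

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

# Save one dataset as CSV; row.names = FALSE avoids an extra "X" column
# when the file is read back in.
out_csv <- file.path(tempdir(), "sts_clean.csv")
write.csv(sts, out_csv, row.names = FALSE)

# Save every object in the workspace to a single file.
out_rdata <- file.path(tempdir(), "workspace.RData")
save.image(file = out_rdata)

# Check that the CSV round-trips.
sts_back <- read.csv(out_csv)
dim(sts_back)
```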


Other Useful Functions

> ifelse()

> is.na()

> match()

> merge()

> apply()

> order()

> sort()
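One short use of each function, as a sketch. sts is assumed to match the built-in state.x77 dataset. The regions lookup table is built from R's built-in state.name and state.region vectors purely for illustration.

```r
# Assumption: sts matches the built-in state.x77 dataset.
sts <- data.frame(state = rownames(state.x77), state.x77, row.names = NULL)

# ifelse(): recode a variable element by element.
sts$size <- ifelse(sts$Area > 100000, "large", "small")

# is.na(): flag missing values.
is.na(c(3, NA, 1))

# sort() returns the sorted values; order() returns the positions
# that would sort them, which is what you need to sort a data frame.
sort(c(3, 1, 2))     # 1 2 3
order(c(3, 1, 2))    # 2 3 1
top3 <- sts[order(sts$Income, decreasing = TRUE), ][1:3, c("state", "Income")]

# match(): position of a value in a vector.
pos <- match("Washington", sts$state)

# merge(): join two data frames on a shared column.
regions <- data.frame(state = state.name, region = state.region)
merged <- merge(sts, regions, by = "state")

# apply(): run a function over the columns (2) of a matrix.
apply(state.x77, 2, mean)
```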


Advanced Topics

• More on factors
• Lists (data type)
• Loops
• String manipulation
• Writing your own functions
• Graphics