Introduction to R
Emily Kalah Gade
University of Washington
Credit to Kristin Siebel for development of much of this PowerPoint
Overview
I. What is R?
II. The R Environment
III. Reading in Data
IV. Viewing and Manipulating Data
V. Data Analysis
What is R?
• Full programming environment
• Language: entirely command-driven
• Object-oriented
Why Use R?
• Free!
• Extremely flexible
• Many additional packages available
• Excellent graphics
Disadvantages
• Steep learning curve
• Difficult data entry
Download R + Packages
Download R:
http://cran.r-project.org
Available for Linux, MacOS, and Windows
Packages
• Collections of functions for specific tasks (1000s of them)
• Come with a reference manual and vignettes/sample code
• Search for packages relevant to your area of interest:
  Google Scholar for papers introducing new packages
  R-bloggers
Hints to Remember
• R is case-sensitive: X is not the same as x
• Assignment operator: = or <-
• Objects need to be assigned a name; otherwise the result is printed to the console and not saved to the environment.
• Use a text editor, not MS Word! A basic editor such as TextPad, or even R's built-in editor, keeps extraneous symbols out of your code and keeps quotation marks non-directional (straight).
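A minimal sketch of the first three hints. The object names x, X, and y are made up for illustration.

```r
# R is case-sensitive: x and X are two different objects
x <- 5        # assignment with <-
X = 10        # assignment with = also works at the top level

# Without an assignment the result is printed to the console and not stored
x + X

# With an assignment the result is stored under a name and nothing is printed
y <- x + X
y             # typing the name prints the stored value
```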
The R Environment
A traditional stats program like SPSS or Stata only contains one rectangular dataset at a time. All analysis is done on the current dataset.
In contrast, the R environment is like a sandbox.
It can contain a large number of different objects.
Rectangular Dataset (Excel, SPSS, Stata, SAS)
Variable 1 Variable 2 Variable 3
Case 1
Case 2
Case 3
Case 4
Case 5
R Environment (Object-Oriented): Objects have both Type and Mode
[Diagram: one environment holding many objects at once: Function 1, Function 2, Results, Vector 1, Vector 2, Matrix, Data Frame, String, Numeric Value]
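To make the sandbox idea concrete, here is a small sketch with several kinds of objects side by side. All object names are hypothetical.

```r
# Hypothetical objects of several kinds living in the same workspace
ages      <- c(23, 35, 41)                      # numeric vector
greeting  <- "hello"                            # character string
grid      <- matrix(1:6, nrow = 2)              # matrix
people    <- data.frame(id = 1:3, age = ages)   # data frame
double_it <- function(v) v * 2                  # a function is an object too

ls()            # lists the names of the objects in the workspace
class(people)   # "data.frame"
mode(ages)      # "numeric"
mode(greeting)  # "character"
```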
The R Environment
R is also function-driven.
The functions act on objects and return objects.
Functions themselves are objects, too!
[Diagram: Input Arguments (Objects) -> function works its black-box magic! -> Output (Objects)]
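A short sketch of objects going into a function and an object coming out. The function center() and the vector scores are invented for this example.

```r
# A hypothetical function: takes a numeric vector, returns a numeric vector
center <- function(v) {
  v - mean(v)
}

scores   <- c(2, 4, 6)
centered <- center(scores)   # input object in, output object out
centered                     # -2 0 2

# Because functions are objects, they can be handed to other functions
sapply(list(scores, centered), mean)   # 4 0
```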
Help Function
help(function name)
help.search("search term")
Try: help(lm), ?lm, and help.search("linear regression")
Sometimes one help file will contain information for several functions.
Usage: Shows syntax for command and required arguments (input) and any default values for arguments.
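The arguments and defaults listed under Usage can also be inspected from the console. A small sketch using base functions only:

```r
args(mean)                  # prints the argument list of mean()
arg_names <- names(formals(lm))
arg_names                   # argument names of lm(), starting with "formula" and "data"
formals(lm)$method          # default value of the method argument: "qr"
```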
Creating Objects
Object        Create Function
vector        c(), vector()
factor        factor()
matrix        matrix()
data frame    data.frame()
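One possible call for each create function in the table. The values and names are invented for illustration.

```r
v  <- c(1.5, 2.5, 4)                     # numeric vector built from values
z  <- vector("numeric", length = 3)      # empty numeric vector: 0 0 0
f  <- factor(c("low", "high", "low"))    # factor with levels "high" and "low"
m  <- matrix(1:6, nrow = 2, ncol = 3)    # 2 x 3 matrix, filled column by column
df <- data.frame(score = v, group = f)   # data frame with two columns

str(df)                                  # compact summary of the new data frame
```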
Common Mode Types
Mode          Possible Values
Logical       TRUE or FALSE or NA
Integer       Whole numbers
Numeric       Real numbers
Character     Single character or String (in double quotes)
Common Object Types
Object        Modes                         More than one mode?
vector        Logical, Char, or Numeric     No
factor        Logical, Char, or Numeric     No
matrix        Logical, Char, or Numeric     No
data frame    Logical, Char, and Numeric    Yes
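A brief sketch of what "more than one mode" means in practice. The objects are made up, and stringsAsFactors = FALSE is set explicitly so the result does not depend on the R version.

```r
# A vector holds a single mode, so mixed input is coerced to character
mixed <- c(1, "a", TRUE)
mode(mixed)                # "character"
mixed                      # "1" "a" "TRUE"

# A data frame can hold a different mode in each column
df <- data.frame(n = c(1.5, 2), s = c("a", "b"), l = c(TRUE, FALSE),
                 stringsAsFactors = FALSE)
col_modes <- sapply(df, mode)
col_modes                  # n: "numeric", s: "character", l: "logical"
```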
Reading in Data
read.table(filename, ...)
> sts = read.csv("C:/temp/statex77.csv")
Use CSV (comma-separated values) format. Almost every stats program will export to this format.
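If you do not have the file, the following sketch produces a comparable one. It assumes (this is a guess, not stated in the slides) that statex77.csv was exported from R's built-in state.x77 matrix, and it writes to a temporary file instead of C:/temp.

```r
# Assumption: the CSV is an export of the built-in state.x77 matrix
path <- tempfile(fileext = ".csv")
write.csv(state.x77, path)   # state names are written as an unnamed first column

sts <- read.csv(path)        # read.csv() is read.table() with CSV defaults
colnames(sts)[1]             # "X": the name R gives to the unnamed column
dim(sts)                     # 50 rows, 9 columns
```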
Viewing Data
What does the dataset look like?
> str(sts)
> attributes(sts)
> colnames(sts)
You can also assign row/col names with these functions.
> dim(sts)
> nrow(sts)
> ncol(sts)
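A sketch of these viewing functions, including assigning row names. It rebuilds sts from the built-in state.x77 matrix, on the assumption that the CSV has the same contents.

```r
# Assumption: sts matches the built-in state.x77 data, with state names in column X
sts <- data.frame(X = rownames(state.x77), state.x77, row.names = NULL)

str(sts)                   # column names, types, and the first few values
dim(sts)                   # 50 9
colnames(sts)              # "X" followed by the eight variable names

# The same functions can assign names: use the state names as row names
rownames(sts) <- sts$X
sts["Ohio", "Income"]      # rows can now be looked up by name
```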
Viewing Data: Indexing
datasetname[rownum, columnnum]
> sts[1,4]
displays value at row 1, column 4
> sts[2:5, 6]
displays rows 2-5, column 6
Viewing Data: Indexing
> sts[,2]
displays all rows, column 2
> sts[4,]
displays row 4, all columns
> head(sts)
shows the first 6 rows of the data frame (the default; use head(sts, 10) for the first 10)
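The indexing forms above, collected in one runnable sketch. As an assumption, sts is rebuilt from the built-in state.x77 matrix.

```r
# Assumption: sts matches the built-in state.x77 data, with state names in column X
sts <- data.frame(X = rownames(state.x77), state.x77, row.names = NULL)

sts[1, 4]             # row 1, column 4: the Illiteracy value for Alabama
sts[2:5, c(1, 6)]     # rows 2-5, columns 1 and 6
sts[, 2]              # all rows, column 2 (returned as a vector)
sts[4, ]              # row 4, all columns (returned as a one-row data frame)

nrow(head(sts))       # 6: head() returns six rows by default
nrow(head(sts, 10))   # 10: ask for ten rows explicitly
tail(sts, 3)          # the last three rows
```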
Viewing Data
• You can also access columns (variables) using the '$' symbol if the data frame has column names:
> sts$X[30:35]
Manipulating Data Frames
• Now we can give that first column (variable) a better name than "X".
> colnames(sts) = c("state", colnames(sts)[2:ncol(sts)])
Manipulating Data Frames
> str(sts)
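In versions before 4.0.0, R has the unfortunate habit of turning columns of character strings into factors (categorical data) when it reads or builds a data frame. If str(sts) shows state as a factor, convert it back: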
> sts$state = as.character(sts$state)
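A sketch of the rename and the factor conversion together. As assumptions, sts is rebuilt from the built-in state.x77 matrix, and stringsAsFactors = TRUE is set on purpose to reproduce the factor behaviour on any R version.

```r
# Assumption: sts matches state.x77; stringsAsFactors = TRUE mimics the older default
sts <- data.frame(X = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = TRUE)

colnames(sts)[1] <- "state"            # shorter way to rename only the first column
class(sts$state)                       # "factor"

sts$state <- as.character(sts$state)   # convert the factor to plain strings
class(sts$state)                       # "character"
sts$state[1:3]                         # first three state names
```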
Manipulating Data: Operators
Arithmetic: + - * / ^
Comparison
< less than
> greater than
<= less than or equal to
>= greater than or equal to
== is equal to
!= is not equal to
Logical
! not
& and
| or
xor() exclusive or
Viewing Data: Using Operators
Viewing subsets of data using column names and operators:
> sts[sts$state == "Washington",]
> sts[sts$Illiteracy >= 1.0,]
> sts$state[sts$Area > 100000]
> sts$state[sts$Life.Exp > 70]
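Comparison and logical operators can be combined in one subset. A sketch, assuming sts is rebuilt from the built-in state.x77 matrix. The thresholds are arbitrary.

```r
# Assumption: sts matches the built-in state.x77 data, with a state column
sts <- data.frame(state = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = FALSE)

# & requires both conditions, | requires at least one
big_long <- sts$state[sts$Area > 100000 & sts$Life.Exp > 70]
pnw      <- sts[sts$state == "Washington" | sts$state == "Oregon",
                c("state", "Income")]

# subset() is an alternative that lets you drop the sts$ prefix
high_illit <- subset(sts, Illiteracy >= 1.0 & Income > 4500,
                     select = c(state, Illiteracy, Income))
big_long
pnw
high_illit
```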
Analyzing Data
What do the variables look like?
> table(sts$Illiteracy)
> hist(sts$Area)
> mean(sts$Life.Exp)
> sd(sts$Life.Exp)
> cor(sts$Illiteracy, sts$HS.Grad)
> mean(sts$Income[sts$Illiteracy >= 1.0])
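A few more one-line summaries in the same spirit. A sketch, assuming sts is rebuilt from the built-in state.x77 matrix.

```r
# Assumption: sts matches the built-in state.x77 data, with a state column
sts <- data.frame(state = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = FALSE)

summary(sts$Life.Exp)     # minimum, quartiles, mean, and maximum in one call

# Mean income for states below and at/above an illiteracy rate of 1.0
by_group <- tapply(sts$Income, sts$Illiteracy >= 1.0, mean)
by_group                  # element "TRUE" equals mean(sts$Income[sts$Illiteracy >= 1.0])

# Correlation matrix for several variables at once
cor(sts[, c("Income", "Illiteracy", "HS.Grad")])
```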
Manipulating Data
Transforming variables:
> Pop.Density = sts$Population/sts$Area
This creates a new vector called Pop.Density of length 50 (our number of cases).
Manipulating Data
We can use Pop.Density without “adding” it to our dataframe.
But if you like the rectangular dataset concept, you can column bind it to the existing dataframe:
> sts = cbind(sts, Pop.Density)
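An alternative to cbind() is to assign the new column directly with '$'. A sketch, assuming sts is rebuilt from the built-in state.x77 matrix.

```r
# Assumption: sts matches the built-in state.x77 data, with a state column
sts <- data.frame(state = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = FALSE)

# Create the variable and attach it to the data frame in one step
sts$Pop.Density <- sts$Population / sts$Area

length(sts$Pop.Density)   # 50, one value per case
ncol(sts)                 # 10: the original 9 columns plus the new one
```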
Data Analysis
Hypothesis Testing
t.test(), prop.test()
Regression
lm(), glm()
Data Analysis: OLS Regression
> m1 = lm(Income ~ Illiteracy + log(Pop.Density) + HS.Grad + Murder, data=sts)
The output of the regression is also an object. We’ve named it m1.
> summary(m1)
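Because m1 is an object, its pieces can be pulled out with extractor functions. A sketch, assuming sts is rebuilt from the built-in state.x77 matrix.

```r
# Assumption: sts matches the built-in state.x77 data
sts <- data.frame(state = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = FALSE)
sts$Pop.Density <- sts$Population / sts$Area

m1 <- lm(Income ~ Illiteracy + log(Pop.Density) + HS.Grad + Murder, data = sts)

class(m1)               # "lm"
coef(m1)                # named vector of estimated coefficients
head(residuals(m1))     # first few residuals
summary(m1)$r.squared   # R-squared, taken from the summary object
```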
Saving Data
You can use write.csv() or write.table() to save your dataset.
When you quit R, it will ask if you want to save the workspace. This includes all the objects you have created, but it does not include the code you’ve written. You can also use save.image() to save the workspace.
You should always save your code in a *.r file.
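A sketch of both ways of saving. The file names are hypothetical, the files go to R's temporary directory, and sts is rebuilt from the built-in state.x77 matrix as an assumption.

```r
# Assumption: sts matches the built-in state.x77 data
sts <- data.frame(state = rownames(state.x77), state.x77,
                  row.names = NULL, stringsAsFactors = FALSE)

# Save the dataset as CSV (hypothetical file name) and read it back as a check
out <- file.path(tempdir(), "sts_clean.csv")
write.csv(sts, out, row.names = FALSE)   # row.names = FALSE avoids an extra column
back <- read.csv(out)
identical(dim(back), dim(sts))           # TRUE

# Save every object in the workspace to one file
ws <- file.path(tempdir(), "workspace.RData")
save.image(ws)
```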
Other Useful Functions
> ifelse()
> is.na()
> match()
> merge()
> apply()
> order()
> sort()
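One small call for each of these functions. All data here are made up.

```r
x <- c(3, NA, 1, 2)
is.na(x)                          # FALSE TRUE FALSE FALSE
ifelse(is.na(x), 0, x)            # 3 0 1 2: replace missing values with 0
sort(x)                           # 1 2 3: NA is dropped by default
order(x)                          # 3 4 1 2: positions in sorted order, NA last
match(c("b", "z"), c("a", "b"))   # 2 NA: position of each value in the second vector

m <- matrix(1:6, nrow = 2)
apply(m, 1, sum)                  # row sums: 9 12
apply(m, 2, sum)                  # column sums: 3 7 11

left   <- data.frame(id = 1:3, a = c("x", "y", "z"))
right  <- data.frame(id = c(2, 3, 4), b = c(10, 20, 30))
joined <- merge(left, right, by = "id")   # keeps ids found in both: 2 and 3
joined
```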
Advanced Topics
• More on factors
• Lists (data type)
• Loops
• String manipulation
• Writing your own functions
• Graphics