r course 2014: lecture 1
TRANSCRIPT
-
8/11/2019 R Course 2014: Lecture 1
Lecture 1: Overview of Workflow and Why R
Ben Fanson
Simeon Lisovski
-
Workshop Structure
1) 10 weeks, meeting weekly
a) Tuesday, 10am, green room
2) Sessions are broken up into 2 parts
a) Lecture part (20 - 45min)
b) Hands on section (20 - 45min)
3) Informal setting (ask questions...)
-
Our background
Ben Fanson
Masters in Statistics
converted to R (2 years ago) from SAS
Simeon Lisovski
R programmer for years
has his own R package
-
Workshop Goals
1) Development of an analysis workflow
2) Become proficient in R fundamentals for data analysis
-
Disclaimer
1) This is NOT a statistics workshop
2) There are many, many ways to do the same thing in R. Our goal is that you know how to do one way.
3) We do not focus on programming efficiency
4) This workshop is a work in progress...
-
Lecture 1 Goals
1) Convince you that you should care
2) overview of data analysis workflow
3) Why R?
4) R editors (Rstudio)
5) Getting started with R
-
Workflow: Reasons to Care
1) Research more reproducible
2) Efficiency
3) Trend towards submitting code
-
Reasons to Care
Efficiency
- ~80% of my data analysis is data cleaning, restructuring, s... and visualizing.
- Re-running analyses after finding errors, new transformations, adding more data
- Reviewers requesting a specific analysis
- Easier to go back to old code and figure out your process
-
Reasons to Care
R script and journals
endvai et al. 2013 Proc Roy Soc
R code
R gr...
-
Reasons to Care
R script and journals
arn et al. 2014 Proc Roy Soc Population-level...
-
Analysis Workflow
Acquire/store Data
Data Preparation: Data Cleaning, Reformatting, Queries/merges
Analysis: Statistical Methods, Asse...
Write-up: Tables, Figures, Reports, Datasets
-
Attributes of a good workflow
Transparency
- organized, logical, well documented
- e.g. commented code, structured programs, organized folders
Modularity
- Keep scripts simple (not too many tasks per script), have re-usable ... in one location
- e.g. file with project functions
Portability
- Make it easy to share scripts (collaborators, reviewers)
- e.g. relative pathnames, nested folders
http://stats.stackexchange.com/questions/2910/how-to-efficiently-manage-a-statistical-analysis-pro
http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html
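A minimal sketch of the "relative pathnames" idea under Portability. The folder name 'data' and the file name 'measurements.csv' are hypothetical and only used for illustration.

```r
# Build paths relative to the project folder instead of hard-coding
# an absolute path that only exists on one computer.
# 'data' and 'measurements.csv' are hypothetical names.
data_dir  <- "data"
data_file <- file.path(data_dir, "measurements.csv")
print(data_file)

# An absolute path such as "C:/Users/me/project/data/measurements.csv"
# stops working as soon as the project folder is moved or shared.
```

Because the path is relative, a collaborator only needs to open the project folder for the same script to find the file.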
-
Analysis Workflow
Acquire/store Data
Data Preparation
-
Lots of systems available
Data management
Text files
Spreadsheets (e.g. Excel)
Relational databases
-
Good practices
Data management
1) Raw data should be read-only
a) Data should match lab notebook exactly (e.g. do not remove outliers/suspected errors)
2) Separate tables for different response components (e.g. morphology, behavior, physiology)
a) Keep tables simple (helps prevent data entry errors)
3) Every data row should have a unique identifier
a) Minimize repeating of information (keeps table simpler and less likely to misspell/mis-enter info)
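A minimal sketch of checking point 3 (every row has a unique identifier) once a table is in R. The table and the column name 'fly_id' are made-up examples, not course data.

```r
# Hypothetical morphology table; 'fly_id' is meant to be the unique identifier.
morph <- data.frame(
  fly_id  = c("F01", "F02", "F03", "F02"),
  wing_mm = c(5.1, 4.8, 5.3, 4.9),
  stringsAsFactors = FALSE
)

# anyDuplicated() returns 0 when all values are unique,
# otherwise the position of the first repeated value.
dup_index <- anyDuplicated(morph$fly_id)

# List the identifiers that occur more than once.
dup_ids <- unique(morph$fly_id[duplicated(morph$fly_id)])
print(dup_ids)
```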
-
Good practices
Data management
4) Keep track of all data files associated with a project and have a short description of what data they contain
5) Decide on naming conventions before collecting data
a) e.g. male vs. m vs. Male, Bactrocera tryoni vs. BT vs. tryoni
*can fix with string functions in R but makes life easier
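A minimal sketch of the "fix with string functions" remark, assuming the inconsistent labels below. The values and the lookup table are illustrative assumptions.

```r
# Hypothetical, inconsistently entered sex labels.
sex_raw <- c("male", "m", "Male", " M", "female", "F")

# Strip surrounding spaces, lower-case, and keep only the first letter.
first_letter <- substr(tolower(trimws(sex_raw)), 1, 1)

# Map the first letter to one agreed label; anything else becomes NA.
lookup <- c(m = "male", f = "female")
sex <- unname(lookup[first_letter])
print(table(sex, useNA = "ifany"))
```

Agreeing on one label before data collection avoids needing this clean-up step at all.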
-
Text files
Data management
Pros
Free
Small file size
No formatting
Flexibility in output
Cons
Very difficu...
Too m...
No qu...
mainly for computer output
-
Spreadsheets
Data management
Pros
Easy to get to software
Medium file size
Tabular structure
Easy for entering data
Cons
Columns are not l...
Columns can cont...
Information a... formatting cann...
Encourages no... restru...
Large datasets
No qualit...
-
Databases
Data management
Pros
Tables are defined rigorously
Columns can have only one data type
A row is a single record and cannot be broken up
Tables are linked, enforcing data integrity
Data entry forms can be created
Quality controls
Cons
Steep...
lar...
Software can...
Data entry c... than sprea...
enter into...
creating fo...
en...
Rcourse project accdb (Rcourse_proj/Rcourse_Access_example.accdb)
-
Analysis Workflow
Acquire/store Data
Data Preparation: Data Cleaning, Reformatting, Queries/merges
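A minimal sketch of the "Queries/merges" step using base R's merge(). The two tables and the key column 'fly_id' are hypothetical examples.

```r
# Two hypothetical tables that share the identifier 'fly_id'.
morph <- data.frame(fly_id = c("F01", "F02", "F03"),
                    wing_mm = c(5.1, 4.8, 5.3),
                    stringsAsFactors = FALSE)
behav <- data.frame(fly_id = c("F01", "F02", "F04"),
                    matings = c(2L, 0L, 1L),
                    stringsAsFactors = FALSE)

# Keep only the flies present in both tables.
both <- merge(morph, behav, by = "fly_id")

# Keep every row of 'morph'; flies without behaviour data get NA.
all_morph <- merge(morph, behav, by = "fly_id", all.x = TRUE)
print(all_morph)
```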
-
Why R and not other software
1) Complete analysis software
2) Free and open source
3) Developed by statisticians (and used by)
4) Popular (lots of contributions)
(Note - Negative of this is that most packages are written by nonp... so R code is often inefficient, inconsistent, and error messages are uninformative)
5) R is a functional programming language
6) Lots of data types (e.g. image, spatial/GIS, genetics)
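A minimal sketch of what "functional programming language" can mean in practice: functions are ordinary objects that can be passed to other functions. The function names here are made up for illustration.

```r
# A function is stored in a variable like any other object.
square <- function(x) x^2

# A function can take another function as an argument.
apply_twice <- function(f, x) f(f(x))
result <- apply_twice(square, 3)   # (3^2)^2

# sapply() applies a function to each element of a vector.
squares <- sapply(1:5, square)
print(result)
print(squares)
```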
-
Why R and not other software
R vs SAS vs. SPSS vs. MATLAB
(For a comparison http://stanfordphd.com/Statistical_Softwar
-
>4500 packages
Figures: Number of Packages Over Time; Ranking of Statistics So...
-
GUIs/Editors
1) R console [ default editor ] (not recommended)
2) Rstudio [we will use this one]
3) TinnR
4) Vim-R
5) Emacs
6) Others
http://stackoverflow.com/questions/1173463/recommendations-fotext-editor-for-r
-
Rstudio overview
Complete R interface
a) Editor with colour coding and tab autofill/help pages
b) Integrated plots
c) R console
d) History
e) Version control (e.g. git or subversion)
f) Variable list
Rstudio example (lecture 1.Rproj)
-
Getting Started with R
Rprofile.site, .Rprofile
These files are sourced on startup of R
Rprofile.site is sourced first. e.g. C:\Program Files\R\R-3.1.0\etc\
.Rprofile in Local (project location) is then sourced
-
Getting Started with R
What to put in Rprofile.site, .Rprofile
Packages you commonly use [ e.g. library(ggplot2) ]
I source my list of R functions (stored in My Documents - which is p... data backing up)
- For some more tips, see http://stackoverflow.com/questions/1189759/expert-r-useryour-rprofile
- http://www.r-bloggers.com/customize-your-rprofile-and-keep-your-workspace-clean
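A minimal sketch of what a project-level .Rprofile might contain, following the two suggestions above. The option chosen, the file name 'project_functions.R' and its location are assumptions for illustration only.

```r
# Sketch of a project .Rprofile; all names below are illustrative.

# Set an option you always want in this project.
options(digits = 4)

# Load a commonly used package, but only if it is installed.
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
}

# Source a file of personal helper functions if it exists
# ('project_functions.R' is a hypothetical file name).
fun_file <- file.path("R programs", "project_functions.R")
if (file.exists(fun_file)) {
  source(fun_file)
}
```

Guarding each step with a check keeps R from failing at startup when a package or file is missing.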
-
R scripts
Setting up your Script
1. Header describing file
2. Global variables
a) e.g. db_dir
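A minimal sketch of a script laid out this way. Only 'db_dir' comes from the slide; the file name, the header fields and 'out_dir' are hypothetical.

```r
#-----------------------------------------------------------
# File:    lecture1_example.R   (hypothetical file name)
# Purpose: short description of what this script does
# Author:  your name
# Date:    date created
#-----------------------------------------------------------

# Global variables -----------------------------------------
db_dir  <- "data"     # folder holding the data files
out_dir <- "output"   # hypothetical folder for results

# Section 1: list the available data files ------------------
data_files <- list.files(db_dir)
print(data_files)
```

Keeping paths in global variables at the top means only one line changes when a folder is renamed.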
-
Script layout: Header, Global settings, Section 1
-
R Resources
Online
Google
Stackoverflow (general programming - stackoverflow.com)
Cross validated (statistics - stats.stackexchange.com/)
Quick-R (http://www.statmethods.net/)
Books (plenty out there to choose from; lots of redundancy)
R cookbook
The R book (the big book)
Data Analysis and Graphics using R
A Beginner's Guide to R (by Zuur)
(see http://www.r-bloggers.com/r-programming-books-updated/ for more)
-
Next Week
R Core Concepts
R as matrix language
R objects and classes
Importing data
-
Lecture 1: Hands on Section
-
Setup directories for Rcourse_proj
1) create folder '/Rcourse_proj' [pick wherever you want]
2) create folder '/Rcourse_proj/R programs'
3) create folder '/Rcourse_proj/data'
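The same folders can also be created from R. This is a minimal sketch that builds them inside a temporary directory so it is safe to run anywhere; replace tempdir() with the location you picked.

```r
# Create the course folder structure from R.
# tempdir() is used here only so the example runs anywhere.
proj_dir <- file.path(tempdir(), "Rcourse_proj")

dir.create(proj_dir, showWarnings = FALSE)
dir.create(file.path(proj_dir, "R programs"), showWarnings = FALSE)
dir.create(file.path(proj_dir, "data"), showWarnings = FALSE)

# Check what was created.
print(list.files(proj_dir))
```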
-
Create Rstudio project
1) Open Rstudio
2) Create new project associated with directory /Rcourse_proj
-
Initialize .Rprofile
Create .Rprofile
- In Windows, this is slightly harder than it should be. Open Rcourse_... following code
sink('.Rprofile') # create a new file called .Rprofile
sink() # this creates a file in your working directory (make sure directory...
Add the following text to .Rprofile (open file in text editor to add...
printResult
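An alternative sketch for creating the file, assuming base R's file.create() and writeLines() rather than sink(). It writes into a temporary directory, and the line written to the file is only a placeholder, not the text from the slide.

```r
# Create an empty .Rprofile and add one line to it.
# tempdir() is used so the example does not touch a real project.
profile_path <- file.path(tempdir(), ".Rprofile")

file.create(profile_path)                          # make the empty file
writeLines("options(digits = 4)", profile_path)    # placeholder content

# Read the file back to confirm what it contains.
profile_text <- readLines(profile_path)
print(profile_text)
```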
-
Create an R script
Make a new R script
Save as 'R programs/lectu...
-
Load a library/package
First install the package from CRAN (R website)...
install.packages('ggplot2') # can use Tools>Install packages... in R...
Next load the package...
library(ggplot2) # or require(ggplot2)
example(ggplot) # see that it works
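A minimal sketch of an "install only if missing" pattern, so a script does not re-download a package on every run. The package 'stats' is used here only because it ships with R; substitute the package you need.

```r
# Install a package only when it is not already available, then load it.
pkg <- "stats"   # placeholder; 'stats' is always installed with R

if (!requireNamespace(pkg, quietly = TRUE)) {
  install.packages(pkg)
}

# character.only = TRUE lets library() take the name from a variable.
library(pkg, character.only = TRUE)
```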
-
Basic R aspects
1) comments are indicated by #
# this is a comment and will not be run
2)
-
Basic R aspects
5) Logical
x
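A minimal sketch of logical values and comparisons in R. The variable names and values are illustrative assumptions, not the example from the slide.

```r
x <- 5

# Comparisons return the logical values TRUE or FALSE.
is_big   <- x > 3                 # TRUE
is_ten   <- x == 10               # FALSE

# & means "and", | means "or", ! means "not".
in_range <- (x > 3) & (x < 10)    # TRUE
either   <- (x < 3) | (x == 5)    # TRUE
not_big  <- !(x > 3)              # FALSE

# Comparisons work element by element on vectors.
v <- c(1, 5, 10)
at_least_5 <- v >= 5              # FALSE TRUE TRUE
print(at_least_5)
```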
-
Common R mistakes
1) '\' vs '/' in file paths
Windows uses '\', R uses '/'
2) R is case sensitive: Mean() is not mean() # use tab autofill in Rstudio to avoid this
3) '=' vs. '==' # latter is comparing two items logically
x = 5 # sets x to 5
x == 10 # compares x to 10
4) typing to next line
need ',' or operator to indicate line is not complete
x
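A minimal sketch of mistake 4, showing how the position of the operator changes the result when an expression runs over two lines. The numbers are arbitrary.

```r
# The line ends with an operator, so R knows the expression continues.
total <- 1 + 2 +
  3

# The first line is already complete, so R stops there;
# the second line is evaluated on its own as +3 and is not added.
partial <- 1 + 2
  + 3

print(total)
print(partial)
```

Here total is 6 while partial is 3, and R gives no warning about the second case.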
-
Global attributes
options() # get a list of current options
options(stringsAsFactors = TRUE) # will explain this one in next lecture
directories
setwd('R programs') # set working directory to R programs
getwd() # see what is your working directory
dir() # list of contents of directory
setwd('..') # go up a level
dir.create('query') # create a new folder called query in current folder
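A minimal sketch that runs the directory functions above in a temporary folder, so nothing in a real project is changed. The folder name 'wd_demo' is hypothetical.

```r
# Remember where we started so we can go back afterwards.
old_wd <- getwd()

# Work inside a temporary folder ('wd_demo' is a made-up name).
demo_dir <- file.path(tempdir(), "wd_demo")
dir.create(demo_dir, showWarnings = FALSE)
setwd(demo_dir)

dir.create("query", showWarnings = FALSE)   # new folder in the current folder
print(dir())                                # list contents of the current folder

# Return to the original working directory.
setwd(old_wd)
```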
-
Some useful functions/objects
ls(), rm()
ls() # get a list of objects
rm() # delete one or more objects
rep(), seq()
rep(x=5, times=10) # repeat 5 ten times
seq(from=0, to=1, by=0.1) # make a sequence
LETTERS, letters
LETTERS # capitalized alphabet
letters # lowercase alphabet
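A minimal sketch that tries these functions and built-in objects together. The object name 'scratch_value' is made up for the example.

```r
# rep() and seq() build vectors.
fives <- rep(x = 5, times = 10)            # ten copies of 5
steps <- seq(from = 0, to = 1, by = 0.1)   # 0, 0.1, ..., 1 (11 values)

# Built-in constants can be indexed like any vector.
first_letters <- LETTERS[1:3]              # "A" "B" "C"
first_months  <- month.abb[1:3]            # "Jan" "Feb" "Mar"

# Create an object, confirm it is listed, then delete it.
scratch_value <- 1
print("scratch_value" %in% ls())
rm(scratch_value)
```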
-
Some useful functions/objects
month.abb, month.name
month.abb # list of abbreviated month names
month.name # list of month names
pi
pi # 3.141593
-
Functions Learned
Functions
getwd()              c()
setwd()              ls()
library()            class(), str()
*, /, +, -, ^, &, |  dir(), list.files()
options()            dir.create()
source()             rep()
rm()                 seq()