r course 2014: lecture 1

Upload: gceid

Post on 02-Jun-2018


  • 8/11/2019 R Course 2014: Lecture 1


    Lecture 1: Overview of Workflow and Why R

    Ben Fanson

    Simeon Lisovski


    Workshop Structure

    1) 10 weeks, meeting weekly
       a) Tuesday, 10am, green room

    2) Sessions are broken up into 2 parts
       a) Lecture part (20 - 45 min)
       b) Hands-on section (20 - 45 min)

    3) Informal setting (ask questions...)


    Our background

    Ben Fanson
    - Masters in Statistics
    - converted to R (2 years ago) from SAS

    Simeon Lisovski
    - R programmer for years
    - has his own R package


    Workshop Goals

    1) Development of an analysis workflow

    2) Become proficient in R fundamentals for data analysis


    Disclaimer

    1) This is NOT a statistics workshop

    2) There are many, many ways to do the same thing in R. Our goal is that you know how to do it one way.

    3) We do not focus on programming efficiency

    4) This workshop is a work in progress...


    Lecture 1 Goals

    1) Convince you that you should care

    2) overview of data analysis workflow

    3) Why R?

    4) R editors (Rstudio)

    5) Getting started with R


    Workflow: Reasons to Care

    1) Research more reproducible

    2) Efficiency

    3) Trend towards submitting code


    Reasons to Care

    Efficiency
    - ~80% of my data analysis is data cleaning, restructuring, s... and visualizing
    - Re-running analyses after finding errors, new transformations, adding more data
    - Reviewers requesting a specific analysis
    - Easier to go back to old code and figure out your process


    Reasons to Care

    R script and journals

    endvai et al. 2013 Proc Roy Soc: R code, R gr...


    Reasons to Care

    R script and journals

    arn et al. 2014 Proc Roy Soc Population-level


    Analysis Workflow

    Acquire/store Data

    Data Preparation
    - Data Cleaning
    - Reformatting
    - Queries/merges

    Analysis
    - Statistical Methods
    - Asse...

    Write-up
    - Tables
    - Figures
    - Reports
    - Datasets


    Attributes of a good workflow

    Transparency
    - organized, logical, well documented
    - e.g. commented code, structured programs, organized folders

    Modularity
    - Keep scripts simple (not too many tasks per script), have re-usable ... in one location
    - e.g. file with project functions

    Portability
    - Make it easy to share scripts (collaborators, reviewers)
    - e.g. relative pathnames, nested folders

    http://stats.stackexchange.com/questions/2910/how-to-efficiently-manage-a-statistical-analysis-pro

    http://blog.revolutionanalytics.com/2010/10/a-workflow-for-r.html
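As a rough illustration of the modularity and portability points, the sketch below builds paths relative to the project folder and sources a shared functions file only if it exists. The file names `measurements.csv` and `project_functions.R` are made up for the example; the folder names follow the ones used later in the hands-on section.

```r
# Paths are written relative to the project folder, so the same script can run
# on a collaborator's machine without editing.
data_file <- file.path("data", "measurements.csv")            # hypothetical file
func_file <- file.path("R programs", "project_functions.R")   # hypothetical file

# Re-usable code lives in one file and is sourced where it is needed.
if (file.exists(func_file)) source(func_file)

# An absolute path, by contrast, only works on one machine:
# read.csv("C:/Users/me/Desktop/measurements.csv")

data_file   # "data/measurements.csv"
```

file.path() always joins with '/', which R accepts on Windows as well.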


    Analysis Workflow

    Acquire/store Data

    Data Preparation


    Data management

    Lots of systems available
    - Text files
    - Spreadsheets (e.g. Excel)
    - Relational databases


    Data management

    Good practices

    1) Raw data should be read-only
       a) Data should match lab notebook exactly (e.g. do not remove outliers/suspected errors)

    2) Separate tables for different response components (e.g. morphology, behavior, physiology)
       a) Keep tables simple (helps prevent data entry errors)

    3) Every data row should have a unique identifier
       a) Minimize repeating of information (keeps table simpler and less likely to misspell/mis-enter info)


    Data management

    Good practices

    4) Keep track of all data files associated with a project and have a short description of what data they contain

    5) Decide on naming conventions before collecting data
       a) e.g. male vs. m vs. Male, Bactrocera tryoni vs. BT vs. ... tryoni
          * can fix with string functions in R but makes life e...
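A small sketch of two of the practices above: cleaning up inconsistent labels with string functions, and keeping separate tables that share a unique identifier. The data, the column names and the choice to reduce every label to its first letter are all assumptions made for illustration.

```r
# Inconsistent labels as they might be typed during data entry (made-up data).
animals <- data.frame(
  id  = c("A01", "A02", "A03", "A04"),
  sex = c("male", "m", "Male", "F"),
  stringsAsFactors = FALSE
)

# Trim spaces, lower-case, then keep the first letter:
# "male", "m" and "Male" all become "m".
animals$sex <- substr(tolower(trimws(animals$sex)), 1, 1)

# A second table holds a different response component, keyed by the same id.
morph <- data.frame(
  id      = c("A01", "A02", "A03", "A04"),
  wing_mm = c(5.1, 4.9, 5.3, 5.6),
  stringsAsFactors = FALSE
)

# merge() joins the two tables on the unique identifier.
combined <- merge(animals, morph, by = "id")
combined
```

Fixing labels afterwards works, but agreeing on one spelling before data entry avoids the step altogether.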


    Data management

    Text files (mainly for computer output)

    Pros
    - Free
    - Small file size
    - No formatting
    - Flexibility in output

    Cons
    - Very difficu...
    - Too m...
    - No qu...


    Data management

    Spreadsheets

    Pros
    - Easy to get to software
    - Medium file size
    - Tabular structure
    - Easy for entering data

    Cons
    - Columns are not l...
    - Columns can cont...
    - Information a... formatting cann...
    - Encourages no... restru...
    - Large datasets
    - No qualit...


    Data management

    Databases

    Pros
    - Tables are defined rigorously
    - Columns can have only one data type
    - A row is a single record and cannot be broken up
    - Tables are linked, enforcing data integrity
    - Data entry forms can be created
    - Quality controls

    Cons
    - Steep...
    - lar...
    - Software can...
    - Data entry c... than sprea... enter into... creating fo... en...

    Rcourse project accdb (Rcourse_Access_example.accdb)

    Why R and not other software

    1) Complete analysis software

    2) Free and open source

    3) Developed by statisticians (and used by them)

    4) Popular (lots of contributions)
       (Note - the negative of this is that most packages are written by nonp..., so R code is often inefficient and inconsistent, and error messages are not informative)

    5) R is a functional programming language

    6) Lots of data types (e.g. image, spatial/GIS, genetics)


    Why R and not other software

    R vs. SAS vs. SPSS vs. MATLAB

    (For a comparison, see http://stanfordphd.com/Statistical_Softwar


    [Figures: Number of Packages Over Time (>4500 packages); Ranking of Statistics So...]


    GUIs/Editors

    1) R console [ default editor ] (not recommended)

    2) Rstudio [we will use this one]

    3) TinnR

    4) Vim-R

    5) Emacs

    6) Others

    http://stackoverflow.com/questions/1173463/recommendations-fotext-editor-for-r


    Rstudio overview

    Complete R interface

    a) Editor with colour coding and tab autofill/help pages

    b) Integrated plots

    c) R console

    d) History

    e) Version control (e.g. git or subversion)

    f) Variable list

    Rstudio example


    Getting Started with R

    Rprofile.site, .Rprofile

    These files are sourced on startup of R

    Rprofile.site is sourced first, e.g. C:\Program Files\R\R-3.1.0\etc\

    .Rprofile in the local (project) location is then sourced


    Getting Started with R

    What to put in Rprofile.site, .Rprofile

    Packages you commonly use [ e.g. library(ggplot2) ]

    I source my list of R functions (stored in My Documents - which is p... data backing up)

    - For some more tips, see http://stackoverflow.com/questions/1189759/expert-r-useryour-rprofile

    - http://www.r-bloggers.com/customize-your-rprofile-and-keep-your-workspace-clean
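One possible shape for a project-level .Rprofile, following the two suggestions above. This is only a sketch: the helper `load_if_installed()` and the file name `my_functions.R` are invented for the example and are not part of the course material.

```r
# Sketch of a project-level .Rprofile (names below are illustrative).

# Load a package only if it is installed, so that a missing package
# does not abort R startup.
load_if_installed <- function(pkg) {
  if (requireNamespace(pkg, quietly = TRUE)) {
    suppressPackageStartupMessages(library(pkg, character.only = TRUE))
    TRUE
  } else {
    message("Package not installed: ", pkg)
    FALSE
  }
}

# Packages used in most sessions.
loaded <- vapply(c("ggplot2"), load_if_installed, logical(1))

# Source a personal file of helper functions if it exists.
my_functions <- file.path("R programs", "my_functions.R")   # hypothetical file
if (file.exists(my_functions)) source(my_functions)
```

Keeping the profile short makes it easier to see what a fresh session already contains.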


    R scripts

    Setting up your Script

    1. Header describing file

    2. Global variables
       a) e.g. db_dir


    [Example script layout: Header, Global settings, Section 1]
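A script laid out in that order might look roughly like the sketch below. The file name, the data frame and the section names are placeholders; only `db_dir` comes from the slides.

```r
##############################################################
# File:    lecture1_example.R   (hypothetical name)
# Purpose: short description of what this script does
# Author:  your name
# Date:    date created / last modified
##############################################################

#---- Global settings ----
db_dir <- file.path("data")          # where input data live (relative path)
options(stringsAsFactors = FALSE)

#---- Section 1: import data ----
# In a real script this would read a file from db_dir, for example
# dat <- read.csv(file.path(db_dir, "my_data.csv"))
dat <- data.frame(id = 1:3, mass = c(10.2, 11.5, 9.8))   # made-up values

#---- Section 2: summarise ----
mean_mass <- mean(dat$mass)
mean_mass                            # 10.5
```

Putting every path and setting in one block at the top means a collaborator has a single place to edit.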


    R Resources

    Online
    - Google
    - Stackoverflow (general programming - stackoverflow.com)
    - Cross Validated (statistics - stats.stackexchange.com/)
    - Quick-R (http://www.statmethods.net/)

    Books (plenty out there to choose from; lots of redundancy)
    - R Cookbook
    - The R Book (the big book)
    - Data Analysis and Graphics Using R
    - A Beginner's Guide to R (by Zuur)

    (see http://www.r-bloggers.com/r-programming-books-updated/ for more)

    Next Week

    R Core Concepts

    R as matrix language

    R objects and classes

    Importing data


    Lecture 1: Hands-on Section


    Setup directories for Rcourse_proj

    1) create folder '/Rcourse_proj' [pick wherever you want]

    2) create folder '/Rcourse_proj/R programs'

    3) create folder '/Rcourse_proj/data'
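The same three folders can also be created from R itself. The sketch below uses a temporary directory as the base location, which is purely an assumption so the example can run anywhere; substitute whichever location you picked.

```r
# Base location is an assumption; pick wherever you want.
base_dir <- tempdir()
proj_dir <- file.path(base_dir, "Rcourse_proj")

dir.create(proj_dir, showWarnings = FALSE)
dir.create(file.path(proj_dir, "R programs"), showWarnings = FALSE)
dir.create(file.path(proj_dir, "data"), showWarnings = FALSE)

list.files(proj_dir)   # lists the folders just created
```

showWarnings = FALSE keeps the call quiet when a folder already exists, so the lines can be re-run safely.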


    Create Rstudio project

    1) Open Rstudio

    2) Create new project associated with directory /Rcourse_pr


    Initialize .Rprofile

    Create .Rprofile

    - in Windows, this is slightly harder than it should be. Open Rcoure_following code

    sink('.Rprofile') # create a new file called .Rprofile

    sink() #this creates a file in your working directory (make sure directory

    Add following text to .Rprofile (open file in text editor to add

    printResult
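An alternative to the sink() approach, sketched here under the assumption that base R's file.create() and writeLines() are acceptable substitutes. The text the slide asks you to add is cut off in the source, so the line written below is only a placeholder, and the temporary folder stands in for the real project directory.

```r
# Work in a temporary folder so the sketch does not touch a real project.
proj_dir <- file.path(tempdir(), "Rcourse_proj_demo")   # hypothetical location
dir.create(proj_dir, showWarnings = FALSE)
rprofile <- file.path(proj_dir, ".Rprofile")

# file.create() makes an empty file (and empties it if it already exists).
created <- file.create(rprofile)

# writeLines() adds text without opening an editor (placeholder content).
writeLines('message("Project .Rprofile loaded")', rprofile)

readLines(rprofile)
```

Files whose names start with a dot are hidden by default in many file browsers; list.files(all.files = TRUE) shows them from R.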


    Create an R script

    Make a new R script

    Save as 'R programs/lectu


    Load a library/package

    First install the package from CRAN (R website)...

    install.packages('ggplot2') # can use Tools>Install packages... in R

    Next load the package...

    library(ggplot2) # or require(ggplot2)
    example(ggplot) # see that it works
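A script that may run on a machine without the package can check first. The helper name `is_installed()` is made up for this sketch; requireNamespace() is the base R function doing the work.

```r
# TRUE if the package can be loaded, FALSE otherwise (no error is raised).
is_installed <- function(pkg) requireNamespace(pkg, quietly = TRUE)

pkg <- "ggplot2"
if (is_installed(pkg)) {
  library(pkg, character.only = TRUE)   # attach it for this session
} else {
  message("Run install.packages('", pkg, "') first")
}
```

On the library() versus require() comment above: library() stops with an error when the package is missing, while require() returns FALSE with a warning, so library() is usually the safer choice at the top of a script.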


    Basic R aspects

    1) Comments are indicated by #

    # this is a comment and will not be run

    2)


    Basic R aspects

    5) Logical

    x
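The slide's own example is cut off in this transcript. As a stand-in, here is a minimal sketch of logical values using the operators listed at the end of the lecture (&, |); the vector `x` and its values are arbitrary.

```r
x <- c(2, 5, 10)

is_big <- x > 4          # element-wise comparison gives a logical vector
is_big                   # FALSE  TRUE  TRUE
class(is_big)            # "logical"

is_big & x < 10          # AND: FALSE  TRUE FALSE
is_big | x < 3           # OR:   TRUE  TRUE  TRUE
!is_big                  # NOT:  TRUE FALSE FALSE

sum(is_big)              # TRUE counts as 1, so this is 2
```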


    Common R mistakes

    1) '\' vs '/' in file paths
       Windows uses '\', R uses '/'

    2) R is case sensitive: Mean() is not mean() # use autofill tab in Rstudio to avoid this

    3) '=' vs. '==' # latter is comparing two items logically
       x = 5 # sets x to 5
       x == 10 # compares x to 10

    4) typing to next line
       need ',' or operator to indicate line is not complete

    x
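The four mistakes can be seen side by side in a short sketch. The path and the numbers are invented for the example.

```r
# 1) File paths: use '/' (or a doubled '\\'); file.path() builds them for you.
p <- file.path("Rcourse_proj", "data", "example.csv")   # hypothetical file name
p                                   # "Rcourse_proj/data/example.csv"

# 2) Case sensitivity: mean() exists; Mean() is a different name.
exists("mean")                      # TRUE
exists("Mean")                      # FALSE unless you or a package defined it

# 3) Assignment versus comparison.
x = 5                               # sets x to 5 ('<-' is the more common style)
x == 10                             # FALSE

# 4) Continuing onto the next line: end the line with an operator or ','.
total <- 1 + 2 +
  3                                 # one expression, total is 6
wrong <- 1 + 2
  + 3                               # a separate expression; 'wrong' stays 3
```

In case 4 R raises no error, which is what makes the mistake easy to miss.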


    Some useful functions/objects

    Global attributes

    options() # get a list of current options
    options( stringsAsFactors=TRUE ) # will explain this one in next lecture

    directories

    setwd('R programs') # set working directory to R programs
    getwd() # see what is your working directory
    dir() # list of contents of directory
    setwd('..') # go up a level
    dir.create('query') # create a new folder called query in current folder
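Both options() and setwd() hand back the previous value, which makes it easy to undo a change. A minimal sketch, using `digits` as an arbitrary example option and a temporary directory in place of the project folder:

```r
# options() returns the previous values, so a change can be undone.
old_opts <- options(digits = 4)
digits_during <- getOption("digits")   # 4
options(old_opts)                      # restore the earlier setting

# setwd() likewise returns the previous working directory.
old_wd <- setwd(tempdir())
dir.create("query", showWarnings = FALSE)
has_query <- "query" %in% dir()        # TRUE
setwd(old_wd)                          # go back to where we started
```

Restoring the earlier state matters most inside functions and shared scripts, where a changed option or directory would otherwise leak into later code.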


    Some useful functions/objects

    ls(), rm()

    ls() # get a list of objects
    rm() # delete one or more objects

    rep(), seq()

    rep(x=5, times=10) # repeat 5 ten times
    seq(from=0, to=1, by=0.1) # make a sequence

    LETTERS, letters

    LETTERS # capitalized alphabet
    letters # lowercase alphabet


    Some useful functions/objects

    month.abb, month.name

    month.abb # list of abbreviated month names

    month.name # list of month names

    pi

    pi # 3.141593
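A quick way to get a feel for these functions and built-in objects is to run them and look at a few elements. The object names `fives`, `steps` and `tmp` are arbitrary.

```r
fives <- rep(x = 5, times = 10)            # ten copies of 5
steps <- seq(from = 0, to = 1, by = 0.1)   # 0.0, 0.1, ..., 1.0
length(fives)                              # 10
length(steps)                              # 11

LETTERS[1:3]                               # "A" "B" "C"
letters[24:26]                             # "x" "y" "z"
month.abb[1:3]                             # "Jan" "Feb" "Mar"
month.name[12]                             # "December"
round(pi, 2)                               # 3.14

tmp <- 1
ls()                                       # lists objects, including tmp
rm(tmp)                                    # tmp is now gone
```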


    Functions Learned

    Functions

    getwd()                  c()
    setwd()                  ls()
    library()                class(), str()
    *, /, +, -, ^, &, |      dir(), list.files()
    options()                dir.create()
    source()                 rep()
    rm()                     seq()
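A few of the listed functions in action, as a short recap sketch; the vectors are made up.

```r
v <- c(1.5, 2, 3.5)        # c() combines values into a vector
class(v)                   # "numeric"
str(v)                     # compact description of the object

v * 2                      # 3 4 7
v ^ 2                      # 2.25  4.00 12.25

words <- c("a", "b")
class(words)               # "character"
```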