  • 8/3/2019 A Gentle Introduction to R

    1/30

    A Gentle Introduction to R

    Michael A. Saum

    School of Science and Technology, Georgia Gwinnett College

    Lawrenceville, Georgia

    October 3, 2011

    M. Saum (SST/GGC) A Gentle Introduction to R 10/03/11 1 / 30

    Outline

    1 Introduction

    2 The R Environment

    3 Variables and Operators

    4 Input and Output

    5 Simple Functions and Programming

    6 A Simple Application

    7 Concluding Remarks

    Introduction What is R?

    Background

    R is a software application which combines a rich set of numerical and statistical algorithms with good visualization capabilities.

    R is an extension of the S and Splus languages.

    R is available for a wide variety of hardware and software platforms.

    R is Open Source and is freely available on the web, with many user contributed packages also available.

    Other statistical analysis tools such as Matlab, SPSS, SAS, Stata, and Excel are commercial products that must be purchased.

    Note: Comparisons of statistical analysis software are located here: http://brenocon.com/blog/2009/02/comparison-of-data-analysis-packages-r-matlab-scipy-excel-sas-spss-stata/ and http://en.wikipedia.org/wiki/Comparison_of_statistical_packages .

    Similarities with Matlab

    R and Matlab share several attributes:

    Interpreted language

    Interactive console

    Script and user defined function capability

    Easily extensible, with many user contributed packages

    Object based operations (vectors, matrices, data frames, linear regression models)

    Note:

    For a nice detailed R - Matlab comparison, see http://mathesaurus.sourceforge.net/octave-r.html.

    Differences with Matlab

    The R program is different from Matlab primarily in the following areas:

    R has a more robust set of statistical packages available than Matlab

    The R language uses a different syntax, which takes some getting used to

    Graphics capabilities are not as user friendly as those in Matlab

    R is FREE!

    When to use R

    Here are some guidelines I use to determine if R is the right tool for the job:

    Have to do quick, preliminary data analysis and Matlab is not available.

    Data requires sophisticated statistical analysis.

    Data is stored in a large database (e.g., MySQL) and requires statistical analysis.

    There is a package of routines which already exists and I need to use it.

    When not to use R

    While R can be used for most data analysis needs, there are times when R may not be the best tool for the job. These include:

    Huge (gigabytes of data) datasets to process.

    Large display datasets (for example, a sequence of three dimensional surfaces).

    Plotting interactively in three dimensions is not R's strongest point.

    Large scale programming projects.

    Parallel programming.

    Resources on the Web

    The main R information page: http://www.r-project.org/ .

    The Comprehensive R Archive Network (CRAN): http://cran.r-project.org/ . You can download the latest version of R from this site. A list of available packages is also on this site.

    R wiki: http://wiki.r-project.org/ .

    Tips and Tricks: http://pj.freefaculty.org/R/Rtips.html . This is a great collection of miscellaneous tips and tricks. If you don't know how to do something, this is a good place to see if it has been done before (and was documented). This was also at one time mirrored on the R wiki.

    R Cheat Sheet: http://cran.r-project.org/doc/contrib/Short-refcard.pdf . Excellent summary of the main commands. I suggest printing it out and keeping it handy when working in R.

    The R Environment User Interface

    Questions to Answer, Choices to be Made

    Linux, Mac, or Windows? There are peculiarities with each interface. You will have to become comfortable and familiar with the system you use R on.

    Is R installed? If not, you will need to download (see the main R information page on CRAN) and install.

    Interactive or batch? If you have a large set of data to analyze with a complicated (or long) sequence of R commands, it usually is more efficient to put the program into a file and run R in batch mode as a script. More on this later.

    Output choices? What type of output is desired? Do I need graphics generated? In what format? What type of data output will be generated?

    The R console window

    The initial console window

    R version 2.11.1 (2010-05-31)

    Copyright (C) 2010 The R Foundation for Statistical Computing

    ISBN 3-900051-07-0

    R is free software and comes with ABSOLUTELY NO WARRANTY.

    You are welcome to redistribute it under certain conditions.

    Type license() or licence() for distribution details.

    R is a collaborative project with many contributors.

    Type contributors() for more information and

    citation() on how to cite R or R packages in publications.

    Type demo() for some demos, help() for on-line help, or

    help.start() for an HTML browser interface to help.

    Type q() to quit R.

    >

    Getting help

    General Help

    help()

    HTML based Help

    help.start()

    Specific Help

    help(plot)

    Search Help

    help.search("plot")
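    A few related lookups can complement the help commands above. This is a minimal sketch that assumes only base R; the search pattern and the variable name read_funcs are illustrative choices, not part of the original slides.

```r
# Shorthand forms for interactive use:
#   ?plot    is equivalent to help(plot)
#   ??plot   is equivalent to help.search("plot")
#   example(mean) runs the examples from the help page of mean()

# apropos() lists objects on the search path whose names match a pattern.
read_funcs <- apropos("^read\\.")
print(read_funcs)

# args() shows the argument list of a function without opening its help page.
print(args(read.csv))
```

    apropos() is handy when only part of a function name is remembered, while help.search() looks through the documentation text itself.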

    Quitting R

    I can't quit R! Typing q without parentheses prints the function's definition instead of calling it.

    > q

    function (save = "default", status = 0, runLast = TRUE)

    .Internal(quit(save, status, runLast))

    >

    The proper way

    > q()

    Save workspace image? [y/n/c]:

    Answering n will not save any of the variables changed during the session.

    Answering y will save all of the variables currently defined and will bring them into memory the next time R is started up.

    Answering c will cancel the quit and continue the R session.

    quit() also works.

    Variables and Operators

    Variables in general

    Assignment of data to variables is done with the <- operator.

    Workspace Variables

    Inspect the local variables

    > ls()

    [1] "a" "b" "c"

    Note that here there are three local variables defined.

    Remove all local variables

    > rm(list=ls()); ls()

    character(0)

    Note that there are now no local variables defined. I use this statement as the first statement of most of my R program files.
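    The same listing and clearing can be tried without touching your own workspace. The sketch below is one way to do that, assuming a private environment (the name ws is hypothetical) stands in for the workspace.

```r
# Use a private environment so this sketch does not wipe your own workspace.
ws <- new.env()
assign("a", 1.4, envir = ws)
assign("b", c(1.5, 1.6, 1.8), envir = ws)
assign("c", 1:30, envir = ws)

# List the names defined in the environment.
print(ls(ws))

# Remove a single variable by name and confirm that it is gone.
rm("a", envir = ws)
a_still_there <- exists("a", envir = ws, inherits = FALSE)
print(a_still_there)

# Remove everything that is left, as rm(list=ls()) does for the workspace.
rm(list = ls(ws), envir = ws)
remaining <- length(ls(ws))
print(remaining)
```

    Called without an environment argument, ls() and rm() act on the current workspace, which is the form used on the slide.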

    Scalars and Vectors

    Assigning values to variables

    > a <- 1.4

    > b <- c(1.5, 1.6, 1.8)

    > c <- 1:30

    > a

    [1] 1.4

    > b

    [1] 1.5 1.6 1.8

    > c

    [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25

    [26] 26 27 28 29 30

    Note that just typing the variable's name will print out what is contained in it (just like Matlab).
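    Once vectors are defined, most operations apply to every element at once. A minimal sketch using values like those above; the variable names are illustrative only.

```r
b <- c(1.5, 1.6, 1.8)
v <- 1:30

# Arithmetic is applied element by element.
doubled <- b * 2

# Indexing starts at 1, not 0.
second <- b[2]

# A logical condition inside [] selects the matching elements.
evens <- v[v %% 2 == 0]

# Common summaries of a vector.
total <- sum(v)
n <- length(v)

print(doubled)
print(second)
print(evens)
print(total)
print(n)
```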

    Matrices

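    Assigning values to variables

    > mdat <- matrix(c(1,2,3, 11,12,13), nrow = 2, ncol = 3, byrow = TRUE,

    +                dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3")))

    > mdat

         C.1 C.2 C.3

    row1   1   2   3

    row2  11  12  13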

    Note the flexible method of passing named parameters to the function.
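    Elements of such a matrix can be reached by position or by name. The sketch below rebuilds a matrix with the same values and dimension names as the one printed above; the call and the variable names are assumptions made for illustration.

```r
mdat <- matrix(c(1, 2, 3, 11, 12, 13), nrow = 2, ncol = 3, byrow = TRUE,
               dimnames = list(c("row1", "row2"), c("C.1", "C.2", "C.3")))

# Dimensions as c(rows, columns).
dims <- dim(mdat)

# Access a single element by row and column name.
elem <- mdat["row2", "C.3"]

# Leaving an index empty selects the whole row (or column).
first_row <- mdat[1, ]

# Transpose and column sums.
tr <- t(mdat)
col_sums <- colSums(mdat)

print(dims)
print(elem)
print(first_row)
print(tr)
print(col_sums)
```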

    Data Frames

    Data frames (class data.frame) are tightly coupled collections of variables which share many of the properties of matrices and of lists, and are used as the fundamental data structure by most of R's modeling software.

    One can combine multiple vectors containing different types of data (character, real, integer, boolean) into one data.frame as long as the lengths of all of the vectors are the same. See the data.frame() and cbind() commands.

    Accessing a named column of a data frame as a single vector is easy. For example, if data frame A has a column named height, one would type A$height to access that data as a vector.
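    As a small illustration of these points, the sketch below builds a data frame from vectors of different types and pulls out the height column. All of the data values and the extra column names are made up for the example.

```r
# Four vectors of equal length but different types (made-up values).
A <- data.frame(
  name   = c("Ann", "Bob", "Cy"),
  height = c(1.62, 1.80, 1.75),
  age    = c(31L, 45L, 28L),
  member = c(TRUE, FALSE, TRUE),
  stringsAsFactors = FALSE
)

# Access one column as a plain vector.
h <- A$height
mean_h <- mean(h)

# Select the rows that satisfy a condition.
tall <- A[A$height > 1.7, ]

# cbind() on a data frame adds a column and keeps the column types.
A2 <- cbind(A, weight = c(58, 82, 70))

print(A)
print(mean_h)
print(tall)
print(names(A2))
```

    Note that cbind() applied to bare vectors of mixed types returns a character matrix, so data.frame() is the safer starting point when the types differ.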

    Miscellaneous

    The only way to learn R is to experiment with R.

    Utilize the web to search for similar examples.

    Utilize the examples present with most help topics.

    Be patient, and don't try to work with large datasets the first time.

    Use the 4 page R Reference Card.

    Know your data and have clear goals on what you are trying to achieve.

    Excellent graphics capabilities are contained in the lattice package.

    A very good reference is Lattice: Multivariate Data Visualization with R by Deepayan Sarkar.

    Input and Output

    Reading in from a dataset

    read.table reads a file in table format (rows and columns) and creates a data frame from it, with observations corresponding to lines and variables to fields (columns) in the file.

    read.csv reads a file in .csv format and creates a data frame from it, with observations corresponding to lines and variables to fields (columns) in the file.

    Both functions take many different options, which makes them quite flexible.

    This is the best way to get tabular data into R from external sources.
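    To see both readers on the same input, the sketch below writes a tiny comma-separated file to a temporary location and reads it back. The file contents and variable names are invented for the example.

```r
# Write a small .csv file to a temporary location (made-up data).
csv_file <- tempfile(fileext = ".csv")
writeLines(c("id,height,group",
             "1,1.62,a",
             "2,1.80,b",
             "3,1.75,a"), csv_file)

# read.csv() assumes a header line and comma separators.
dat <- read.csv(csv_file, header = TRUE, stringsAsFactors = FALSE)
str(dat)

# read.table() reads the same file once the options are spelled out.
dat2 <- read.table(csv_file, header = TRUE, sep = ",",
                   stringsAsFactors = FALSE)
print(dat2)

# Clean up the temporary file.
unlink(csv_file)
```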

    Writing out to a dataset

    sink("myfilename") tells R to direct all of its subsequent output to the file called myfilename.

    sink() stops output going there.

    Very useful to create a dataset which can be used as input to other programs, as well as outputting summary information from a script file.
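    A short round trip shows the effect of sink(). The sketch below uses temporary files in place of "myfilename"; the write.table() call at the end is an addition not mentioned on the slide, shown as one common alternative when the file is meant to be read by another program.

```r
# Redirect console output to a temporary file.
out_file <- tempfile(fileext = ".txt")
sink(out_file)
print(summary(c(1, 2, 3, 4, 100)))
cat("done\n")
sink()   # output goes back to the console

# Read back what was captured.
lines_written <- readLines(out_file)
print(lines_written)

# Alternative for tabular data: write.table() produces a plain delimited file.
dat_file <- tempfile(fileext = ".dat")
write.table(head(faithful, 3), file = dat_file,
            row.names = FALSE, quote = FALSE)
back <- read.table(dat_file, header = TRUE)
print(back)

unlink(c(out_file, dat_file))
```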

    Writing out a graphics file

    postscript("myfilename.ps",paper="letter",horizontal=TRUE) tells R to direct all of its subsequent graphics output to the file called myfilename.ps.

    dev.off() stops graphics output going there.

    pdf("myfilename.pdf",width=11.0,height=8.0) tells R to output to a .pdf file called myfilename.pdf.

    Very useful to create postscript (or pdf) pictures which can be included in other documents.
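    The open, draw, close pattern looks like this in practice. A minimal sketch, assuming a temporary directory as the destination; the file name and the plotted data are hypothetical.

```r
# Open a pdf graphics device that writes to a temporary file.
pdf_file <- file.path(tempdir(), "myplot.pdf")
pdf(pdf_file, width = 11.0, height = 8.0)

# Everything drawn now goes to the file instead of the screen.
plot(1:10, (1:10)^2, main = "Squares", xlab = "x", ylab = "x squared")

# Close the device; the file is not complete until this is called.
invisible(dev.off())

file_ok <- file.exists(pdf_file) && file.info(pdf_file)$size > 0
print(file_ok)
```

    Forgetting dev.off() is a common cause of empty or unreadable graphics files.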

    Simple Functions and Programming

    R Functions and simple loops

    A simple function

    > myfunc <- function(x, y) { x + y }

    > myfunc(7,3)

    [1] 10

    A simple loop

    > j <- 100

    > for (i in c(1:10)) {

    + j <- j + i

    + }

    > j

    [1] 155
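    Functions can also take default arguments, and many loops can be replaced by a single vectorized call. The sketch below illustrates both; the function scaled_sum and its scale argument are hypothetical names chosen for the example.

```r
# A function with a default argument; the last expression is the return value.
scaled_sum <- function(x, y, scale = 1) {
  scale * (x + y)
}

r1 <- scaled_sum(7, 3)             # uses the default scale
r2 <- scaled_sum(7, 3, scale = 2)  # named argument overrides the default

# Summing 1 to 10 with an explicit loop ...
total_loop <- 0
for (i in 1:10) {
  total_loop <- total_loop + i
}

# ... and with a vectorized function.
total_vec <- sum(1:10)

print(c(r1, r2, total_loop, total_vec))
```

    The vectorized form is usually shorter and faster, which is why explicit loops are less common in R code than in many other languages.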

    R CMD BATCH and command files

    Once a sequence of R commands is put into a file (a script), one can run those commands inside or outside of R.

    Suppose one has the file myprog.R. Then, to run this R file, one would type R CMD BATCH myprog.R from a command prompt of the operating system.

    From inside an R session (console or graphical user interface), the syntax is different, such as source("myprog.R"), but the principle is the same.
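    To try the source() route end to end, the sketch below writes a three-line script to a temporary directory and runs it. The script contents and the environment name run_env are assumptions made for the example.

```r
# Create a small script file (hypothetical contents).
script_file <- file.path(tempdir(), "myprog.R")
writeLines(c("x <- 1:10",
             "result <- sum(x^2)",
             "cat('sum of squares:', result, '\\n')"), script_file)

# Run it from inside R; local = run_env keeps its variables out of the workspace.
run_env <- new.env()
source(script_file, local = run_env)
result <- get("result", envir = run_env)
print(result)

# From the operating system prompt the equivalent would be
#   R CMD BATCH myprog.R
# which writes the console output to the file myprog.Rout.
```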

    A Simple Application The Problem

    Old Faithful

    We would like to determine if there is a correlation in the following dataset, which is built in to R as the faithful data frame.

    Waiting time between eruptions and the duration of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA.

    A data frame with 272 observations on 2 variables.

    faithful$eruptions (numeric): Eruption time in mins

    faithful$waiting (numeric): Waiting time to next eruption (in mins)

    How to proceed?

    Note: To view all of the built-in data sets, type data().
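    One quick first check, before any modeling, is the sample correlation between the two columns. This is a sketch of that check using base R functions; it is not necessarily the route the slides take.

```r
# Pearson correlation between eruption duration and waiting time.
r <- cor(faithful$eruptions, faithful$waiting)
print(r)

# cor.test() adds a significance test and a confidence interval.
ct <- cor.test(faithful$eruptions, faithful$waiting)
print(ct$p.value)
print(ct$conf.int)
```

    A value of r close to 1 or -1 suggests that a linear model is worth trying.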

    A Simple Application The Solution

    Examine the faithful dataset

    What is it?

    > class(faithful)

    [1] "data.frame"

    What are the names of the column vectors of the data frame?

    > names(faithful)

    [1] "eruptions" "waiting"

    What are the types and length of the column vectors?

    > class(faithful$eruptions)

    [1] "numeric"

    > length(faithful$eruptions)

    [1] 272

    > class(faithful$waiting)

    [1] "numeric"
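    The same questions can be answered in fewer steps with a few standard inspection functions. A minimal sketch; the variable names dims and col_classes are illustrative.

```r
# Compact overview: class, dimensions, column types and first values.
str(faithful)

# First few rows of the data.
print(head(faithful, 5))

# Per-column summary statistics.
print(summary(faithful))

# Dimensions and the class of every column in one call each.
dims <- dim(faithful)
col_classes <- sapply(faithful, class)
print(dims)
print(col_classes)
```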

    Work with faithful dataset

    Print dataset to a file

    sink("faithful.dat")

    print(faithful)

    sink()

    Output of information

    cat("Column names: ",names(faithful),"\n")

    Column names: eruptions waiting

    Is there a linear model?

    A linear regression model

    a.lm <- lm(eruptions ~ waiting, data = faithful)

    summary(a.lm)

                 Estimate Std. Error t value Pr(>|t|)

    (Intercept) -1.874016   0.160143  -11.70

    Information contained in the linear model

    lm attributes

    > attributes(a.lm)

    $names

    [1] "coefficients" "residuals" "effects" "rank"

    [5] "fitted.values" "assign" "qr" "df.residual"

    [9] "xlevels" "call" "terms" "model"

    $class

    [1] "lm"
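    The components listed above are usually read through accessor functions rather than directly. The sketch below assumes the model regresses eruptions on waiting, which is consistent with the intercept reported earlier; the object name fit and the prediction point of 80 minutes are hypothetical.

```r
# Fit the model (assumed formula: eruptions explained by waiting).
fit <- lm(eruptions ~ waiting, data = faithful)

# Estimated intercept and slope.
cf <- coef(fit)
print(cf)

# Residuals, one per observation.
res <- residuals(fit)
print(length(res))

# Proportion of variance explained.
r2 <- summary(fit)$r.squared
print(r2)

# Predicted eruption time after a hypothetical 80 minute wait.
pred <- predict(fit, newdata = data.frame(waiting = 80))
print(pred)
```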

    Plot the faithful data

    postscript("faithful.ps",paper="letter",horizontal=TRUE)

    plot(faithful$eruptions,faithful$waiting, main="Old Faithful",xlab="eruptions",ylab="waiting (min)")

    abline(a.lm)

    dev.off()
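    abline() draws the line y = intercept + slope * x in the coordinates of the current plot, so a fitted line is visible only when the model's response is the variable on the y-axis. A minimal sketch under that assumption, regressing waiting on eruptions to match the axes used above; the object name fit_w and the output file name are hypothetical, and pdf() is swapped in for postscript().

```r
# Model whose response matches the y-axis of the plot (assumption for this sketch).
fit_w <- lm(waiting ~ eruptions, data = faithful)

plot_file <- file.path(tempdir(), "faithful_fit.pdf")
pdf(plot_file, width = 11.0, height = 8.0)

# Scatter plot with eruptions on the x-axis and waiting on the y-axis.
plot(faithful$eruptions, faithful$waiting, main = "Old Faithful",
     xlab = "eruptions (min)", ylab = "waiting (min)")

# Overlay the fitted regression line.
abline(fit_w)

invisible(dev.off())

plot_written <- file.exists(plot_file)
print(coef(fit_w))
print(plot_written)
```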

    The Graph

    [Figure: scatter plot titled "Old Faithful"; x-axis: eruptions (1.5 to 5.0); y-axis: waiting (min) (50 to 90).]

    Concluding Remarks

    Summary

    R is a very powerful and flexible system for analyzing data.

    While the syntax is a bit different from most programming languages, the basic concepts are present in this language also.

    The only way to learn R is by using it.

    There may be times when, although the data analysis can be done with R, it is better to switch to another method.

    Sometimes R is useful just to process parts of large data sets and send the output to another program to use.

    We have just touched the tip of the iceberg of how to use R.

    These are just suggestions; your mileage may vary.

    R is FREE!
