Introduction to R Carol Bult The Jackson Laboratory Functional Genomics (BMB550) Spring 2011



Introduction to R

Carol Bult, The Jackson Laboratory

Functional Genomics (BMB550), Spring 2011


http://www.r-project.org/

R Project for Statistical Computing


What is R?

• R is an integrated suite of software facilities for data manipulation, calculation, and graphical display. It includes:
– An effective data handling and storage facility,
– A suite of operators for calculations on arrays (matrices),
– A large, coherent, integrated collection of intermediate tools for data analysis,
– Graphical facilities for data analysis and display either on-screen or on hardcopy, and
– A well-developed, simple, and effective programming language which includes conditionals, loops, user-defined recursive functions and input and output facilities.

From: http://www.r-project.org/index.html


http://www.fsf.org/licensing/licenses/quick-guide-gplv3.html

R is available for free under the GNU General Public License


http://www.fsf.org/

R is an official part of the Free Software Foundation's GNU project


Bioconductor

http://www.bioconductor.org/

Bioconductor is a project to develop software for analyzing genomic data. Most of the components of Bioconductor are written in R.


Installing R

Guidance on downloading and installing R can be found on the R home page under the FAQ link.


CRAN Mirrors

CRAN stands for “Comprehensive R Archive Network”

All CRAN mirrors should have the same versions of the R software and of the various analysis packages used in R.

Usually you pick a CRAN mirror that is closest to you geographically to use as the source for downloading your R code and packages.


What is an R “package”?

• An R package is data analysis code that is written so that it can be executed in the R environment

• Several basic statistics packages are supplied with the basic R distribution

• Additional packages can be obtained from the CRAN sites
– You will likely need to download additional packages that deal specifically with microarray analysis for this course


Let’s Get Started…

http://www.cloud.target.maine.edu/


http://sourceforge.net/projects/xming/

Xming is an X server for Windows


Use PuTTY to access the cloud server

1. Enter the IP address and select SSH
2. Under SSH select Tunnels (in newer PuTTY versions, X11) and make sure that Enable X11 forwarding is selected


With the Xming X server running and the X11 forwarding set in PuTTY, you should now be able to launch windows from PuTTY.

To test whether your X server is running properly:

At the $ prompt in your PuTTY window, type matlab and see if the MATLAB window launches.

From within R, type help.start() from the > prompt and see if a web browser window launches.


Once logged in, type R at the $ prompt to start your R session


You can invoke on-line help with the help.start() command.

Note: R is a case-sensitive language! What happens if you type Help.Start() at the command prompt?


At the prompt (>), type the command library() to see which packages are available in your R installation.

A library is a location where R goes to find packages.

What is listed for you may differ from what is shown here.
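As a rough sketch of the same idea from the console: `.libPaths()` reports the library locations, and `installed.packages()` returns the package table that `library()` displays. The variable name `pkgs` is only illustrative.

```r
# Directories ("libraries") where R looks for installed packages
.libPaths()

# Names of all packages installed in those libraries
pkgs <- rownames(installed.packages())
head(pkgs)

# In an interactive session, library() with no arguments shows the same list
if (interactive()) library()
```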


• The greater-than sign “>” is the command prompt
• ‘library()’ is the command
• The parentheses hold any arguments that tell the function what to operate on

You might find this R reference card helpful. Use it as a stub to create your own reference card!

http://www.psych.upenn.edu/~baron/refcard.pdf


Use this command to get a list of functions associated with the stats package.
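The command from the original slide is not reproduced in this transcript. One way to get such a list, offered here only as a sketch, is `library(help = "stats")` for an overview page, or `ls("package:stats")` for the function names as a vector. The name `stats_functions` is illustrative.

```r
# Names of the objects exported by the stats package (attached by default)
stats_functions <- ls("package:stats")
length(stats_functions)
head(stats_functions)

# In an interactive session this opens an overview of the package instead
if (interactive()) library(help = "stats")
```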


Use this command to launch detailed documentation for a package.


Running this command will list all of the data sets that come with your R installation.
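The command itself was shown as a screenshot. Assuming it was `data()`, the sketch below pulls out the data set names for the `datasets` package so they can be inspected as an ordinary vector. The name `available` is illustrative.

```r
# data() reports which data sets are available; here we ask about one package
available <- data(package = "datasets")$results[, "Item"]
head(available)

# Check that the BOD data set used later in these slides is among them
"BOD" %in% available

# In an interactive session, data() on its own shows the full list
if (interactive()) data()
```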


In this series of commands, we load the BOD (Biological Oxygen Demand) data set and then print out the data to the screen.
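The commands on the original slide are not in the transcript; a minimal version of what is described might look like this:

```r
# Load the built-in BOD data set into the workspace
data(BOD)

# Typing the name of an object prints it to the screen
BOD
```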


Data Frames in R

• The data sets in R are objects called data frames
– You can think of a data frame as a table where the columns are variables and rows are observations


Types of Objects in R

• Vector
– One-dimensional array of arbitrary length. All members of a vector must be of the same type (numeric, character, etc.)
• Matrix
– Two-dimensional array with an arbitrary number of rows and columns. All elements in a matrix must be of the same type.
• Array
– Similar to a matrix but with an arbitrary number of dimensions
• Data frame
– Organized similarly to a matrix, except that each column can hold its own type of data. Not all columns in a data frame need to contain the same type of data.
• Function
– A type of R object that performs a specific operation. R contains many built-in functions.
• List
– An arbitrary collection of R objects
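To make the object types concrete, here is a small sketch that creates one of each. All names and values are made up for illustration.

```r
v  <- c(1.5, 2.5, 3.5)                               # vector: one type, one dimension
m  <- matrix(1:6, nrow = 2, ncol = 3)                # matrix: rows and columns
a  <- array(1:24, dim = c(2, 3, 4))                  # array: here, three dimensions
df <- data.frame(id = 1:3, label = c("a", "b", "c")) # data frame: columns of different types
f  <- function(x) x^2                                # function: performs an operation
l  <- list(v, m, df, f)                              # list: any collection of objects

# class() reports what kind of object each one is
class(v); class(m); class(a); class(df); class(f); class(l)
```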


Create the BOD Data Frame from Scratch

1. Create a vector object for time using the c() function.

2. Create a vector object for demand.

3. Use the data.frame() function to create the data frame object
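The three steps might look like the sketch below. The values are copied from R's built-in BOD data set; the name MyBOD matches the data frame referred to later in these slides.

```r
# 1. Vector of measurement times
time <- c(1, 2, 3, 4, 5, 7)

# 2. Vector of oxygen demand values
demand <- c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8)

# 3. Combine the two vectors into a data frame
MyBOD <- data.frame(time, demand)
MyBOD
```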


Reading in data using the scan() command

If you have a simple text file of values (no headers), you can read those data in using the scan() function.

In this example, there is a text file called scandata.txt in the working directory/folder that I pointed R to when I first started up the program.

To read in structured data files…i.e., ones with multiple columns of data and headers, etc. use the read.table() command instead of scan().
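A self-contained sketch of scan(): since the contents of scandata.txt are not shown, the example first writes a stand-in file with made-up values to a temporary directory and then reads it back.

```r
# Create a stand-in for scandata.txt (the values are invented for illustration)
scan_file <- file.path(tempdir(), "scandata.txt")
writeLines(c("12 15 9", "22 7 30"), scan_file)

# scan() reads whitespace-separated numbers into a single vector
mydata <- scan(scan_file)
mydata
```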


Reading in Data from Files

Use the read.table() function to read in data from a text file and use it as a data frame in R.

Other input functions for R include read.csv() and read.delim().

If you had data in an Excel spreadsheet, how could you import it into R?

<- is the standard assignment operator in R; = also works for assignment at the command prompt.
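A minimal sketch of read.table(), assuming a whitespace-separated file with a header row. The file name bod.txt and its contents are hypothetical; newdata is the object name used later in the slides. For Excel data, one common route is to save the sheet as a .csv file and use read.csv() in the same way.

```r
# Write a small example file with a header line (hypothetical contents)
bod_file <- file.path(tempdir(), "bod.txt")
writeLines(c("time demand", "1 8.3", "2 10.3", "3 19.0"), bod_file)

# header = TRUE tells read.table() that the first line holds column names
newdata <- read.table(bod_file, header = TRUE)
str(newdata)
```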


How would you find out more about the c() function?

Find the details behind the data in the BOD data frame.


Writing Output to Files

Want to save the data in your data frame as a text file on your computer?

Use the write.table() function to output the MyBOD data frame to a text file.

This file will be saved to the directory that R is working from.


Writing Output to Files

Use the write.csv() function to output the MyBOD data frame to a comma separated file (which can be opened easily in Excel).

This file will be saved to the directory that R is working from.
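Both output functions can be sketched as follows. The file names are hypothetical, and the example writes to a temporary directory; a bare file name such as "MyBOD.csv" would instead go to the working directory reported by getwd().

```r
MyBOD <- data.frame(time   = c(1, 2, 3, 4, 5, 7),
                    demand = c(8.3, 10.3, 19.0, 16.0, 15.6, 19.8))
out_dir <- tempdir()

# Tab-separated text file, without row numbers or quotes
write.table(MyBOD, file = file.path(out_dir, "MyBOD.txt"),
            sep = "\t", row.names = FALSE, quote = FALSE)

# Comma-separated file that a spreadsheet program can open
write.csv(MyBOD, file = file.path(out_dir, "MyBOD.csv"), row.names = FALSE)

# Read the CSV back in to confirm what was written
check <- read.csv(file.path(out_dir, "MyBOD.csv"))
check
```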


Editing Data

For smaller data files, you can edit the file using the edit() function. This launches an R Editor window.

Always write the edited file to a new object name!

In this case we will edit the newdata object and store the results as an object called Mynewdata. To store the edited object as a file on your computer, use the write.table() function.


Exploring What is In a Data Frame

The names() and str() commands let you get an overview of what is in a data frame.

The names() function allows you to access the column names and edit them. In this code snippet, the first [1] and second [2] column names are changed from lower case to sentence case.
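The code snippet from the slide is not in the transcript. A sketch of what it describes, assuming MyBOD2 starts out with lower-case column names:

```r
# Assumed starting point: a copy of the BOD data with lower-case names
MyBOD2 <- data.frame(time = BOD$Time, demand = BOD$demand)

names(MyBOD2)   # column names
str(MyBOD2)     # structure: number of rows, column types, first values

# Change the first and second column names to sentence case
names(MyBOD2)[1] <- "Time"
names(MyBOD2)[2] <- "Demand"
names(MyBOD2)
```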


Accessing Data in a Data Frame

Use the name of the data frame (MyBOD2), a dollar sign ($) and the name of the variable (Time or Demand) to see a list of all of the observation values.

To access a specific value, you simply indicate the position in the vector. For example, MyBOD2$Demand[2] will access the second value for that variable, which is 10.3.

If you “attach” the data frame using the attach() command you can access the variables and observations without the cumbersome need to specify the name of the data frame or the $.

Using what you know…how could you change the value of Demand[2] from 10.3 to 10.5? (be careful that you don’t make such changes to the original data frames!)
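One possible approach to the question above, sketched with a working copy so the original data frame stays untouched. MyBOD2 is assumed to be the BOD data with renamed columns; the name `working` is illustrative.

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

MyBOD2$Demand       # all observations for the Demand variable
MyBOD2$Demand[2]    # the second observation, 10.3

# Make the change on a copy rather than on the original data frame
working <- MyBOD2
working$Demand[2] <- 10.5
working$Demand[2]
```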


Adding columns to a Data Frame

You can add and delete columns from a data frame.

Here we add a column for the sex of whatever it is we are measuring oxygen demand for.

Oops!!! We have a data entry error. The values for sex should all be female (F). How would you fix this?
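A sketch of adding the column and then repairing it. The exact values entered on the slide are not known, so the erroneous "M" below is invented for illustration.

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

# Add a new column; the "M" stands in for the data entry error
MyBOD2$Sex <- c("F", "F", "M", "F", "F", "F")

# Fix: overwrite every value that is not "F"
MyBOD2$Sex[MyBOD2$Sex != "F"] <- "F"
MyBOD2
```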


Deleting columns from a Data Frame

You can delete columns from a data frame.

Here we deleted the column for sex that we just created from the MyBOD2 data frame.
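One common way to delete a column, shown as a sketch, is to assign NULL to it:

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")
MyBOD2$Sex <- "F"      # a single value is repeated for every row

# Assigning NULL removes the column from the data frame
MyBOD2$Sex <- NULL
names(MyBOD2)
```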


Note: When you are done using a data frame it is a good practice to “detach” it.
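A short sketch of the attach/detach cycle, assuming MyBOD2 is the BOD data with renamed columns. The name `demand_mean` is illustrative.

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

attach(MyBOD2)                # column names can now be used directly
demand_mean <- mean(Demand)
detach(MyBOD2)                # remove the data frame from the search path

demand_mean
```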


Displaying Data in R

R comes with an incredible array of built-in data analysis tools for exploring, analyzing, and visualizing data.

Here is a plot of the Time and Demand variables for the MyBOD2 data frame using the plot() command.

Note that because we “attached” this data frame we can just use the names of the variables to access the observation data.

Use help(plot) to look up the details of this command. Figure out how to change the command to add a title to the plot.
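One possible answer, as a sketch: the `main` argument of plot() sets the title. The title and axis labels below are illustrative, and the columns are referenced through the data frame so the example does not depend on attach().

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

# main = adds a title; xlab and ylab label the axes
plot(MyBOD2$Time, MyBOD2$Demand,
     main = "Oxygen demand over time",
     xlab = "Time", ylab = "Demand")
```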


Displaying Data in R

Here is a box plot of the Demand variable for the MyBOD2 data frame using the boxplot() command.
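A sketch of the box plot call. boxplot() also returns the numbers it draws, which is captured here in `bp` (an illustrative name).

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

# Draw the box plot and keep the statistics behind it
bp <- boxplot(MyBOD2$Demand, main = "Demand", ylab = "Demand")

# Lower whisker, lower hinge, median, upper hinge, upper whisker
bp$stats
```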


Analyzing Data

The summary() command provides summary statistics for a data frame.
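For example, assuming MyBOD2 is the BOD data with renamed columns:

```r
MyBOD2 <- BOD
names(MyBOD2) <- c("Time", "Demand")

# Minimum, quartiles, median, mean and maximum for every column
summary(MyBOD2)

# The same statistics for a single variable
demand_summary <- summary(MyBOD2$Demand)
demand_summary
```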


Analyzing Data

Here is a series of commands to generate some basic statistics for the Demand variable in the MyBOD2 data frame.

The data frame has been attached so that the variable names can be used directly.

Remember that the case of the variable names was changed relative to the original BOD data set (Time vs time; Demand vs demand)!


Examples of stats functions in R

• mean()
• median()
• table() – there is no built-in function to find the statistical mode of a data set (R's mode() reports an object's storage type instead), but the table() function will show how many times each value is observed.
• max()
• min()
• There is no built-in function for midrange, so you have to construct a formula to calculate this based on the values from the max() and min() functions.
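The functions above can be tried on the demand values of the built-in BOD data set. The midrange formula is the usual (max + min) / 2; the name `midrange` is illustrative.

```r
demand <- BOD$demand

mean(demand)
median(demand)
table(demand)     # how many times each value occurs
max(demand)
min(demand)

# Midrange: halfway between the smallest and largest value
midrange <- (max(demand) + min(demand)) / 2
midrange
```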


Measuring data spread

Remember that the case of the variable names was changed relative to the original BOD data set (Time vs time; Demand vs demand)!

Here is a series of commands to generate some basic statistics related to the spread of measurements for the Demand variable in the MyBOD2 data frame.

The data frame has been attached so that the variable names can be used directly.


More examples of stats functions in R

• var()
• sd()
• There is no built-in function for calculating the standard error of the mean (sem), so you have to create a formula to calculate this.
• R's range() function returns the minimum and maximum values rather than their difference, so to get the range as a single number you have to construct a formula based on the values from the max() and min() functions.
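A sketch of the spread statistics on the BOD demand values. The standard error of the mean is taken as sd / sqrt(n); the names `sem` and `value_range` are illustrative.

```r
demand <- BOD$demand

var(demand)   # variance
sd(demand)    # standard deviation

# Standard error of the mean: standard deviation over the square root of n
sem <- sd(demand) / sqrt(length(demand))
sem

# Range as a single number: largest value minus smallest value
value_range <- max(demand) - min(demand)
value_range
```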


What is meant by mode?

What do the variance, standard deviation and standard error of the mean tell us about a data set?


Your Turn

Create a data frame for age and frequency using the data on this slide.

Calculate the cumulative frequency and add it as a column to the data frame.

Save the data frame as a comma separated text file and then open it in Excel.

Plot age versus cumulative frequency.

What are mean and median age?

What are the variance, standard deviation, and standard error of the mean for frequency?


Creating an R script.

Instead of typing one command per prompt, you can type all commands in a text file and run your data analysis process as an R script.

R ignores lines that start with #.

You can use this feature to add comments about your R code.
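As a small sketch, the example below writes a three-line commented script to a temporary file and runs it with source(). The file name bod_summary.R and its contents are hypothetical.

```r
# Write a tiny script to a file (in practice you would use a text editor)
script_file <- file.path(tempdir(), "bod_summary.R")
writeLines(c(
  "# Mean oxygen demand in the built-in BOD data set",
  "demand_mean <- mean(BOD$demand)  # text after # is ignored by R",
  "print(demand_mean)"
), script_file)

# source() runs every command in the file, in order
source(script_file)
```

From the shell prompt, the same file could also be run with Rscript followed by the path to the script.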


Example of a well-documented set of R commands that can be stored as a file.


Write your age/frequency data frame assignment as an R script.

Store the script in the directory where you keep all of your R work.

Figure out how to run the script without cutting and pasting it into the interface!


Tip of the Day: The edit() function


Installing R Packages

1. In R, choose the menu item Packages -> Install Packages
2. Choose a CRAN site
3. You will see a list of packages
4. Choose the aplpack package
5. You should see a message about accessing the package and then the message “package ‘aplpack’ successfully installed and MD5 sums checked”
6. To load the package, type library(aplpack)
7. Run the following command: stem.leaf(rnorm(50))
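If no Packages menu is available (for example, in a terminal session on a server), the same steps can be sketched with install.packages() at the prompt. The block is wrapped in interactive() because it needs network access and may ask you to choose a CRAN mirror.

```r
if (interactive()) {
  install.packages("aplpack")   # download and install from a CRAN mirror
  library(aplpack)              # load the package into the session
  stem.leaf(rnorm(50))          # stem-and-leaf display of 50 random normal values
}
```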


What is this command doing?

stem.leaf(rnorm(50))


Dynamic R: Looping, Slicing


Using a loop, write an R script that converts Celsius readings between 25 and 30 into Fahrenheit
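One possible solution, sketched with a for loop over whole-degree readings and the standard conversion F = C * 9/5 + 32:

```r
# Convert each Celsius reading from 25 to 30 into Fahrenheit
for (celsius in 25:30) {
  fahrenheit <- celsius * 9 / 5 + 32
  cat(celsius, "C =", fahrenheit, "F\n")
}
```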


Dynamic R: Looping, Slicing

Imagine you have two vectors that you want to join into a single vector. Easy.
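For instance, with two made-up vectors `a` and `b`, the c() function joins them end to end:

```r
a <- c(10, 20, 30)
b <- c(40, 50, 60)

# c() combines its arguments into one vector
joined <- c(a, b)
joined
```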


Dynamic R: Looping, Slicing

What if you only want the 2nd and 3rd and last members of vector c?
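A sketch of slicing by position. The vector is called `joined` here (an illustrative name with made-up values) rather than `c`, to avoid confusion with the c() function; length() supplies the index of the last member.

```r
joined <- c(10, 20, 30, 40, 50, 60)

# Pick the 2nd, 3rd and last members by giving a vector of positions
picked <- joined[c(2, 3, length(joined))]
picked
```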


Create a vector of sequential numbers from 1 to 1000.

Loop through this vector and pick out only those numbers that are evenly divisible by 3 and create a new vector of these numbers.

How many elements are in this new vector?

What is the sum of all the elements in the new vector?
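One way to approach this exercise, as a sketch: the loop tests each number with the modulo operator %% and appends the matches to a new vector. The variable names are illustrative.

```r
numbers <- 1:1000
divisible <- c()   # start with an empty vector

for (n in numbers) {
  if (n %% 3 == 0) {                  # remainder of 0 means evenly divisible by 3
    divisible <- c(divisible, n)      # append n to the new vector
  }
}

length(divisible)   # 333
sum(divisible)      # 166833
```

The same result can be obtained without a loop using numbers[numbers %% 3 == 0], which is the more idiomatic R form once slicing is familiar.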


Tips

• Create a directory/folder on your computer where you can keep your work
– Each time you start R, be sure it is reading and writing from this directory
• Document your R scripts/commands using a text editor
– PLAIN TEXT only
• Glance through the user documentation that comes with R, and consult that documentation if you get stuck
