
A gentle introduction to R – how to load in data and produce summary statistics

BRC MH Bioinformatics group

Tutorial outline

• How to install R on your own computers

– It's free

– But it's already installed on these computers

• Loading data from Excel

• Plotting

• Summary statistics

Files

• Data and slides on: http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012

Show file extensions


• Uncheck ‘hide extensions for known file types’

• Click ‘Apply’

Installing R – skip as already installed

And follow operating system specific installation instructions


Starting R on these computers

Help files

Loading help files

• A useful function is read.table()

– It allows you to read data from spreadsheets into R

• To see its help file you can use

?read.table

• You can use ?function_name for any function to see a help file

Loading data into R from Excel

From Excel

• Open testdata.xls

• You need to save it as a comma-separated value file (.csv): go to File > Save As > Other Formats

R working directory

• To open a file you will need to point R towards the folder that contains it.

• You can do this with setwd(), but we’ll do it using the mouse

• Suppose you have the file in My Documents

Browsing folders

• To check that you are in the right folder type

getwd()

• To see files in this folder you can type

list.files()

• To list the current variables type

ls()

• Nothing should be loaded yet
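The same working-directory steps can also be done from the console with setwd(). This is only a minimal sketch: tempdir() is an assumption standing in for the folder that holds your data, and old.dir is a hypothetical variable name.

```r
# Assumption: tempdir() stands in for the folder that holds your data;
# on your own machine you would pass the real folder path to setwd().
old.dir <- getwd()   # remember the current working directory
setwd(tempdir())     # point R at the stand-in data folder
getwd()              # check that we are in the right folder
list.files()         # see which files are in this folder
setwd(old.dir)       # return to where we started
```

Saving the old directory first makes it easy to go back once you have finished.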

Loading data

To follow along with this section, make sure your R working directory is the folder that contains the tutorial data

• Read the contents of file testdata.csv into an R variable my.data with:

my.data <- read.csv('testdata.csv')

• read.csv is a wrapper for read.table, and read.table lets you specify more details about your file, e.g.:

my.data <- read.table('testdata.csv', sep=',', header=TRUE)

read.table()

• sep : Column separator

• header : Does the first row of the file contain column headers?

• skip : Number of rows to skip at the top of the file

• ?read.table for other useful parameters
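To see how read.csv and read.table relate, the sketch below writes a tiny comma-separated file and loads it both ways. The file contents and the names toy.file, a and b are made up for illustration; they are not the workshop data.

```r
# Hypothetical toy file standing in for testdata.csv
toy.file <- tempfile(fileext = ".csv")
writeLines(c("gender,height,weight",
             "Male,180,80",
             "Female,165,60"), toy.file)

# read.csv presets sep and header for comma-separated files ...
a <- read.csv(toy.file)

# ... so this read.table call should load the same data
b <- read.table(toy.file, sep = ",", header = TRUE)

all.equal(a, b)   # compare the two data frames
```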

Looking at loaded data

• Check your new variable is in the R environment:

ls()

• Take a look at the top couple of lines:

head(my.data)

• Generate some basic summary stats:

summary(my.data)

• Check the dimensions of your dataset:

dim(my.data)

• Number of rows and columns

nrow(my.data)
ncol(my.data)

• Row and column names

rownames(my.data)
colnames(my.data)

Subsetting Data

• Look at the first row:

my.data[1,]

• Look at the first column:

my.data[,1]

• Look at the third column of row 10:

my.data[10,3]

• Look at the first column for rows 100 to 110

my.data[100:110,1]

• Same as above, but save to a variable

my.subset <- my.data[100:110,1]

• Look at rows 30, 40, 50 and 60

my.data[c(30,40,50,60),]

• Same as above but pre-defining the index vector

my.indices <- c(30, 40, 50, 60)
my.data[my.indices,]

You can subset on names instead of indices:

• Look at the column named 'weight' for row 1:

my.data[1,'weight']

• Look at the columns named 'weight' and 'height' for row 1:

my.data[1,c('weight','height')]

• Same as above but pre-define the colnames vector

cols <- c('weight','height')
my.data[1,cols]

Negative indices exclude elements:

• Look at all columns except the second for row 1

my.data[1,-2]

• Extract all rows except 1-100

my.new.data <- my.data[-1:-100,]

• Extract all rows except 35, 67, 101

my.indices <- -1 * c(35, 67, 101)
my.new.data <- my.data[my.indices,]

• How tall is the person in the 7th row?

• What gender is the person in the 300th row?

• For the people in rows 20-30, who is the heaviest?

• For the people in rows 110, 350, 219, 74, who is the tallest?

• Save all rows except 500-600 in a variable my.new.data

• How many males and females are in this new dataset?
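One possible way to approach questions like these is sketched below on a made-up five-row data frame. The data frame toy, its values and the row numbers are hypothetical; the column names gender, height and weight are assumed from the plotting examples. which.max() and table() are base R functions not introduced on the slides.

```r
# Hypothetical toy data standing in for my.data
toy <- data.frame(gender = c("Male", "Female", "Female", "Male", "Female"),
                  height = c(180, 165, 170, 175, 160),
                  weight = c(80, 60, 65, 90, 55))

toy[2, "height"]                  # height of the person in row 2

rows <- 2:4                       # the rows we want to compare
# which.max gives the position of the largest value within 'rows'
heaviest <- rows[which.max(toy[rows, "weight"])]
heaviest                          # row number of the heaviest person

toy.new <- toy[-(4:5), ]          # keep all rows except 4 to 5
table(toy.new$gender)             # count each gender in what is left
```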

Formatting problems

Data isn't comma-separated?

• Specify the separator in read.table

• Tab-delimited text is another common format, for which you can use sep="\t"

Load "testdata.txt", a tab-delimited version of the data
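As a minimal sketch of reading tab-delimited text, the code below writes a small made-up file and loads it with sep="\t". The file and its contents are hypothetical stand-ins, not the workshop's testdata.txt.

```r
# Hypothetical toy file standing in for a tab-delimited data file
toy.file <- tempfile(fileext = ".txt")
writeLines(c("gender\theight\tweight",
             "Male\t180\t80",
             "Female\t165\t60"), toy.file)

# Tell read.table that columns are separated by tabs
toy <- read.table(toy.file, sep = "\t", header = TRUE)
dim(toy)   # rows and columns that were read
```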

Data has extra header information at the top?

• Either delete this data in Excel before exporting to csv

• Or, use the skip=N argument to read.table

Have a look at "testdata_1.csv" in Excel and then load it into R using read.table
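The skip argument can be sketched on a made-up file with two lines of extra header information. The file contents here are invented for illustration, and the value skip = 2 only fits this toy file; check your own file to see how many lines to skip.

```r
# Hypothetical toy file with two extra lines above the column headers
toy.file <- tempfile(fileext = ".csv")
writeLines(c("Exported from a spreadsheet",
             "Some extra notes",
             "gender,height,weight",
             "Male,180,80",
             "Female,165,60"), toy.file)

# Skip the first two lines, then treat the next line as the header
toy <- read.table(toy.file, sep = ",", header = TRUE, skip = 2)
colnames(toy)
```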

Factors are inconsistently named

• R will just read in the data you give it.

• If you aren't consistent in naming the levels of your factors, R will see them as different levels

• R is case sensitive. 'MyLevel' != 'mylevel'

Load the data from testdata_2.csv and have a look at the gender variable.

Try and fix the problems in Excel and reload.
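The exercise asks for a fix in Excel; as an alternative, inconsistent labels can also be tidied inside R. The sketch below uses a made-up vector, not the contents of testdata_2.csv, and uses tolower() and trimws(), which are base R functions not covered on the slides.

```r
# Hypothetical gender labels with inconsistent case and a stray space
gender <- c("Male", "male", "FEMALE", "Female", "female ")
table(gender)                       # R sees five different labels

# Remove surrounding spaces and convert everything to lower case
cleaned <- factor(tolower(trimws(gender)))
levels(cleaned)                     # the distinct levels after cleaning
table(cleaned)
```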

Measurements and units in a single column

• If you store values like 10kg, R will not interpret this as a numeric column

Try loading file 'testdata_3.csv' - what has happened to the weights and heights information?

Try loading again so that the two are loaded as character vectors.

Have a look at the sub() function and see if you can fix the problem
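A minimal sketch of the sub() approach, assuming the units are the suffixes "kg" and "cm". The data frame toy and its values are made up and are not the contents of testdata_3.csv.

```r
# Hypothetical columns where values and units were typed together
toy <- data.frame(weight = c("80kg", "60kg", "65kg"),
                  height = c("180cm", "165cm", "170cm"),
                  stringsAsFactors = FALSE)   # keep them as character vectors

# sub() replaces the first match of the pattern with the replacement;
# replacing the unit with "" leaves only the number, still as text
toy$weight <- as.numeric(sub("kg", "", toy$weight))
toy$height <- as.numeric(sub("cm", "", toy$height))

summary(toy)   # both columns are now numeric
```

When reading from a file, passing stringsAsFactors=FALSE to read.csv keeps such columns as character vectors in the same way.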

Excel has just screwed up your data

• Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version.

Avoid opening large datasets in Excel; use R instead

• Excel tries to be helpful by formatting elements for you. Try the following and then open in Excel, save as csv and reload into R. What has happened?

my.genes <- c('MASH1','SOX2','OCT4')
write.csv(my.genes, file='mygenes.csv')

Plotting

Drawing histograms

Optional exercises –

1) Try drawing a histogram of height

2) Try and label the x axis [hint: read the help file]
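A minimal sketch of drawing a histogram with hist(), shown on simulated weights rather than the workshop data. The data frame toy, the simulated values and the label text are assumptions for illustration.

```r
# Hypothetical simulated weights standing in for my.data$weight
set.seed(1)
toy <- data.frame(weight = rnorm(100, mean = 70, sd = 10))

# main sets the plot title and xlab sets the x axis label
h <- hist(toy$weight, main = "Histogram of weight", xlab = "Weight")
```

hist() also returns the bin counts invisibly, which is why the result can be stored in h.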

Drawing normal QQ plots

qqnorm(my.data$weight); qqline(my.data$weight)

Drawing scatterplots

Optional exercises: try these. Do you understand the second plot?

plot(height~weight,data=my.data)

plot(height~weight,data=my.data,col=as.numeric(as.factor(gender)))

Drawing boxplots

boxplot(height~gender,data=my.data)

Saving plots

JPEGs

jpeg("boxplot.jpg")
boxplot(height~gender,data=my.data)
dev.off()

PDFs

pdf("boxplot.pdf")
boxplot(height~gender,data=my.data)
dev.off()

Summary statistics

Functions Covered

http://www.statmethods.net/index.html

Writing tables
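Assuming this refers to saving a data frame to a file, a minimal sketch with write.csv() follows. The data frame toy, the file name toy_output.csv and the use of tempdir() are hypothetical choices for illustration.

```r
# Hypothetical toy data standing in for my.data
toy <- data.frame(gender = c("Male", "Female", "Female"),
                  height = c(180, 165, 170),
                  weight = c(80, 60, 65))

# Write to a temporary folder; row.names = FALSE leaves out the row numbers
out.file <- file.path(tempdir(), "toy_output.csv")
write.csv(toy, file = out.file, row.names = FALSE)

# Read it back to check what was written
back <- read.csv(out.file)
back
```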

Calculate Mean and SD
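A minimal sketch of calculating the mean and standard deviation with mean() and sd(), overall and per group. The data frame toy and its values are made up; tapply() is one base R way to apply a function within groups and is an addition not shown on the slides.

```r
# Hypothetical toy data standing in for my.data
toy <- data.frame(gender = c("Male", "Female", "Female", "Male", "Female"),
                  height = c(180, 165, 170, 175, 160))

overall.mean <- mean(toy$height)   # mean of all heights
overall.sd   <- sd(toy$height)     # standard deviation of all heights

# Mean height within each gender
by.gender <- tapply(toy$height, toy$gender, mean)
by.gender
```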

Correlate phenotypes and test for group differences

It is always important to check model assumptions before making statistical inferences
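As a rough sketch of these steps on simulated data: cor.test() tests a correlation, t.test() compares two groups, and shapiro.test() is one possible normality check. The choice of these particular tests, the data frame toy and the simulated values are assumptions, not necessarily what the workshop used.

```r
# Hypothetical simulated data: height depends on weight plus noise
set.seed(1)
n <- 50
toy <- data.frame(gender = rep(c("Female", "Male"), each = n / 2),
                  weight = rnorm(n, mean = 70, sd = 10))
toy$height <- 100 + toy$weight + rnorm(n, mean = 0, sd = 5)

# One way to check the normality assumption before testing
shapiro.test(toy$height)

# Correlation between two phenotypes
cor.res <- cor.test(toy$height, toy$weight)
cor.res$estimate

# Difference in mean height between the two groups
t.res <- t.test(height ~ gender, data = toy)
t.res$p.value
```

The QQ plots from the plotting section are another way to look at the normality assumption.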

Linear regression
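A minimal sketch of a linear regression with lm(), using the same formula notation as the plotting commands. The data are simulated and the names toy and fit are hypothetical.

```r
# Hypothetical simulated data: height depends on weight plus noise
set.seed(1)
toy <- data.frame(weight = rnorm(50, mean = 70, sd = 10))
toy$height <- 100 + toy$weight + rnorm(50, mean = 0, sd = 5)

# Fit height as a linear function of weight
fit <- lm(height ~ weight, data = toy)

coef(fit)      # intercept and slope
summary(fit)   # standard errors, p-values and R-squared
```

The residuals, available with resid(fit), can be passed to qqnorm() to check the model assumptions.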
