A gentle introduction to R: how to load in data and produce summary statistics
BRC MH Bioinformatics group
Tutorial outline
• How to install R on your own computers
  – It's free
  – But it's already installed on these computers
• Loading data from Excel
• Plotting
• Summary statistics
Files
• Data and slides on: http://core.brc.iop.kcl.ac.uk/brc-bioinformatics-workshop-october-2012
Show file extensions
• Uncheck ‘hide extensions for known file types’
• Click ‘Apply’
Installing R – skip as already installed
And follow operating system specific installation instructions
Starting R on these computers
Help files
Loading help files
• A useful function is read.table()
  – It allows you to read data from spreadsheets into R
• To see its help file you can use
?read.table
• You can use ?function_name for any function to see a help file
Loading data into R from Excel
From Excel
• Open testdata.xls
• You need to save it as a comma-separated values file (.csv): go to File > Save As > Other Formats
R working directory
• To open a file you will need to point R towards the folder that contains it.
• You can do this with setwd(), but we’ll do it using the mouse
• Suppose you have the file in My Documents
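For reference, the setwd() route mentioned above might look like the sketch below. The folder is a stand-in: tempdir() is used only so the example runs anywhere, and you would substitute the path to your own folder.

```r
# Hypothetical folder: tempdir() stands in for wherever your data lives,
# for example your My Documents folder.
data.folder <- tempdir()

# setwd() changes the working directory and returns the previous one,
# so you can switch back later with setwd(old.wd)
old.wd <- setwd(data.folder)

getwd()   # confirm where R is now looking for files
```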
Browsing folders
• To check that you are in the right folder type
getwd()
• To see files in this folder you can type
list.files()
• To list the current variables type
ls()
• Nothing should be loaded yet
Loading data
To follow along with this section, make sure your R working directory is the folder that contains the tutorial data
• Read the contents of file testdata.csv into an R variable my.data with:
my.data <- read.csv('testdata.csv')
• read.csv is a wrapper for read.table, which lets you specify more details about your file, e.g.:
my.data <- read.table('testdata.csv', sep=',', header=TRUE)
read.table()
• sep : Column separator
• header : Does the first row of the file contain column headers?
• skip : Number of rows to skip at the top of the file
• ?read.table for other useful parameters
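To see that read.csv and read.table really do the same job here, the sketch below writes a tiny made-up file and reads it both ways. The file contents and the variable names via.csv and via.table are invented for illustration and are not the tutorial data.

```r
# Write a small, invented comma separated file to a temporary location
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("height,weight,gender",
             "170,65,female",
             "182,80,male"), tmp.file)

# Read it once with the wrapper and once with read.table directly
via.csv   <- read.csv(tmp.file)
via.table <- read.table(tmp.file, sep = ",", header = TRUE)

# Compare the two data frames; all.equal() returns TRUE when they match
all.equal(via.csv, via.table)
```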
Looking at loaded data
• Check your new variable is in the R environment:
ls()
• Take a look at the top few lines:
head(my.data)
• Generate some basic summary stats:
summary(my.data)
• Check the dimensions of your dataset:
dim(my.data)
• Number of rows and columns
nrow(my.data)
ncol(my.data)
• Row and column names
rownames(my.data)
colnames(my.data)
Subsetting Data
• Look at the first row:
my.data[1,]
• Look at the first column:
my.data[,1]
• Look at the third column of row 10
my.data[10,3]
• Look at the first column for rows 100 to 110
my.data[100:110,1]
• Same as above, but save to a variable
my.subset <- my.data[100:110,1]
• Look at rows 30, 40, 50 and 60
my.data[c(30,40,50,60),]
• Same as above but pre-defining the index vector
my.indices <- c(30, 40, 50, 60)
my.data[my.indices,]
You can subset on names instead of indices:
• Look at the column named 'weight' for row 1:
my.data[1,'weight']
• Look at the columns named 'height' and 'weight' for row 1:
my.data[1,c('weight','height')]
• Same as above but pre-define the colnames vector
cols <- c('weight','height')
my.data[1,cols]
Negative indices exclude elements:
• Look at all columns except the second for row 1
my.data[1,-2]
• Extract all rows except 1-100
my.new.data <- my.data[-1:-100,]
• Extract all rows except 35, 67, 101
my.indices <- -1 * c(35, 67, 101)
my.new.data <- my.data[my.indices,]
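The same indexing rules can be tried on a small data frame where the result is easy to check by eye. The data frame toy and its values below are made up for illustration; they are not the tutorial data.

```r
# Invented stand-in for my.data
toy <- data.frame(height = c(160, 172, 181, 168, 175),
                  weight = c(55, 70, 85, 62, 74))

toy[2, ]                  # second row, all columns
toy[2:4, "height"]        # height for rows 2 to 4
toy[-1:-3, ]              # every row except 1 to 3

# Negative indices built from a vector, as in the slide
drop.rows <- -1 * c(2, 4)
kept <- toy[drop.rows, ]  # every row except 2 and 4
nrow(kept)                # number of rows left
```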
Quiz!
• How tall is the person in the 7th row?
• What gender is the person in the 300th row?
• For the people in rows 20-30, who is the heaviest?
• For the people in rows 110, 350, 219, 74, who is the tallest?
• Save all rows except 500-600 in a variable my.new.data
• How many males and females are in this new dataset?
Formatting problems
Data isn't comma-separated?
• Specify the separator in read.table
• Tab-delimited text is another common format, for which you can use sep="\t"
Load "testdata.txt", a tab-delimited version of the data
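A minimal sketch of reading tab-delimited text, using a small invented file rather than testdata.txt so that it runs on its own:

```r
# Write an invented tab-delimited file; "\t" is the tab character
tmp.file <- tempfile(fileext = ".txt")
writeLines(c("height\tweight",
             "170\t65",
             "182\t80"), tmp.file)

# Tell read.table that columns are separated by tabs
tab.data <- read.table(tmp.file, sep = "\t", header = TRUE)
dim(tab.data)   # rows and columns that were read
```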
Data has extra header information at the top?
• Either delete this data in Excel before exporting to csv
• Or, use the skip=N argument to read.table
Have a look at "testdata_1.csv" in Excel and then load it into R using read.table
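The skip argument can be tried on a small invented file first. The two junk lines at the top are made up; the number of lines to skip in testdata_1.csv is something to find out by looking at the file.

```r
# Invented file with two lines of extra information above the real header
tmp.file <- tempfile(fileext = ".csv")
writeLines(c("Exported from a spreadsheet",
             "Some other note",
             "height,weight",
             "170,65",
             "182,80"), tmp.file)

# skip = 2 ignores the first two lines, so the third line becomes the header
skipped <- read.table(tmp.file, sep = ",", header = TRUE, skip = 2)
colnames(skipped)
```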
Factors are inconsistently named
• R will just read in the data you give it.
• If you aren't consistent naming the levels of your factors it will see them as different levels
• R is case sensitive. 'MyLevel' != 'mylevel'
Load the data from testdata_2.csv and have a look at the gender variable.
Try and fix the problems in Excel and reload.
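The exercise asks for a fix in Excel. As an alternative, inconsistent capitalisation can also be repaired inside R. The sketch below uses an invented vector, and it only handles differences in case, not misspellings or stray spaces.

```r
# Invented labels with inconsistent capitalisation
gender <- c("Male", "male", "FEMALE", "female", "Female")
table(gender)   # R treats each spelling as a separate level

# Convert everything to lower case, then turn it into a factor
gender.clean <- factor(tolower(gender))
levels(gender.clean)
table(gender.clean)
```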
Measurements and units in a single column
• If you store values like 10kg, R will not interpret this as a numeric column
Try loading file 'testdata_3.csv' - what has happened to the weights and heights information?
Try loading again so that the two are loaded as character vectors.
Have a look at the sub() function and see if you can fix the problem
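One possible approach with sub(), shown on an invented character vector rather than the tutorial file. When loading the real file, one way to keep such columns as character vectors is the stringsAsFactors=FALSE argument of read.csv.

```r
# Invented weights stored with their units
weights <- c("65kg", "80kg", "72.5kg")
is.numeric(weights)   # these are character strings, not numbers

# sub() replaces the first match of a pattern; here "kg" becomes ""
weights.num <- as.numeric(sub("kg", "", weights))
mean(weights.num)     # numeric functions now work
```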
Excel has just screwed up your data
• Older versions of Excel have a limit of 65536 rows. If you open a larger dataset in Excel it will be truncated. If you then save this dataset you will be saving the truncated version.
Avoid opening large datasets in Excel, use R
• Excel tries to be helpful by formatting elements for you. Try the following and then open in Excel, save as csv and reload into R. What has happened?
my.genes <- c('MASH1','SOX2','OCT4')
write.csv(my.genes, file='mygenes.csv')
Plotting
Drawing histograms
Optional exercises –
1) Try drawing a histogram of height
2) Try and label the x axis [hint: read the help file]
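A minimal sketch of a labelled histogram using hist(). The values are simulated with rnorm() rather than taken from the tutorial data, and the kilogram unit in the label is an assumption.

```r
# Simulated weights, standing in for my.data$weight
set.seed(1)
weight <- rnorm(200, mean = 70, sd = 10)

# xlab sets the x axis label and main sets the title
h <- hist(weight, xlab = "Weight (kg)", main = "Histogram of weight")
```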
Drawing normal QQ plots
qqnorm(my.data$weight)
qqline(my.data$weight)
Drawing scatterplots
plot(height~weight,data=my.data)
Optional exercises: try this. Do you understand this plot?
plot(height~weight,data=my.data,col=as.numeric(factor(gender)))
Drawing boxplots
boxplot(height~gender,data=my.data)
Saving plots
JPEGs
jpeg("boxplot.jpg")
boxplot(height~gender,data=my.data)
dev.off()
PDFs
pdf("boxplot.pdf")
boxplot(height~gender,data=my.data)
dev.off()
Summary statistics
Functions Covered
http://www.statmethods.net/index.html
Writing tables
Calculate Mean and SD
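A minimal sketch using mean(), sd() and tapply() on an invented data frame. The column names follow the tutorial data, but the values are made up.

```r
# Invented stand-in for my.data
toy <- data.frame(height = c(160, 172, 181, 168, 175),
                  gender = c("female", "male", "male", "female", "male"))

mean(toy$height)   # mean of the whole column
sd(toy$height)     # standard deviation of the whole column

# Mean height within each gender group
group.means <- tapply(toy$height, toy$gender, mean)
group.means
```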
Correlate phenotypes and test for group differences
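One way this could look with cor(), cor.test() and t.test(). The data are simulated, so the numbers say nothing about the tutorial dataset, and the t-test is just one choice of group comparison.

```r
# Simulate heights and weights for two groups (invented parameters)
set.seed(42)
n <- 100
gender <- rep(c("female", "male"), each = n / 2)
height <- rnorm(n, mean = ifelse(gender == "male", 178, 165), sd = 7)
weight <- 0.9 * height - 85 + rnorm(n, sd = 8)
sim <- data.frame(height, weight, gender)

# Correlation between two phenotypes, with and without a test
r <- cor(sim$height, sim$weight)
ct <- cor.test(sim$height, sim$weight)

# Test for a difference in mean height between the two groups
tt <- t.test(height ~ gender, data = sim)

r
ct
tt
```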
It is always important to check model assumptions before making statistical inferences
Linear regression
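A minimal sketch of a linear regression with lm(), again on simulated data. The last line draws two of R's standard diagnostic plots, which is one way to start checking the model assumptions.

```r
# Simulated data with a linear relationship plus noise (invented parameters)
set.seed(42)
height <- rnorm(100, mean = 170, sd = 10)
weight <- 0.9 * height - 85 + rnorm(100, sd = 8)
sim <- data.frame(height, weight)

# Fit weight as a linear function of height
fit <- lm(weight ~ height, data = sim)
summary(fit)   # coefficients, standard errors and R-squared

# Residuals vs fitted values, and a normal QQ plot of the residuals
plot(fit, which = 1:2)
```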