
R basics

Albin Sandelin

Ultra-short R introduction

Most life scientists use spreadsheet programs for analysis - like Excel or Statistica. Why?

Ease of use

- click buttons, select data by hand
- you see the data in front of you
- you can do limited programming

Disadvantages

• Hard to handle large datasets (>100 points)

• Inflexible, few analyses available

• “Dumb-down” click button effect

• Hard to repeat analyses systematically with new data

R

• R is a computational environment - somewhere between a program and a programming language
• No buttons, no wizards: only a command line interface
• Is a professional statistics toolset - likely has all the analyses you will ever need in your career
• Is also a programming language
• Can handle large datasets
• Very powerful graphics
• The state-of-the-art statistics program for bioinformatics, especially array analysis
• Free, and open source!

R in this course

In this course, we will use R for
• Exploring data
• Selecting subsets of data
• Plotting data
• Statistical tests on data
• …etc

= almost everything. It is the most important tool in this course!

R is hard

R is challenging to learn:
• No visual cues
• Very different from spreadsheets
• Hard to find the method to use
• Cryptic manual pages (at times)

It requires a lot of effort to get inside the system, but bear in mind that it is worth it - you gain an edge over all spreadsheet users

Reading material

• Keep the reference sheet handy
• Save code from lecture exercises in a text file
• Powerpoint slides are available after lectures at the web page
• Can also recommend the tutorials on the CRAN website – this is how I learned R
• The Dalgaard book is very good, also for future reference

Expected:

• You will have R open at all times during this course

• We will make MANY smaller exercises during class, most often in pairs

• Understanding these exercises is the key to succeeding in the exam

• We will have at least two teachers present at all lectures - use us to get help when needed

• NB: It is YOUR responsibility to make sure that you understand the examples/exercises - if not, you have to ask for clarification during the walkthrough of the problems

First challenge of the day

• Start R

• Type

demo(graphics)

• Hit enter a few times

Starting with R

We will now walk through the most basic R concepts

We will add successive detail on this – with many exercises – and after a while move on to plotting, basic statistics, and finally some programming

Challenging?

• Those who know R or have a math background might find most things for today not-so-challenging

• We cover this to keep everyone on the same ground

Vectors

• In statistics, we are almost always dealing with several "data points"

• A vector is a collection of numbers and/or strings:

– (“albin”, “sanne”, “jens”)

– (1, 5.2, 10, 7, 2, 21)

– (3)

• The last example is a vector of length 1

In R, we make a vector by the c() command (for concatenate)

> c(1,5,10, 7, 2, 1)
[1]  1  5 10  7  2  1

> c('albin', 'sanne', 'jens')
[1] "albin" "sanne" "jens"

When we input strings or characters, we have to surround them with " or '

If making vectors of size 1, we can skip c()
> 3
[1] 3

Anatomy of a command: the method name, the method arguments, and the method output - also called the "return value(s)". In c(1,5,10, 7, 2, 1), c is the method name, the numbers are the arguments, and the printed vector is the return value.

Challenge:

1) Make the following vector: 45, 5, 12, 10

2) What happens with the following commands?
c(1:100)
c(50:2)
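A sketch of what to expect: the colon operator builds integer sequences, counting up or down, so the c() wrapper is actually redundant here.

> 1:5
[1] 1 2 3 4 5
> 5:1
[1] 5 4 3 2 1

c(1:100) prints the integers 1 to 100; c(50:2) counts down from 50 to 2.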

A vector is a data structure, and the most fundamental in R: almost everything in R is some kind of vector, although sometimes in several dimensions – vectors within vectors

The reference sheet is your friend!

• You will get overwhelmed by different command names fast

• Use the reference sheet to remind yourself in all exercises

Assignment to memory

• The c() command is almost useless in itself - we want to keep the vector for other analyses
• The assignment concept:

> 4+5    # add 4 and 5
[1] 9
> a<-4   # store 4 as "a"
> b<-5   # store 5 as "b"
> a      # just checking
[1] 4
> b
[1] 5
> a+b    # add a+b (4+5)
[1] 9    # yes, works

Anything after a # will be disregarded by R. Used for making comments

Expanding this to a whole vector

my_vector<- c(1,5,10, 7, 2)

Note that there is no "return value" printed now - it is caught by my_vector. What happens if you type my_vector after this?

my_vector is a variable, with the variable name my_vector. Variable names are totally arbitrary!

The anatomy of the vector:

Name   my_vector
Values   1   5   10   7   2
Index   [1] [2]  [3] [4] [5]

We can access part of the vector like this:

my_vector[5] # will give the 5th item in the vector

What happens if you do this?

my_vector<- c(1,5,10, 7, 2) # define the vector

my_vector [c(1,3,5)]

my_vector[1:4]

my_vector[4:1]

> my_vector[c(1,3,5)]
[1]  1 10  2
> my_vector[1:4]
[1]  1  5 10  7

Making a series of integers: A:B will give a vector of A, A+1, A+2 … B

> my_vector[4:1] # also works backwards
[1]  7 10  5  1

Challenge

• Using the reference sheet, figure out at least three ways of making R print your vector in the other direction

my_vector<- c(1,5,10, 7, 2) # define the vector

> my_vector[c(5,4,3,2,1)]

[1] 2 7 10 5 1

> my_vector[c(5:1)]

[1] 2 7 10 5 1

> rev(my_vector)

[1] 2 7 10 5 1


Interlude: naming rules and the danger of over-writing

Naming: We can name vectors almost anything. The most basic rule is: never start a vector name with a number

> a<- c(1,5,4,2)  # OK
> 1a<- c(1,5,4,2) # NOT OK
Error: syntax error
> a1<- c(1,5,4,2) # OK

Over-writing:

my_vector<- c(1,5,10, 7, 2)
my_vector<- c(10,5,2, 3, 1)
# what does my_vector contain now?

Analyzing vectors

• Many functions work directly on vectors - most have logical names
• For instance, length(my_vector) gives the number of items in the vector (= 5)
• Challenge:
  – Make a vector big_vector with values 1 to 10000 using the a:b construct
  – What is the
    • Length of the vector
    • Sum of all items in the vector: the sum() function
    • Mean (= average) of all items in the vector: the mean() function

> big_vector<-(1:10000)
> length(big_vector)
[1] 10000
> sum(big_vector)
[1] 50005000
> mean(big_vector)
[1] 5000.5

Interlude: the R help system

• R has hundreds of functions, and most have a non-trivial syntax. The easiest way of retrieving documentation is by saying

?function_name            # OR
help(function_name)       # OR
help.search("something")  # for 'fuzzy' searches

• Look at the help for sample() and sort() and then try them out on big_vector
• Also try out RSiteSearch("sample") – what happens?

Adding and multiplying a number to a vector

Sometimes we want to add a number, like 10, to each element in the vector:

> big_vector + 10

Test this:

big_vector2<-big_vector + 10

Also test:

min(big_vector)
max(big_vector)
min(big_vector2)
max(big_vector2)

What happens?

> big_vector2<-big_vector + 10
> min(big_vector)
[1] 1
> max(big_vector)
[1] 10000
> min(big_vector2)
[1] 11
> max(big_vector2)
[1] 10010

Adding vectors

• We can also add one vector to another vector
• Say that we have the three vectors

A<-c(10, 20, 30, 50)
B<-c(1,4,2,3)
C<-c(2.5, 3.5)

• Test what happens and explain the outcome:

A+B

A+C

A+B is easy to understand: A[1]+B[1], etc

A+C is trickier - the C vector is just of length 2. It is re-used! So

A[1]+C[1]=12.5

A[2]+C[2]=23.5

A[3]+C[1]=32.5

A[4]+C[2]=53.5

Actually, this is what is happening also with A+10 …the 10 is used many times.

> A+B
[1] 11 24 32 53
> A+C
[1] 12.5 23.5 32.5 53.5

(with A<-c(10, 20, 30, 50), B<-c(1, 4, 2, 3), C<-c(2.5, 3.5) as defined above)
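A small extra sketch: if the shorter vector's length does not divide evenly into the longer one's, R still recycles it, but warns:

> A + c(1, 2, 3)    # length 3 recycled against length 4
[1] 11 22 33 51
Warning message:
longer object length is not a multiple of shorter object length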

Plotting vectors

• Let's make up some semi-random data:

dat<-rnorm(100) # draw 100 random, normally distributed data points

• Test the following:

plot(dat)
plot(dat, type="l")
barplot(dat)
hist(dat)

• What is the difference?

Why are your three first plots different than mine? Why is your last plot more similar to mine?

Graph options

• Generally, you can give extra options to graphical commands like this
• plot(some_vector, col="blue", type="l")
• In plot: try to vary the following options - note that you can use several at once (and figure out what they do)

type="l", "b"
col="hotpink"
main="important plot"
type="h"
type="S"

These options are really arguments to the plot() function
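For instance, several of the options above can be combined in one call - a quick sketch to try and vary:

plot(dat, type="b", col="hotpink", main="important plot")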

More about functions

In almost all cases, a function needs some input, like plot(dat).

'dat' here is an unnamed argument, and this works because plot() assumes we mean x values = dat.

We could also say plot(x=dat) - a named argument. If you have many arguments, most of them are named - such as

plot(some_vector, col="blue", type="s")

The help pages will tell you what type of arguments you can use

The danger of unnamed arguments…

• …is that the order of them will make a big difference. Try this out - what is the difference between the plot commands?

a<-rnorm(100)
b<-rnorm(100)*2
plot(a,b)
plot(b,a)
plot(x=b, y=a)

[Three scatterplots: plot(a,b) puts a on the x axis and b on the y axis; plot(b,a) and plot(x=b, y=a) both put b on the x axis and a on the y axis - so those two look the same.]

Some generic R arguments to plots - the par() function

The par() function is used to set general plot properties. It has hundreds of possible arguments - see ?par

Two very handy par() arguments are mfrow and mfcol - these allow many plots on one page

You give these arguments a vector of length 2 - this gives the number of cells (rows and columns) on the page (see example)

Example:

> par( mfrow=c(3,1) )
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

> par( mfrow=c(2,2) )
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

Challenge - can you get the three plots in a row using mfrow?

> par( mfrow=c(1,3) )
> plot(a,b)
> plot(b,a)
> plot(x=b, y=a)

Overlaying plots

• Sometimes we want to put many data sets within one graph, on top of each other
• This is often done by the lines() or points() command, like this

> plot(b, type="l", col="blue")
> lines(a, col="red")

Why did I start with plotting b? What would have happened if using points() instead of lines()?

Sizing graphs

Simple concept, but awkward to write

Change X scale: xlim=c(start_value, end_value)

Change Y scale: ylim=c(start_value, end_value)

> plot(a, type="l", col="blue")> plot(a, type="l", col="blue", ylim=c(-5,5))

Saving graphs

• Different on different systems!

• All systems can use the original tricky way with device() – see introduction PDFs if you have an interest. We won’t use this.

OSX

• Click on the graph, and just copy it. Will become a pdf or a bitmap when pasting.

• On my mac you have to open a new file in Preview and paste there first. On others you can paste it into other programs directly

• Or, just have the graphics window active and do file->save

Windows

Right-click on graph, copy as metafile or bitmap, paste

Linux

• Use the awkward device system, like pdf(…), or
• Use GIMP and just make a screen capture from there by "acquire"
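For completeness, a minimal sketch of the device approach mentioned above (works on all systems; the filename is up to you):

pdf("my_plot.pdf")   # open a pdf "device" - a file to draw into
plot(a, b)           # plotting commands now go to the file
dev.off()            # close the device; the file is written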

Statistics part 1

• Describing the data: descriptive statistics

• What is the center point of the data?

• What is the spread of the data?

• How does the distribution look?

Summary statistics

• hist() (= histogram) is a graphical way of summarizing distributions – it creates a number of "bins" and counts how many of the data points fall into each bin.

• We can also summarize by the center points in the data:
  – mean()
  – median(): sort all the data, pick the number in the center. If the number of data points is even, take the mean of the two center points
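A tiny worked example of the even/odd rule:

> median(c(1, 3, 5))       # odd count: the middle value
[1] 3
> median(c(1, 3, 5, 100))  # even count: mean of the two center points (3 and 5)
[1] 4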

• Challenge: we make another vector

dat2<-rnorm(10)

and add a few extra points to it

dat2<-c(dat2, 10, 10.5, 30)

• Test mean() and median() on dat2. Are they the same?

Can you explain the differences by plotting a histogram? What is the advantage/disadvantage of each measure?

> dat2<-rnorm(10)
> dat2<-c(dat2, 10, 10.5, 30)
> median(dat2)
[1] 0.11125
> mean(dat2)
[1] 3.636139
> hist(dat2)

Means are sensitive to outliers! Very common situation in genomics…

Percentiles

• An extension of the median concept
• Best explained by example:
  – the 20th percentile is the value (or score) below which 20 percent of the observations may be found
• The median is the same as the 50th percentile
• The first quartile is the 25th percentile, the third is the 75th

• Try summary(dat) and summary(dat2)

> summary(dat)
     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
 -2.20600 -0.60480 -0.06930 -0.01333  0.68030  3.49600 
> summary(dat2)
    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
 -2.4010 -0.5385  0.1113  3.6360  0.6771 30.0000 

The command ecdf() (empirical cumulative distribution) calculates "all" percentiles in your data – and it also understands plot(). Try plot(ecdf(dat2))

[ECDF plot: the x axis shows the sorted values of dat2; the y axis shows what fraction of the data has been covered at point x.]

Boxplots

• An "easier" representation of ECDFs. It is based on making boxes that tell us about both the center point and the "spread" of the data
• First, calculate the first quartile, the median and the third quartile
• Calculate the "inter-quartile range" (IQR): 3rd quartile - 1st quartile
• These will be used to draw a "box"

> boxplot(dat)
> rug(dat, side=2)

[Boxplot annotated with the median, the 1st and 3rd quartiles, and the interquartile range (IQR).]

…continued

• Sounds more complicated than it is: any data observation which lies more than 1.5*IQR lower than the first quartile is considered an outlier. Indicate the smallest value that is not an outlier by a vertical tic mark or "whisker", and connect the whisker to the box via a horizontal line. Do the same for the higher values

[Boxplot annotated with: the median, the 3rd quartile, the 1st quartile, the whisker at the data point that is highest but within 1.5 IQR from the center, and more extreme data shown as outliers.]

Variance, standard deviation and data spread

• What is the difference between these distributions?

• Same mean and median, but different spread over the x axis

• This can be measured by the variance of the data:

• It is basically the average of the squared differences between each point and the mean

• The standard deviation is simply the square root of the variance.

• This gives nice statistical features that you will get to know if you study statistics
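To make the definition concrete, a sketch that computes the sample variance "by hand" and compares it to the built-in var() (which also divides by n-1):

x <- rnorm(50)
manual_var <- sum( (x - mean(x))^2 ) / (length(x) - 1)
all.equal(manual_var, var(x))   # TRUE - identical up to rounding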

Why is variance and standard deviation important?

• Variance tells you something about the quality of measurements

• The higher the variance, the harder it is to say with certainty that two measurements are different

• (whiteboard demo)

• What are the R functions for variance and standard deviation?

• If we make some random data

smallset<-rnorm(100)

largeset<-rnorm(10000)

What is the variance and standard deviation for these?

Is the standard deviation really the square root of the variance (what is the function for square root?)

> smallset<-rnorm(100)
> largeset<-rnorm(10000)
> var(smallset)
[1] 0.9504884
> var(largeset)
[1] 0.98416
> sd(largeset)
[1] 0.9920484
> sd(smallset)
[1] 0.97493
> sqrt(var(smallset))
[1] 0.97493

Why do we get about the same variance?

Dealing with real data

• …involves interacting with messy data from other people

• Almost all interesting data is “real”

• We need methods to get the data into R to start with.

Reading data from files

• Typically, you have data in a .txt or .xls file - how do you get it into R for analysis?

• Four steps:
  1. Format it to a tab-separated text file
  2. Tell R what directory to work in
  3. Read in the file
  4. Attach data to memory (optional)

What R likes:

• A "data frame" (= table) with columns. The first row can be column names, like this:

Lecturer  height  age  Poker_games_won
Albin     181     34   14
Jens      189     31   10
Sanne     170     NA   21

[Annotations on the table: the first row is the name row, the "header"; the Lecturer column is string data; the rows below the header are data rows with numerical data; NA is the code for missing data - do not ask ladies for age…]

Reading data(1)

• General rules:
  – Use tabs as separators
  – Each row has to have the same number of columns
  – Missing data is NA, not empty

• As .txt: The ideal format!

• As .xls: save as tabbed text

Get some data to play with

Update: the old file is wrong! Please download the data again.

• See the course web page!
• http://people.binf.ku.dk/albin/teaching/htbinf/R/
• Download the zipped file – save this at a good place so that you know where it is
• I will refer to this place on your hard drive as "the data directory"

Reading data(2)

• Tell R where to work.
  – In R, you always work in a given directory, called the "working directory".
  – You can set this graphically in Windows or MacOS (next pages)
  – Or use setwd("some directory")
  – In Linux, it pays to start R in the same directory
• It pays to have one directory per larger analysis, and to call the data files informative things (not "Data.txt")

[Screenshots: finding the current working directory and changing the working directory via the GUI, in OSX and in Windows.]

Reading in data (3)

my_data<- read.table("filename", header=T)

Here, read.table(...) is the actual command; header=T is needed only if you have column names – I always have that; and my_data becomes the data frame object holding all the data.

Let's try…

• In the data directory, there is a file “flowers.txt”. Have a look at it (it is a normal text file – can be opened in almost anything). Now:

my_data<-read.table("flowers.txt", header=T)

names(my_data)

What happens – what does “names” do?

> my_data<-read.table("flowers.txt", h=T) # note to self: show name completion feature
> names(my_data)
[1] "Sepal.Length" "Sepal.Width"  "Petal.Length" "Petal.Width"  "Species"

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
4.7          3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
Etc…

Accessing the columns

• There are three ways of getting the data in a data frame
  – Get the column by saying dataframe_name$column_name
  – Get the columns as separate vectors, named as their header - the attach() command
  – More complex queries, using index operations (more in a few slides)
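A quick sketch of the first way, the $ syntax:

my_data$Sepal.Width         # the column as a vector
mean(my_data$Sepal.Width)   # can be used directly inside functions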

Columns by names

• The attach(some_data_frame) function will put all the columns of some_data_frame as vectors in working memory
• So, in our case

attach(my_data)

• Will give the 5 vectors "Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width" and "Species"
• (which we then can analyze)

Dangers of the attach() function

• attach() will make a vector out of each column, named after the column name

• What happens if we already have a vector with that name?

• What does ls() and detach() do? Why would we want to do this?

Challenge

• Using the flowers file,
  – Plot a histogram of the length of sepals ("Sepal.Length")
  – Plot the correlation between sepal length and sepal width (an x-y plot)
• What is the mean of Sepal.Length and Sepal.Width?
  – Why do you get strange results on Sepal.Length?
  – Can you solve the problem by using the appropriate help file?

> hist(Sepal.Length)
> plot(Sepal.Length, Sepal.Width)

> mean(Sepal.Width)
[1] 3.057333
> mean(Sepal.Length)
[1] NA
# strange!

If we look at the file:

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
NA           3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
5.4          3.9         1.7          0.4         setosa
4.6          3.4         1.4          0.3         setosa

Missing values

• In real data, we often have missing values. In R, we label these “NA”

• Any examples of missing data?

• Some functions can by default deal with these - for instance hist()

• Some functions will need to be told what to do with them, like mean()

> mean(Sepal.Length)
[1] NA
# na.rm is the option for mean to remove all NAs first
> mean(Sepal.Length, na.rm=T)
[1] 5.851007

With new files, you should check whether there are any NA values first.
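One way to do that check, as a sketch:

> any(is.na(Sepal.Length))   # TRUE if there is at least one NA
[1] TRUE
> sum(is.na(Sepal.Length))   # how many NAs there are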

What if we wanted just the 5 first rows? Or just all values from the species versicolor?

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
4.7          3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
6.2          2.9         4.3          1.3         versicolor
5.1          2.5         3.0          1.1         versicolor
5.7          2.8         4.1          1.3         versicolor
6.3          3.3         6.0          2.5         virginica
5.8          2.7         5.1          1.9         virginica
7.1          3.0         5.9          2.1         virginica
6.3          2.9         5.6          1.8         virginica

The index concept

• Recap: in a vector, we can select individual values by saying
• Sepal.Width[some_position]

> Sepal.Width[5]
[1] 3.6

• Sepal.Width[some_positions]

> Sepal.Width[2:10]
[1] 3.0 3.2 3.1 3.6 3.9 3.4 3.4 2.9 3.1
> Sepal.Width[c(2,4,6,8)]
[1] 3.0 3.1 3.9 3.4

• The first position will always have the number 1, the last will have the length of the vector.

• Challenge: what is the mean of the first 100 values of Sepal.Width?

Albin’s code

> a<-Sepal.Width[1:100]
> mean(a)
[1] 3.099
# or, in just one line
> mean(Sepal.Width[1:100])
[1] 3.099

Selecting parts of vectors depending on values

• Often, we want parts of a vector based on the values of the vector - for instance all values <10, or all strings equaling "versicolor".
• This is done by a slightly odd syntax (it will make sense later on when we have many vectors)
• To get all values over 10 in Sepal.Length:

Sepal.Length[ Sepal.Length > 10 ]

• So, we are using a "criterion" instead of the index
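What is going on under the hood, sketched with a toy vector: the criterion is itself a vector of TRUE/FALSE values, and the TRUE positions are kept:

> x <- c(4, 12, 8, 15)
> x > 10
[1] FALSE  TRUE FALSE  TRUE
> x[x > 10]
[1] 12 15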

Try out

• Make a boxplot of all values in Sepal.Length that are higher than 5

• How many values are there that satisfy this criterion?

Albin’s code

a<-Sepal.Length[Sepal.Length > 5]
boxplot(a)
length(a)
[1] 119

Criteria in R

• <   smaller than
• >   bigger than
• >=  bigger than or equal to
• <=  smaller than or equal to
• ==  equal to (not "=")
• !=  not equal to

• Challenge: how many rows for the virginica species?

> length(Species[Species==virginica] )
Error in NextMethod("[") : object "virginica" not found

# what happens here is that R thinks virginica is an object - a vector or something else

> length(Species[Species=="virginica"] )
[1] 50

For string comparisons, we need to enclose the string with " " or ' '

Selection using two vectors

• Often we have two vectors from a file, and we want to use the value of one vector to select values from the other.

• For instance, sepal length broken up by species

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
4.7          3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
6.2          2.9         4.3          1.3         versicolor
5.1          2.5         3.0          1.1         versicolor
5.7          2.8         4.1          1.3         versicolor
6.3          3.3         6.0          2.5         virginica
5.8          2.7         5.1          1.9         virginica
7.1          3.0         5.9          2.1         virginica
6.3          2.9         5.6          1.8         virginica

• Use the same syntax as before - but change the vector names!
• some_vector[ other_vector == something ]
• Challenge:
  – Select the Sepal.Width values that correspond to the setosa species
  – What is the mean of these values?
  – What is the distribution of these values?

Albin’s code

> a<-Sepal.Width[Species=="setosa"]
> mean(a)
[1] 3.428
> hist(a)

Fancy way of breaking up data (only works with some functions)

• boxplot(Sepal.Width~Species)

• Try it out…

• This is a “formula” construction - break up width in the levels of species

Even more complex: multiple criteria

• & "and": combine two criteria
• | "or": either this or that has to be true

So, Species[ Sepal.Width > 4 & Sepal.Length < 5 ] will give the species rows where width is > 4 AND length is < 5.

…and, Species[ Sepal.Width > 4 | Sepal.Length < 5 ] will give?

Challenge: make a boxplot of Sepal.Width where Sepal.Length > 3 AND the species is virginica

> boxplot(Sepal.Width[Sepal.Length > 3 & Species=="virginica"])

More complex data selections

• The dataframe is really a two-dimensional vector - a vector that contains vectors
• We can use almost the same commands to select sub-sets of the data frame, without using the attach() function
• This is very handy at times

• The data frame is really a big table - a 2D vector.
• Just as some_vector[5] gives us the fifth value in a vector, some_dataframe[10, 2] gives us the cell in the 10th row and the 2nd column

> my_data[10, 2]

[1] 3.1

Sepal.Length Sepal.Width Petal.Length Petal.Width Species
5.1          3.5         1.4          0.2         setosa
4.9          3.0         1.4          0.2         setosa
4.7          3.2         1.3          0.2         setosa
4.6          3.1         1.5          0.2         setosa
5.0          3.6         1.4          0.2         setosa
5.4          3.9         1.7          0.4         setosa
4.6          3.4         1.4          0.3         setosa
5.0          3.4         1.5          0.2         setosa
4.4          2.9         1.4          0.2         setosa
4.9          3.1         1.5          0.1         setosa
5.4          3.7         1.5          0.2         setosa
4.8          3.4         1.6          0.2         setosa
4.8          3.0         1.4          0.1         setosa

Selecting rows

• If we instead say
• my_data[10, ]
• We will get the WHOLE 10th row back
• One can think of it as saying my_data[10, "give me everything"]

• So, what will happen if we say

my_data[1:5, ]

• Or

my_data[5:1, ]

• Try it out…

Indexing exercise

• In the data directory, there is a file called Rindexing.pdf – it is a set of small exercises designed to help you understand indexing

• It works with a very small data file, just so that you can check your results by eye

• Indexing is very important for the rest of the course

Nota Bene:

1) Students on the reserve list
• Please talk to me to get signed on to the course (I need your student number or similar)
• Otherwise you cannot go to the exam => no credit

2) A watch was found in the last lecture – it will be given back if you can describe it well

Sorting one vector by another

• Very common situation with experimental data – what are the genes with the highest expression in my data set?

• Recap: what happened if you said

my_data[1:5, ]

• Or

my_data[5:1, ]

• ?

# get some data to play with

d<-read.table("mtcar.txt", h=T)

attach(d)

Just by looking at the data,

d;

what car has the highest and lowest horsepower (=hp)?

car_name mpg cyl disp hp drat wt qsec vs am gear carb
1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

Permutations

sort.list(x) will give the permutation of x that sorts x

It is much easier to explain with a small example…

> a<-c(3,1,10)
> sort.list(a)
[1] 2 1 3
> b<-sort.list(a)
> a[b]
[1] 1 3 10
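A side note: order() does the same job as sort.list() for a vector, and is the name you will see most often:

> order(a)
[1] 2 1 3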

> sort.list(hp)
[1] 19  8 20 18 26 27  3  9 21  6 32  1  2  4 28 10 11 22 23  5 25 30 12 13 14 15 16 17  7 24 29 31


Thought exercise

What happens if we now say

hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)]

?

We get hp sorted…

hp[c(19 , 8 ,20 ,18, 26, 27, 3, 9, 21, 6, 32, 1, 2, 4, 28, 10, 11, 22, 23, 5, 25, 30, 12, 13, 14 ,15 ,16 ,17 , 7 ,24 ,29, 31)]

[1] 52 62 65 66 66 91 93 95 97 105 109 110 110 110 113 123 123 150 150 175 175 175 180 180 180 205 215 230 245 245 264 335

So, sort.list really gives an instruction how to sort

So, we can also say

sort_order<-sort.list(hp)

hp[sort_order]

What will happen if we say

car_name[sort_order]

?

car_name[sort_order]

[1] Honda Civic          Merc 240D            Toyota Corolla       Fiat 128             Fiat X1-9            Porsche 914-2        Datsun 710
[8] Merc 230             Toyota Corona        Valiant              Volvo 142E           Mazda RX4            Mazda RX4 Wag        Hornet 4 Drive
[15] Lotus Europa        Merc 280             Merc 280C            Dodge Challenger     AMC Javelin          Hornet Sportabout    Pontiac Firebird
[22] Ferrari Dino        Merc 450SE           Merc 450SL           Merc 450SLC          Cadillac Fleetwood   Lincoln Continental  Chrysler Imperial
[29] Duster 360          Camaro Z28           Ford Pantera L       Maserati Bora

32 Levels: AMC Javelin Cadillac Fleetwood Camaro Z28 Chrysler Imperial Datsun 710 Dodge Challenger Duster 360 Ferrari Dino Fiat 128 ... Volvo 142E

So, we use the hp permutation to reorder the cars!

We can also do this with the whole dataframe, in the same way

> d[sort_order, ]

car_name mpg cyl disp hp drat wt qsec vs am gear carb

19 Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2

8 Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2

20 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1

18 Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1

26 Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1

27 Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2

3 Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1

9 Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2

21 Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1

6 Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1

32 Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2

1 Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4

2 Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4

4 Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1

28 Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

10 Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4

11 Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4

22 Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2

23 AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2

5 Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2

25 Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2

30 Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6

12 Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3

13 Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3

14 Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3

15 Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4

16 Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4

17 Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4

7 Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4

24 Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4

29 Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4

31 Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8

What type of data are we looking at?

• Type here refers to what kind of data representation - for example “strings” or “numbers”

• R usually does a good job at interpreting this for us when we read data

• This becomes important later on, as some functions only take vectors of a certain data type - for instance, you cannot take the mean() of a vector with “words”!

Simple types

• Numbers:

num_vector<- c(1,5,2,10)

• Strings (or characters):

String_vector<-c("a", "trial1")

• How can we know?

class(num_vector)

[1] "numeric"

Factors/groups

• Factors: slightly more tricky. A factor is often a grouping variable - such as the gender or experiment group, like

gender<-c('f', 'f', 'm', 'm', 'f')
experiment_number<-c(1,2,4,5,5,6)

• The problem here is that R will think that the first vector is a string vector, and the last a numerical vector.

> class(gender)
[1] "character"
> class(experiment_number)
[1] "numeric"

• This is resolved by

> experiment_number<-as.factor( c(1,2,4,5,5,6) )
> class(experiment_number)
[1] "factor"

Over to analysis of data!

boxplot (Sepal.Width~Species)

Is there a real difference?

We now move over from "pure R" to "using R to make statistical tests"

• Focus is: what typical tests are there and how to use them

• We skip almost all statistical theory(!) - covered in the course in fall

Population and sample

• Population refers to a set of potential measurements or values, including not only cases actually observed but those that are potentially observable. Really depends on the question at hand!
• Sample is the data we actually have - generally this is very small compared to the whole population, but we use it to draw inferences about the population
• So, the sample is a small part of a much larger population, in most cases

• Can sample==population? Examples?

Significance tests

• In real life, we always get some difference between datasets, because we sample limited data, and there is always noise
• So, is the difference in sepal width we see between species something that is just due to "luck" - or is it real?
• Challenge: what three features in the dataset will be important for finding this out (significant difference or just difference due to random noise)? Discuss with the person next to you for a minute

Albin’s take

• The observed difference
  – Obviously, the bigger the observed difference, the less random it is
• The data depth
  – The more data you have, the less likely it is that strange things happen due to random effects
• Data spread (variance)
  – Slightly less intuitive. The less spread, the less the difference can be explained by random events.

The concept of null hypothesis and P-values

• Always in statistics, we set a null-hypothesis. This is basically what we want to disprove!

• Usually, this is something like “there is no difference in mean in these two samples”. The alternative hypothesis is then, by default, “There is a difference”

• Then, we test how likely this is, with certain assumptions (depending on what test)

• The P-value of a test can be interpreted as the chance that the observed difference occurs if the null hypothesis is true– This only makes sense if your test is relevant!

• So, a small P-value means that the null hypothesis is not a good model - which means that there is a difference

• We often want P-values to be as small as possible - the standard cutoff for “significance” is 5%, or 0.05

• However, this is just tradition – it is completely arbitrary

Two-sample t-test

• [deliberate skip of math details – Jens will cover some later on]

• The t-test tests whether two samples come from distributions with the same mean.

• The assumption is that both distributions are “normal-shaped” - we should always check this first

[Figure: a normal distribution annotated with its mean and its standard deviation – a measure of spread.]

Normality - brief check

• There are many tests for normality. Now we will do the simplest: by eye. Make a histogram and see if the data looks reasonably "bell-shaped"
• Challenge: test making histograms for the sepal length of each species

Albin’s take

#warning: wrong approach

hist(Sepal.Length)

# not very normal-shaped!

#why is this the wrong approach?

Albin’s take, again

# now for each species

par (mfrow=c(1,3))

hist(Sepal.Length[Species=="setosa"])

hist(Sepal.Length[Species=="versicolor"])

hist(Sepal.Length[Species=="virginica"])

#IMHO, these are not very far away from normal-shaped given the amount of data

Slightly more advanced way – the quantile-quantile plot

• If two distributions have the same shape, their cumulative distributions should match well

[Figures: plot(ecdf(Sepal.Length)) and plot(ecdf(rnorm(50))) show empirical cumulative distributions, with x on the horizontal axis and Fn(x) from 0.0 to 1.0 on the vertical axis; qqnorm(Sepal.Length) and qqnorm(Sepal.Length[Species=="setosa"]) show Normal Q-Q plots, with Theoretical Quantiles on the horizontal axis and Sample Quantiles on the vertical axis.]
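A handy companion to qqnorm(), as a sketch: qqline() draws a reference line through the quartiles of a Q-Q plot - if the points hug the line, the data is close to normal-shaped:

qqnorm(dat)
qqline(dat)   # reference line through the 1st and 3rd quartiles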

So, are they different?

> setosa<-Sepal.Length[Species=="setosa"]
> versicolor<-Sepal.Length[Species=="versicolor"]

> t.test(setosa, versicolor)

        Welch Two Sample t-test

data:  setosa and versicolor
t = -10.521, df = 86.538, p-value < 2.2e-16   # Very!
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -1.1057074 -0.7542926
sample estimates:
mean of x mean of y 
    5.006     5.936 

Lets test it with some real data

In your data directory, there are two files with binding site lengths, from estrogen receptor alpha and beta: ERA_widths.txt, ERB_widths.txt

1) Read these into two different vectors, called era, erb

2) Have they significantly different means?

Albin's code

> d<-read.table("ERA_widths.txt", h=T)
> attach(d)
> d<-read.table("ERB_widths.txt", h=T)
> attach(d)

> t.test(ERA_width, ERB_width)

Welch Two Sample t-test

data:  ERA_width and ERB_width
t = 2.7288, df = 1066.975, p-value = 0.00646
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 11.18740 68.45565
sample estimates:
mean of x mean of y 
 514.1370  474.3155 

Hey, we forgot to check for normality!

> par(mfrow=c(1,2))
> hist(ERA_width)
> hist(ERB_width)
# NOT really normal shaped! t.test is not valid!

The two-sample Wilcoxon test

• The t-test assumes that samples are normally distributed, which is often not true for biological data
• The Wilcoxon test, wilcox.test(), is a less powerful test which does not make such assumptions.
• Tests of this kind are called non-parametric
• Use them when the normality assumption is doubtful

Challenge

• Repeat the analysis of the ERA/beta sites using wilcox.test. Is the difference significant?

> wilcox.test(ERA_width, ERB_width)

Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0

The “AHH, so close to significance” situation

Different types of alternative hypothesis

Null hypothesis: "there is no difference"

Default alternative hypothesis: "there is a difference". This is called "two-sided", as the differences can be in either direction

Optional: we can make the alternative hypothesis "one-sided" - like "there is a difference - distribution A has a greater mean than distribution B". This increases the statistical power: the P-value is roughly halved

In R:

> wilcox.test(ERA_width, ERB_width)

        Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.05551
alternative hypothesis: true location shift is not equal to 0

> wilcox.test(ERA_width, ERB_width, alternative="greater")

Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.02776
alternative hypothesis: true location shift is greater than 0

# what happens if alternative="less"?

> wilcox.test(ERA_width, ERB_width, alternative="less")

Wilcoxon rank sum test with continuity correction

data:  ERA_width and ERB_width
W = 151243.5, p-value = 0.9723
alternative hypothesis: true location shift is less than 0

Dangers with “hypothesis fiddling”

• This should never be used to get “twice as good P-value” without some justification

• Homework for Tuesday: read BMJ 1994;309:248 (linked at the web page) for discussion on when this is appropriate

Testing many samples at once

• Sometimes, we have a table of groups, and we want to see if any of the groups are different in means

        treatment1  treatment2  treatment3
plant1
plant2
plant3
etc

• This is the typical one-way Anova test situation

boxplot (Sepal.Width~Species)

Is there a real difference?

Last time we checked this, we just tested setosa vs versicolor

In R, one-way ANOVA tests can be made easy by the oneway.test() command

The oneway.test() works by giving it a formula, in our case Sepal.Width~Species. This will basically give the function a "table" with three columns of values, one for each species. It will then test if the means of any of these columns are different from the others

Test flowers set

> d<-read.table("flowers.txt", h=T)
> attach(d)
> oneway.test(Sepal.Width~Species)

One-way analysis of means (not assuming equal variances)

data:  Sepal.Width and Species
F = 45.012, num df = 2.000, denom df = 97.402, p-value = 1.433e-14

As expected, very, very significant. Is the difference in length of sepals (not widths) even more significant? Test it!

Non-parametric alternative to Anova

As with t-tests, Anova tests assume that each of the columns has a normal-like distribution. This is often not the case

The Kruskal-Wallis test is analogous to the Wilcoxon test - a non-parametric variant. Practically, it works in exactly the same way as oneway.test().

                                            2 data sets     >2 data sets
Normal distribution                         t.test()        oneway.test()
Not normal distribution (non-parametric)    wilcox.test()   kruskal.test()

Larger exercise

In the data folder, there is file gene_lengths.txt, showing gene lengths from three different chromosomes.

Is there a significant difference in the center points (means or medians) of gene lengths for the different chromosomes? What is the method of choice? What two chromosomes have the most/least significant difference?

> d<-read.table("gene_lengths.txt", h=T)> attach(d)> names(d)[1] "id" "chr" "gene_length”#first - is the gene length distribution normal-shaped?> hist(gene_length)# nope!# => nonparametric tests # needed

# many more than two groups, and a non-parametric test needed
# => Kruskal-Wallis test
> kruskal.test(gene_length~chr)

Kruskal-Wallis rank sum test

data:  gene_length by chr
Kruskal-Wallis chi-squared = 131.7681, df = 2, p-value < 2.2e-16
# very significant difference!

# are all chromosome pairs significant?
# test case: use wilcox.test on two chromosomes
> wilcox.test(gene_length[chr=="chr2"], gene_length[chr=="chr3"])

Wilcoxon rank sum test with continuity correction

data:  gene_length[chr == "chr2"] and gene_length[chr == "chr3"]
W = 1068584, p-value = 0.1822
alternative hypothesis: true location shift is not equal to 0
# so, not all pairs are significantly different!

      chr1   chr2      chr3
chr1  1      2.2E-16   2.2E-16
chr2         1         0.1822
chr3                   1

#for us who are lazy - we can do all vs all tests like this:

> pairwise.wilcox.test(gene_length, g=chr )

Pairwise comparisons using Wilcoxon rank sum test

data: gene_length and chr

     chr1    chr2
chr2 <2e-16  -
chr3 <2e-16  0.18

P value adjustment method: holm

Small homework I: reading about one- and two-sided tests

What is horribly wrong in this (published) study?

Small homework (II)

What is the statistical problem with making these pairwise tests en masse, for instance making 1000 t-tests and reporting all cases where P<0.05?

Hint: Go back to the slides about what the 5% P value cutoff really means

Homework and Examination

The big homework for examination goes online this evening!

Examination

• Three home works during the course
  – These will together give 50% of the final points
  – The main focus here is technical skill (how to do analysis x)
  – Theoretically optional. Very strict deadlines.
• One large home work at the end of the course
  – Will give 50% of the final points
  – Overlaps substantially with the three first, but is more high-level
  – The main focus here is to merge technical skill with biologically motivated analysis - real science
• Do the math…the three smaller home works pay off
• Note: a high level of activity is needed from the start of the course. It will be demanding! The home works are the examination!

• The points with homework
  – Encourage study during the course
  – It is easy to think that you understood something, but when meeting the real thing…
  – The only realistic way of training
  – Evaluation

Rules for homework

• Working together in 2-3 person teams is allowed and encouraged, BUT
  – Each person hands in a separate report
  – …which must be unique. This is usually not a problem if you write it yourself…
  – If you work with others, write who you worked with on the report ("I worked with X and Y on this homework!")

• Cases of copying between students WILL be reported. The BEST thing that can happen is that you will get lower scores

Using material from others

• Copying text/images from other sources – including fellow students – is considered cheating

• You can quote sources, given that you cite them, but this has a limit

Questions?

Correlations

• Sometimes, we do not want to show that two things are different, but that they are correlated.
• For instance, weight and height for individuals are clearly different, but also highly correlated
• This is a positive correlation - but we could also have a negative correlation
• Correlations are dangerous things - it depends on how we ask questions. Easily the most misused and misunderstood part of statistics
• Correlation != Causation

Consider : Why is it that…

Truth: Compared to the general population, there is a strong correlation between being an American football player and risk for testicle cancer

Interpretation: American football leads to cancer!

Consider : Why is it that…

Truth: There is an inverse correlation between yearly income and running speed

Interpretation: carrying money makes you slow!

Time to explore

Open the height.txt file in R, and look at the different data types. Try to plot various columns against each other, like

plot(mean_height, year)

What columns correlate? Can we break up the data even more?

Try it out!

> plot(mean_weight, year)
> plot(mean_height, year)
> plot(mean_height, mean_weight)

Hmmm…obviously there are TWO datasets here! How can we break this up?

Visualize the groups with color

plot(mean_weight, year)
lines(mean_weight[gender=="F"], year[gender=="F"], col="red")
lines(mean_weight[gender=="M"], year[gender=="M"], col="blue")
# a typical case of OVER-plotting

Assessing correlations

• First: plot a standard x-y plot

• Second, make a correlation test:

• cor(x, y) gives the Pearson correlation coefficient between x and y, which goes from -1 (perfect negative correlation) to +1 (perfect positive correlation). 0 = no correlation

Test it…

> cor(mean_weight, year)

[1] 0.1688499

#BAD correlation - close to zero

What happens if we break it up by gender as in the last plot? Test it!

> cor(mean_weight[gender=="M"], year[gender=="M"])
[1] 0.9527568
> cor(mean_weight[gender=="F"], year[gender=="F"])
[1] 0.931249

# very, very correlated!

Important: Pearson correlation gives a “score”, not a significance value. Obviously, it is easier to get a good correlation if you only have a few values!

Often, we also want to TEST if the correlation is there. The null hypothesis is then 0 correlation. This is done with the cor.test() command

> cor.test (mean_weight, year )

Pearson's product-moment correlation

data:  mean_weight and year
t = 0.5934, df = 12, p-value = 0.5639
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 -0.3973253  0.6419208
sample estimates:
      cor 
0.1688499 

> cor.test (mean_weight[gender=="F"], year[gender=="F"] )

Pearson's product-moment correlation

data:  mean_weight[gender == "F"] and year[gender == "F"]
t = 5.7147, df = 5, p-value = 0.002293
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
 0.5965244 0.9900208
sample estimates:
     cor 
0.931249 

• Note that cor and cor.test are not the same thing

• cor gives a co-efficient – how well is the data described by a linear correlation

• cor.test tests if that co-efficient is 0 or not. If you have enough data, you don't need a nice cor value to get a small p-value. So cor.test does not test whether the correlation is linear or not

Making a simple linear model fit

• If there is a correlation, we often want to fit a simple model to it – the classical Y=kx+m

• R does this by the lm() function, which also can do a lot more complex things

Here, we need to use the formula construct that we used before, so

lm(y~x) # make a model where x describes the outcome y

This will return an lm "object", which holds all kinds of information - the model, how well it fitted the data, etc

Some caveats

Here, we deal with linear regression - one of the simplest usages of lm()

lm() can do a lot of other funky things that are “advanced manual”

Some necessary math

In linear regression, we are fitting the response (y) to the data (x) by a linear function

Y=kX+m

Where k is the “slope”, and m the intercept (high school math!)

How to make a linear regression between mean weight and years, only females

lm( year[gender=="F"] ~ mean_weight[gender=="F"] )

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Coefficients:
               (Intercept)  mean_weight[gender == "F"]  
                  1616.551                       6.297  

Y = kX + m: here the slope k is 6.297 and the intercept m is 1616.551

Plotting a fit

# first, plot the actual data:
plot(mean_weight[gender=="F"], year[gender=="F"], col="red")    # x, y
# then make the fit - note the switch of year and mean_weight!
my_model <- lm( year[gender=="F"] ~ mean_weight[gender=="F"] )  # y ~ x
abline(my_model, col="red")

Challenge…

We made this plot previously - weight vs year (females in red, males in blue). Now, make one linear fit for each gender and display them in the same figure (example on next page)

attach(d)
> plot(mean_weight, year)
> lines(mean_weight[gender=="F"], year[gender=="F"], col="red")  # will over-write plot
> lines(mean_weight[gender=="M"], year[gender=="M"], col="blue") # will over-write plot
> females_model <- lm( year[gender=="F"] ~ mean_weight[gender=="F"] )
> males_model <- lm( year[gender=="M"] ~ mean_weight[gender=="M"] )
> abline(females_model, col="orange")
> abline(males_model, col="cyan")

How good is a fit?

A whole science in itself! Here, we will focus on R2:

R = the Pearson correlation coefficient from before (-1 to +1)
R2 is just R squared (so, from 0 to 1).

We can access these, and a LOT of other things, by saying summary(my_fitting)

A horrible amount of data…

Call:
lm(formula = year[gender == "F"] ~ mean_weight[gender == "F"])

Residuals:
      1       2       3       4       5       6       7 
-2.3856  2.4661 -4.0161 -0.7568 -1.2754  0.7246  5.2432 

Coefficients:
                           Estimate Std. Error t value Pr(>|t|)    
(Intercept)                1616.551     66.311  24.378 2.17e-06 ***
mean_weight[gender == "F"]    6.297      1.102   5.715  0.00229 ** 
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 3.409 on 5 degrees of freedom
Multiple R-Squared: 0.8672,     Adjusted R-squared: 0.8407 
F-statistic: 32.66 on 1 and 5 DF,  p-value: 0.002293

Count statistics

• Often, we have ONLY categorical data, such as "number of genes of category Y"
• Typically, these are situations where you have a table where all the cells add up to all cases or experiments
• This is called a contingency table, or a matrix

Matrices

• A matrix is almost the same as a data frame - a table
• The only difference is that all data are of ONE DATATYPE - only numbers, only characters or only logicals
• Here, we will mostly deal with numbers
• How to create a matrix from some values:
  – matrix(c(2,10, 5, 8,7,4), nrow=2)   # nrow = the number of rows

> matrix(c(2,10, 5, 8,7,4), nrow=2)
     [,1] [,2] [,3]
[1,]    2    5    7
[2,]   10    8    4

By default, R fills in the matrix column by column…

> matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T)
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4

But it can do it by row also!

Want an extra row or column?

• The cbind() command adds an extra column to a matrix. Very handy at times! It can be understood as “bind a column to my matrix”. So, if we want to add another column to our matrix:

> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T) # just as before
> m
     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4
> m2<-cbind(m, c(9,7))
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7

Challenge

1)
• Figure out what the rbind() command does compared to cbind()
• Using these two commands, go from this matrix

     [,1] [,2] [,3]
[1,]    2   10    5
[2,]    8    7    4

• to this

     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5

2)
• If your final matrix is called m3, test the following:

rbind(m3, 1)
rbind(m3, 1:3)

• What is happening and why?

Albin’s try

> m2<-cbind(m, c(9,7)) # add the column = a vector of size 2
> m2
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
> m3<-rbind(m2, c(2,1,4,5)) # add a row - now of size 4, since we added a column above
> m3
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5

Cont…

> rbind(m3, 1)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    1    1    1
> rbind(m3, 1:3)
     [,1] [,2] [,3] [,4]
[1,]    2   10    5    9
[2,]    8    7    4    7
[3,]    2    1    4    5
[4,]    1    2    3    1
Warning message:
number of columns of result is not a multiple of vector length (arg 2) in: rbind(m3, 1:3)

A dangerous feature of many R commands - the vector you supply is reused if possible!! In example 1, it is used 4 times. In example 2, it is used 1.33 times (hence the warning)

Indexes in matrices

• Works exactly as in data frames!

• m2[x, y] gives the cell in the x row and y column

• m2[, y] gives the whole column y

• Etc…

We can also add annotation to the matrix…

It is often useful to give rows and columns names, similarly to a dataframe

> m<- matrix(c(2,10, 5, 8,7,4), nrow=2, byrow=T, dimnames = list(c("first_player", "second_player"), c("score1", "score2", "score3")))
> m
              score1 score2 score3
first_player       2     10      5
second_player      8      7      4

The list() construct we have not used so far - view it just as syntax at this stage. This matrix is called a 2x3 contingency table (2 rows, 3 columns)

2x2 contingency table - a common situation in biology

• Say that we have two types of genes, and two types of annotation features, for instance “related to cancer”. Is one of the groups more cancer-related?

                  Cancer-related   Not cancer-related
Genes of type A         40                 60
Genes of type B          6                  4

A (40%) vs B (60%)…but the B group is very small. Are the proportions significantly different - could this arise just by chance?

Important!

• Contingency tables must have MUTUALLY EXCLUSIVE cells!

• As in all other test, we are assuming that samples are independently drawn!

Fisher's Exact Test

• Tests "proportions" in a 2x2 table by permutations
• Actually, the null hypothesis is "no relation between rows and columns"

my_matrix<-matrix(c(40, 60, 6, 4), nrow=2, byrow=T)
> fisher.test(my_matrix)

        Fisher's Exact Test for Count Data

data:  my_matrix
p-value = 0.3152
alternative hypothesis: true odds ratio is not equal to 1
95 percent confidence interval:
 0.08719973 2.02538340
sample estimates:
odds ratio 
 0.4478183 

NOT significant…Maybe the size of the B set is too small?

• Test the following tables

                  Cancer-related   Not cancer-related
Genes of type A         40                 60
Genes of type B         60                 40

                  Cancer-related   Not cancer-related
Genes of type A        5100               4900
Genes of type B        4900               5900

• Results for the two tables above:

P = 0.00706 (first table: 40/60 vs 60/40)

P < 2.2E-16 (second table: 5100/4900 vs 4900/5900)

Bonus take-home message

• Significance is due to
  – The actual signal strength (like 60% vs 40%)
  – How much data you have
• So, you can get significance by
  – Having a big signal on small data
  – Having a weak signal but lots of data
  – Or some combination
• This is an issue with genome-wide data, where you often have TONS of data

Going from typical vectors to contingency tables

• Say that we have typical data files with categorical data ("female", "male")

Gender  Car_owner
M       N
F       Y
M       Y
F       Y
M       N
M       Y
M       Y

• We often want to reshape this into a contingency table: how many males have a car, how many do not, etc

            Owns car (Y)   Does not own car (N)
Female (F)        2                  0
Male (M)          3                  2

The table() function

…makes this very easy

> gender<-c("M", "F", "M", "F", "M", "M", "M")
> car_owner<-c("N", "Y", "Y", "Y", "N", "Y", "Y")
> table(gender, car_owner)   # important: gender and car_owner should not be continuous values
      car_owner
gender N Y
     F 0 2
     M 2 3
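Since the result of table() is itself a contingency table, it can be fed straight into the test from before – a small sketch (with counts this small, the chi-square test below would complain):

fisher.test(table(gender, car_owner))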

The chi-square test

• chisq.test()
• Similar to the Fisher exact test, but less computationally intensive: we can use it for a matrix of any size
• Sensitive to small counts, especially zeroes – will warn about this
• Almost identical in usage to fisher.test()

Let’s add some dimensions…

Is the blood group distribution significantly different between Europeans and Africans?

Test this with both fisher.test and chisq.test

Blood group   Africa   Europe
A               58       90
B               34       16
AB               9        8
0              100       86

# horrible mistake ahead… can you spot it?
> a<-matrix(c(58, 34, 9, 100, 90, 16, 8, 86), nrow=2)
> chisq.test(a)

        Pearson's Chi-squared test

data:  a
X-squared = 194.352, df = 3, p-value < 2.2e-16

# nrow=2 arranged the counts in the wrong cells – we need four rows, one per blood group:
> a<-matrix(c(58, 34, 9, 100, 90, 16, 8, 86), ncol=2)
> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86
> chisq.test(a)

        Pearson's Chi-squared test

data:  a
X-squared = 14.5091, df = 3, p-value = 0.002288

> fisher.test(a)

        Fisher's Exact Test for Count Data

data:  a
p-value = 0.002048
alternative hypothesis: two.sided

When to use Fisher/Chi?

• The Fisher exact test is "exact" – it tests all possible permutations, while chi-square is an approximation
• Chi-square has problems with 0-count bins
• Chi-square is less computationally demanding – Fisher will run out of memory if the data set is too large

Some handy functions on matrices

apply()

Applies a function to all rows or columns of a matrix. It works like this:

apply(X, MARGIN, FUN, ...)

X is the matrix
MARGIN selects the dimension (1 = rows, 2 = columns, c(1,2) = both)
FUN is the function, like "mean"

Some examples:

> a
     [,1] [,2]
[1,]   58   90
[2,]   34   16
[3,]    9    8
[4,]  100   86

> apply(a, 2, sum)   # column-wise
[1] 201 200

> apply(a, 1, sum)   # row-wise
[1] 148  50  17 186

> apply(a, 2, sort)
     [,1] [,2]
[1,]    9    8
[2,]   34   16
[3,]   58   86
[4,]  100   90

> apply(a, 1, sort)
     [,1] [,2] [,3] [,4]
[1,]   58   16    8   86
[2,]   90   34    9  100

Note that in the sort examples we lost the pairing between the columns – and apply() over rows returns its results as columns!
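Since FUN can be any function that operates on a vector, mean works just as well – a quick extra sketch:

> apply(a, 2, mean)   # mean of each column
[1] 50.25 50.00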

Writing custom functions in R

• As you have found out, sometimes we have to write a lot in R to get things done

• We often save the things we write in a separate file and copy-paste from it

• Sometimes this gets awkward – especially if you are repeating something very many times with small changes

Example 1

• Say that we have 30 experiments, where each is a vector in a data frame

• We want to have the mean, max and min value of each of those, and plot the distribution of the values

• That's a LOT of typing
• Instead, we'll write a function that does this for us – a sketch of the goal is shown below
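A minimal sketch of such a helper, just as a taste (the name summarize_experiment is made up for illustration; we build up the needed machinery step by step next):

summarize_experiment <- function(v) {
    hist(v)                                   # plot the distribution of the values
    c(mean=mean(v), max=max(v), min=min(v))   # the summary numbers (returned)
}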

• Defining your own function in R
  – Let's make a function that calculates x^2 + x + 1 for any given x

my_function<- function(x) {   # "my_function" is the function name;
                              # function(x) tells R that we define a
                              # function with one argument, x
    x^2 + x + 1               # the "code" of the function – anything within the { }
}

# define it
my_function<- function(x) {
    x^2 + x + 1
}

# run it
my_function(10)
[1] 111
my_function(5)
[1] 31

Challenges

1) Remake the function so that it calculates x^3 + x^2 + x + 1
2) Make a new function that calculates the sum 1+2+3+… up to x
3) Remake this function to calculate the sum from y to x (where y also is an input)

> my_function<- function(x) { x^3 + x^2 + x + 1 }
> # run it
> my_function(10)
[1] 1111

> my_function<- function(x) { sum(1:x) }
> # run it
> my_function(10)
[1] 55

> my_function<- function(x,y) { sum(y:x) }
> # run it
> my_function(5,10)
[1] 45

# with positional matching, argument order matters:
> my_function<- function(y, x) { sum(y:x) }
> my_function(10)
Error in my_function(10) : element 1 is empty;
  the part of the args list of ':' being evaluated was: (y, x)
> my_function(5,10)
[1] 45
> my_function(9,10)
[1] 19

Arguments• Sometimes we want default inputs• We can set defaults to arguments, like this

> my_function<- function(y=5, x=10){
      sum(y:x)
  }
> my_function()
[1] 45
> my_function(5,10)
[1] 45
> my_function(5,11)
[1] 56
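A small extra sketch: naming an argument lets us override just one of the defaults:

> my_function(x=20)   # y keeps its default of 5
[1] 200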

Iterators/loops

Often, we want to “walk over” a set of numbers, a set of vectors, and repeat an analysis

Like applying the mean_and_max function to 30 experiments, one by one

Or, in the flowers.txt file, to plot all columns, each in a separate picture

The for() loop

Best described by example – this will print all values in the vector 1:5

for(i in 1:5){

print(i)

}

[1] 1

[1] 2

[1] 3

[1] 4

[1] 5
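Combining for() with plotting, the flowers.txt task from before could look roughly like this (a sketch, assuming the first four columns of flowers.txt are the numeric measurements):

my_data<-read.table("flowers.txt", h=T)
for(i in 1:4){
    hist(my_data[, i], main=names(my_data)[i])   # one picture per column
}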

Challenge

• Rewrite this program to write out the square root of each value in the vector x

my_function<-function(x){

for(i in x){

print(i)

}

}

my_function<-function(x){

for(i in x){

print(sqrt(i))

}

}

Return values

Most functions in R return a value (or something more complex) instead of writing it on the screen.

There are two ways of doing this

1) By default, R will return the last statement evaluated

2) We can instead force returning something by saying return(whatever). If R returns something, it will also end the function (!)

Examples

my_function<- function(x){
    x^2 + x + 1
}

# run it
my_function(10)
[1] 111
my_function(5)
[1] 31

my_function<- function(x) {
    my_val <- x^2 + x + 1
    return(my_val)
}

# run it
my_function(10)
[1] 111
my_function(5)
[1] 31
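To see point 2 in action – return() ends the function immediately – here is a small sketch:

my_function<- function(x) {
    return(x^2)
    x + 1000   # never reached: return() already ended the function
}
my_function(10)
[1] 100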

Hardest challenge so far...

• Rewrite this program to RETURN – not print - a vector which is the square root values of x (where x also is a vector)

my_function<-function(x){

for(i in x){

print(sqrt(i))

}

}

#look at next slide before continuing

Why is this the wrong solution?

my_function<-function(x){

for(i in x){

return(sqrt(i))

}

}

my_function(c(2,3,4))

[1] 1.414214   # we should get three values – return() ended the function after the first element


We need to somehow build a new vector, and fill that in with the square roots as the for loop progresses

Hint: we can append some value to a vector by saying

my_vector<-c(my_vector, somevalue)

my_function<-function(x){
    my_vector<-c()   # empty vector
    for(i in x){
        my_value<-sqrt(i)
        # append this to the vector, and repeat
        my_vector<-c(my_vector, my_value)
    }
    return(my_vector)
}

> my_function(c(2,3,4))
[1] 1.414214 1.732051 2.000000
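Side note: sqrt(), like many R functions, is already vectorized, so for this particular task the loop could be skipped entirely:

> sqrt(c(2,3,4))
[1] 1.414214 1.732051 2.000000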

Extensions to R

• R has many "add-on" packages for specific analyses

• Some of these you already have, but they are not accessible unless you “ask” for them

• Others you need to download

• Bioconductor, for array analysis and more, is such an add-on. We will use this later in the course

Testing out Lattice graphics

• A different type of plotting method: basically a "conditional plot"

• It is a rewrite of R's standard plot() command – called xyplot()

Testing…

my_data<-read.table("flowers.txt", h=T)
names(my_data)
attach(my_data)
# complex command ahead!
xyplot(Sepal.Length~Sepal.Width|Species)

# won't work! Why?

Need to "activate" the library first:

library(lattice)
xyplot(Sepal.Length~Sepal.Width|Species)

help(package="lattice") will give an overview

xyplot(Sepal.Length~Sepal.Width|Species)

is interpreted as: plot Sepal.Length as a function of Sepal.Width, broken up by Species

Free-form exercise/homework

Using the reference sheet and exploring the drop-down menus in R, figure out how to get, install and run the barplot2() command from the gplots package (it is NOT on your computer yet)
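If you get stuck, one possible route is sketched here (finding it yourself is the point of the exercise):

install.packages("gplots")   # fetch the package from CRAN
library(gplots)              # activate it
help(barplot2)               # read the manual page, then try it out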

Now what?

• With the commands and concepts we have looked at so far, you can do quite a lot
• But we have only scratched the R surface
• Best to view the content so far as a "driving license": now you are capable of exploring by yourself

• We will now go over to use R in a genomics context
