
Microarray data analysis, module IV Spring 2009

Exercises in basic statistics using R 

(with suggestions to solutions)

1 Introduction

1.1 What is R?

 R is not simply a program but actually a complete programming environment,

specifically tailored to suit the needs of statisticians with skills in computer 

 programming. Although this means that it may not be particularly user friendly, it

makes R extremely flexible, expandable and sharable. Users can write new functions

and distribute the code to other users, who can install these packages and run them on their

machines. As R is very popular in the scientific community that deals with

problems that can be solved computationally, many of the latest and most

sophisticated algorithms are first made available as R code.

 Bioconductor is a development project, involving many contributors, that aims at

developing extension packages for the analysis of genomic data, with special

emphasis on DNA microarray data analysis. The latest version of Bioconductor, 2.3,

was released in October 2008 and contains almost 300 different analysis packages,

more than 60 example data sets and almost 400 annotation packages.

1.2 Writing conventions used throughout practical exercises

Pull-down menus are referred to within brackets [ ] and in boldface. The highest

menu level is spelled with CAPITAL letters, whereas sub-levels will only have the

first letter capitalized. Here is an example of a top-level menu:

[FILE]

Here is one sub-level menu within the above menu:

[Source R code...]

R code is shown in red colour with Courier New font and is NOT preceded by

the > sign, which is the prompt used in R to show where the next input is going to be,

in order to facilitate pasting into the editor or console window for execution. If a

command call extends over several lines subsequent lines will be indented by one tab.

Output from R is similarly displayed but coloured blue. Here is one example of how it

may look:

a <- 1
a
[1] 1

Note that throughout the text, the terms command and function will be used synonymously.


1.3 Downloading and installing R 

Note: All the computers in the computer class of Biomedicum have R version 2.8.1,

as well as Bioconductor release 2.3, already installed so there is no need to take any

measures in this regard. The following information is given in case you'd like to

install R on your own computer.

The home page for R is http://www.r-project.org, where compiled

versions are freely available for Windows, Mac OS X and Linux. It is also possible to

obtain the source code for porting to other systems. The base level of installation

includes plenty, but not all, of the functionality that is available. Specialized packages,

like the Bioconductor suite, can be downloaded from within the R software once the

 base installation is in place. Consult the R home page for detailed instructions

regarding the downloading and installation of the software for the particular platform

you intend to use. A useful note for Windows users is that the installation program

needs to be run with administrator privileges and the R folder has to be set up so that

all users have read/write/execution permission. This will ensure that new packages

can be added, when their use is required, by any user.

1.4 Running R 

 R can be run in a command line mode (for special applications and/or advanced users)

or through a much more user-friendly graphical user interface (GUI). To start in GUI

mode simply double-click the R 2.8.1 icon on the desktop, which will bring up the

R console window shown in figure 1.

Figure 1: Starting R in GUI mode.


The cursor by the > prompt is where R code is typed in. R output that can be

displayed as text will be shown in the same window as used for input, whereas graphs

and images will be displayed in a new window, a so-called graphics device (figure 2).

The contents of the graph window can be saved or copied to the clipboard as Windows

metafile or in bitmap format using the [Save as] or [Copy to the clipboard] options

found under the [FILE] menu. It is possible to have many graphical devices open

simultaneously. Creating new graphical devices and switching between multiple ones

is accomplished using the windows() and dev.set() functions.

All the R code that has been entered during an R session can also be saved, and later

be re-loaded into memory, using the [Save history...] and [Load history...] options of

the [FILE] menu. Similarly, any data that has been saved into a symbolic variable can

also be saved and re-loaded but using the [Save Workspace] and [Load Workspace]

options instead, again found under the [FILE] menu.

Figure 2: Both input and output of text takes place in the console window, whereas graphs and images

are directed and shown in a graphics device window.

1.5 Installing additional packages

All the specialized packages that are contributed by the authors of R are stored in so-called

repositories on the Internet. At present there are four different repositories,

including one solely dedicated to the storage of Bioconductor packages. To select

from which repository to download packages one can either use the


setRepositories() command or open the [PACKAGES] menu and choose

the [Select repositories...] option. The packages to be downloaded and installed can

then be specified either through the install.packages() command or the

[Install package(s)...] option of the [PACKAGES] menu. Before being shown the

list of packages that are available for download, the user is prompted to select the

mirror site from which to download the packages. Generally, it is advantageous to

select a site as close to where you are as possible. Upon selecting a mirror site and the

desired packages R will monitor the downloading and installation procedure through

status messages echoed to the terminal window.

1.6 Loading packages into memory

In order not to use up all the internal memory of your computer, R does not load

extension packages into its memory by default upon starting. It is hence up to the user

to load packages when needed, which is done using either the library()

function or the [Load package...] option of the [PACKAGES] menu.
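
As a minimal sketch of loading a package from the console, the lines below attach the tools package, which ships with the base distribution but is not attached at start-up, and then check the search path. The choice of tools is only an assumption made so that the example runs without a download; a contributed package such as ISwR should load the same way once it has been installed.

```r
# 'tools' ships with R but is not attached at start-up (chosen only for illustration)
library(tools)

# search() lists everything that is currently attached
loaded <- "package:tools" %in% search()
print(loaded)

# A contributed package must be installed once before library() can find it:
# install.packages("ISwR")
# library(ISwR)
```

The two commented lines are left inactive on purpose, since installing requires an Internet connection and a chosen mirror site.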

1.7 Working directory

By default R is set up to use the installation folder as the working directory.

This directory is where R stores setup preferences, looks for files to be imported and

saves files that are exported. Changing the working directory is done through the

[Change dir...] option of the [FILE] menu.
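
The working directory can also be inspected and changed from the console with getwd() and setwd(). The sketch below is only an illustration: it uses the temporary folder of the R session as a stand-in for your own data folder and restores the original directory at the end.

```r
# Remember the current working directory so it can be restored afterwards
old_dir <- getwd()

# Switch to the temporary folder of this R session (a stand-in for your own folder)
setwd(tempdir())
new_dir <- getwd()
print(new_dir)

# Switch back to where we started
setwd(old_dir)
```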

1.8 Getting help

Perhaps the most used function in R is the help() or ? command. If you do not

know the exact command name you can look for help using the apropos() function

instead, which lists all functions that include the specified text in their names. More

extensive help, including keyword search and links to R user and R reference manuals

can be found in the [HELP] pull-down menu. Within the help files there are usually

one or more examples of how the particular command can be used in practice. If you

want to run the examples yourself, without having to enter the R code by hand in the

console window, there is a very handy function called example() that will do it for 

you. Simply give the name of the function, or package, you are interested in as

argument and all the examples associated with it will be run and the results displayed in

the console window (for textual outputs) or in a graphical device (for graphical

output). Setting the ask argument to TRUE will require the user to hit a key before

a graphical output overwrites a previous output in the graphical device.
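
A short sketch of the help functions described above. The search text "read" and the topic mean are arbitrary choices made for illustration, and the example assumes that the help files of the base packages are installed.

```r
# List every visible object whose name contains the text "read"
hits <- apropos("read")
print(hits)

# The help page of one of the hits could now be opened with
# help("read.table") or, equivalently, ?read.table

# Run the examples from the help page of mean() without pausing between outputs
example("mean", ask = FALSE)
```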


2 Exercises - input and output of data

2.1 Symbolic variables

Data in the form of single values, as well as more complex data structures (see

section 2.2 below), can be saved in the internal memory of R for later use by

assignment to symbolic variables. The "assignment" operator in R is spelled <-, the

arrowhead pointing towards the variable name and the tail towards the data to be

saved. To create a variable pi that contains the value 3.14 simply type:

 pi <- 3.14

The value of a variable can be displayed by evaluating it, i.e. by typing in its name

and pressing Enter:

pi
[1] 3.14

Use the newly created variable to calculate the area of a circle with a diameter of 5

cm and store the results in the variable area. Print the result.

Hint: the area of a circle is pi*r^2

area <- (5/2)*(5/2)*pi
area
[1] 19.625

As you see R can be used simply as a calculator (a very powerful one) but as you'll

notice in subsequent exercises it can do so much more than that.

To manage symbolic variables R provides a number of functions, of which perhaps

the most widely used are the ones for listing all variables in memory and for removing

them. Listing of variables can be done either by the objects() command or, as

Unix/Linux users will know, by the ls() command. Most (if not all) of the

commands in R will take some input in the form of arguments, which are to be put

within the parentheses attached to the command. Note that even if you do not wish to

 provide any arguments with the command you still need to add the parentheses, just

without contents. Displaying the current list of objects loaded in memory so far should yield:

objects()
[1] "area" "pi"

Objects can be removed using the rm() function, giving the name of the variable to be

removed as argument (or a list of names in a vector).

• Try to remove the object area from the memory.

Hint: Use the help() or ? function to find out the details on how to use the rm() function


rm("area")

How would you remove all variables in memory in one go?

Hint: feed the output of the objects() function to the list argument of the rm() command

rm(list=objects())
objects()
character(0)

2.2 Loading more complex data into R 

Series of data can be loaded into memory by input from the command line with the

c() or the scan() commands, read from an external text file using read.table(),

input through the built-in data editor using edit() or loaded

from the internal data repositories by the data() command. Note: the data editor is

devised to work with two-dimensional data sets only. There are also a couple of very

useful functions, seq() and rep() for creating regular or repeated sequences of 

numbers.
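
A brief sketch of seq() and rep(); the numbers are arbitrary and chosen only to show the most common arguments (by, length.out, times and each).

```r
# Regular sequence from 1 to 10 in steps of 3
s1 <- seq(1, 10, by = 3)           # 1 4 7 10

# Regular sequence defined by its length instead of its step
s2 <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

# Repeat a whole vector, or each element in turn
r1 <- rep(c(1, 2), times = 3)      # 1 2 1 2 1 2
r2 <- rep(c("a", "b"), each = 2)   # "a" "a" "b" "b"

print(s1); print(s2); print(r1); print(r2)
```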

Try to create two variables, list1 and list2, using the first two methods mentioned

above, that contain the following values:

0 2 4 6 8 10 12 14 16 18 20

list1 <- c(0,2,4,6,8,10,12,14,16,18,20)
list2 <- scan()
1: 0 2 4 6 8 10 12 14 16 18 20
12:
Read 11 items

How would you use the seq() function to produce the same series of data in an

automated way?

list3 <- seq(0,20,2)

Now let's have a look at how the same data could be imported from an external data file.

In order to do that we need to first create that file, which can easily be done in a text

editor, such as Notepad, or a spreadsheet program, like Excel.

Use either Notepad or Excel to generate a file with the same data as above and use

the read.table() command to get the contents of the file into R.

Hint 1: make sure the file is saved as text

Hint 2: make sure the file is saved in the working directory of  R, or change the

working directory according to where you save the file


list4 <- read.table(file="file.txt")

2.3 Exporting data from R

In order to output contents of variables, or analysis results, one can use either the

write() or the sink() command.

Explore how they work by creating output files of the variables you just created.

write(list1, file="list1.txt")
etc.

sink(file="lists.txt")
list1
list2
list3
list4
sink()

What advantage does sink() provide over the write() command?

Allows the collection of the output from numerous variables, or even functions, into a

single file.

For more complex data structures, like tables, there are more specialized export

functions. write.table() is a general function for writing tabular data to files,

with argument options to suit a wide variety of layout and delimiter formats.
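
As an illustration of write.table(), the sketch below writes a small table to a tab-delimited text file and reads it back in. The table contents and the file name results.txt are made up for this example, and the file is placed in the temporary folder of the session rather than in the working directory.

```r
# Small made-up table (hypothetical values, for illustration only)
results <- data.frame(sample = c("s1", "s2", "s3"),
                      value  = c(1.5, 2.0, 3.25))

# Write it as a tab-delimited text file with a header row and no row names
out_file <- file.path(tempdir(), "results.txt")
write.table(results, file = out_file, sep = "\t",
            quote = FALSE, row.names = FALSE)

# Read the file back in to confirm that the layout is as intended
check <- read.table(out_file, header = TRUE, sep = "\t")
print(check)
```

Other delimiters can be selected through the sep argument, for instance sep="," for comma-separated files.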


3 Exercises - descriptive statistics

3.1 Checking the distribution of data - histograms

There are many functions available in the base version of R for displaying and

exploring the distribution or other characteristics of data. Let's have a closer look at

some of the most useful by applying them to some data.

As most statistical tests that are commonly used assume the data to follow a normal

distribution it might be informative to see what normally distributed data look like.

Using the rnorm() function one can create normally distributed series of data of any

desired size and properties.
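
Because rnorm() draws new random numbers on every call, the plots in the following exercises will look slightly different each time they are made. The sketch below shows one way to make a random series reproducible with set.seed(); this function is not part of the original exercise and the seed value 42 is arbitrary.

```r
# Fix the state of the random number generator (the seed 42 is arbitrary)
set.seed(42)
x1 <- rnorm(20, mean = 100, sd = 25)

# Resetting the same seed reproduces exactly the same values
set.seed(42)
x2 <- rnorm(20, mean = 100, sd = 25)
print(identical(x1, x2))

# The sample mean and standard deviation only approximate the requested ones
print(c(mean = mean(x1), sd = sd(x1)))
```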

• Create a data set x consisting of 20 values, with a mean of 100 and a standard

deviation of 25, and use the hist() command to plot a frequency distribution of the

data.

x <- rnorm(20,mean=100,sd=25)
hist(x)

• How can one achieve a plot that shows the density, or proportion, instead of the

absolute counts on the Y-axis?

hist(x,freq=FALSE)

In order to see how far from the theoretical normal distribution the empirical data is,

R provides the dnorm() function that generates probability densities for normally

distributed data with user-determined properties. In combination with the curve()

function it is possible to overlay these theoretical densities on the histogram. Try the

following code, which should achieve a graph similar to the one shown in figure 3:

hist(x, col="blue", density=4, angle=60, freq=FALSE,
    xlim=c(0,200), main="Example of a histogram\nwith a normal curve overlaid")

curve(dnorm(x,mean=100,sd=25),col="red", add=TRUE)

These simple lines of code are a good example of the power and flexibility of R,

letting us create and customize graphs to our liking. Using the main argument in the

hist() call has allowed the addition of a title on the top of the graph, the "\n"

escape sequence inside the text indicating a line break so that the title is split into two

lines. The col argument is used to specify the colour of the plotted object, whereas

setting the add argument to TRUE has forced the normal curve to be plotted on top of

the histogram, rather than generating a new plot as is done by default. Finally,

the density and angle arguments have been used to control the way that the

histogram bars are filled with angled lines. To get a more complete list of all options

to control the way things are plotted, please take a look at the help page for the

generic plot() function.


[Figure 3 image not reproduced: a plot titled "Example of a histogram with a normal curve overlaid", with x from 0 to 200 on the horizontal axis and densities from 0.000 to 0.015 on the vertical axis.]

Figure 3: An example of how different plot elements can be

combined into the same graph using the add argument.

• Repeat the whole procedure for smaller and larger random series of normally

distributed data and see how it affects the resemblance to the theoretical distribution.

Also, experiment with the breaks argument of the hist() function to achieve the

most appropriate output for any given size of data series.

x <- rnorm(1000,mean=100,sd=25)
hist(x, freq=FALSE, breaks=20)
curve(dnorm(x,mean=100,sd=25),add=TRUE)

3.2 Checking the distribution of data - boxplots

A convenient way to summarise the properties of the data is to display it in a boxplot

using the boxplot() function. Apply the function to the largest of the series that

you used for the histogram.

boxplot(x)

• What additional information does the boxplot provide over the histogram?

The median, the quartiles (and thus the interquartile range) and any outliers are clearly shown.

If one wants to make a comparison of the distribution properties between two

different variables it can be useful to plot them side by side in the graph window. The

par() function gives the user total control over the graph window layout, including

the option to partition it into virtually any arrangement of panels. Use the mfrow


argument to split the window into two panels. Create a data set with skewed

distribution by log-transforming the variable and plot boxplots of it and its

transformation side by side.

lx <- log(x)

par(mfrow=c(1,2))
boxplot(x, main="Normal")
boxplot(lx, main="Log-transformed")
par(mfrow=c(1,1))

The last line of code resets the plotting window so that future plots will again fill the

whole screen.

3.3 Checking the distribution of data - quantile-quantile plots

Perhaps the most revealing of the descriptive plots, when it comes to assessing

deviations from normality, is the quantile-quantile plot. It is constructed so that the

quantiles of the empirical data are plotted against the corresponding quantiles of the theoretical distribution.

The more the resulting curve deviates from the 45-degree diagonal the further the

empirical data is from the normal distribution.

• Use the qqnorm() function to create the plot for both the variables used in the

 boxplot example.

qqnorm(x)

Notice that the theoretical distribution is standardised around the mean of 0 and a

standard deviation of 1.

• How can you scale the empirical data to match the theoretical, thus enabling a more

meaningful assessment of the degree of normality?

Hint: the centre of the empirical distribution is shifted horizontally by the mean and

its width is inflated by a factor equal to the standard deviation.

scaledx <- (x-mean(x))/sd(x)
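
A minimal sketch of the effect of this scaling, assuming a simulated series like the one used for the histogram (the seed is arbitrary and only there for reproducibility). It also shows that the built-in scale() function performs the same standardisation.

```r
set.seed(1)                       # arbitrary seed, only for reproducibility
x <- rnorm(1000, mean = 100, sd = 25)

# Centre on the sample mean and divide by the sample standard deviation
scaledx <- (x - mean(x)) / sd(x)

# The built-in scale() function performs the same standardisation
same <- all.equal(scaledx, as.vector(scale(x)))
print(same)

# The standardised data should now lie close to the 45-degree diagonal
qqnorm(scaledx)
abline(0, 1, col = "blue", lwd = 2)
```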

The diagonal representing perfect correlation between the two distributions can be

added using the qqline() command.

• Add the line to the plot using the qqline() command and try to specify

arguments that will cause it to be drawn 4 times thicker than normal in a dot-dash

fashion and in green colour.

qqline(x, lty="dotdash", lwd=4, col="green")

The line representing perfect agreement can also be added using the abline() function,

specifying a line with intercept=0 and a slope=1:


abline(0,1,col="blue", lwd=2)

• Compare this line with the one output by the qqline() function. Do you notice

something strange, and if so, why does it occur?

Hint: Check the help page of qqline().

The line drawn by qqline() is forced through the first and third quartiles of the empirical data,

so it follows the scale of the data rather than that of the theoretical distribution.
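
According to its help page, qqline() by default draws the line through the first and third quartiles. The sketch below reconstructs that line by hand for a simulated series (an assumption made for illustration, with an arbitrary seed), which shows why it differs from abline(0,1) when the data have not been scaled.

```r
set.seed(1)                        # arbitrary seed, only for reproducibility
x <- rnorm(1000, mean = 100, sd = 25)

# First and third quartiles of the data and of the standard normal distribution
y_q <- quantile(x, c(0.25, 0.75))
x_q <- qnorm(c(0.25, 0.75))

# Straight line through the two quartile points
slope     <- unname(diff(y_q) / diff(x_q))
intercept <- unname(y_q[1] - slope * x_q[1])
print(c(intercept = intercept, slope = slope))
```

For unscaled data the intercept ends up near the mean of the data and the slope near its standard deviation, far from the intercept of 0 and slope of 1 used in the abline() call.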


4 Exercises - statistical tests

The base version of R includes functions to perform both parametric and non-parametric

tests. It is possible to test for difference between a group mean and a fixed

value (one sample test) or test for difference between means of two (or more) separate

groups (two-sample test). To see how they work we first need some real-life

experimental data, like these measurements on energy intake in kJ on 11 women

reported by Altman et al. (1991).

5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770

• Store them in an appropriate variable and report some summary statistics using the

mean(), sd(), quantile() and finally the summary() functions.

intake <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
mean(intake)
[1] 6753.636
sd(intake)
[1] 1142.123
quantile(intake)
  0%  25%  50%  75% 100%
5260 5910 6515 7515 8770
summary(intake)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5260    5910    6515    6754    7515    8770

4.1 One-sample tests

• Now use the t.test() function to answer the question whether the average

energy intake for these women deviates significantly (at a 5% significance level) from

the recommended value of 7725 kJ.

t.test(intake,mu=7725)

One Sample t-test

data: intake
t = -2.8208, df = 10, p-value = 0.01814
alternative hypothesis: true mean is not equal to 7725
95 percent confidence interval:
5986.348 7520.925
sample estimates:
mean of x
6753.636

• How do you interpret the confidence interval in the reported results?


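With 95% confidence the true value of the mean lies somewhere between the confidence limits. As the recommended value of 7725 kJ falls outside this interval, the difference is significant at the 5% level.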
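
To see where the numbers in the t.test() output come from, the statistic, the p-value and the confidence interval can be recomputed by hand. This is only a sketch of the standard one-sample t-test formulas; the variable names are made up for the illustration.

```r
intake <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
mu0 <- 7725                                  # recommended intake in kJ

n  <- length(intake)
se <- sd(intake) / sqrt(n)                   # standard error of the mean

t_stat  <- (mean(intake) - mu0) / se         # test statistic
p_value <- 2 * pt(-abs(t_stat), df = n - 1)  # two-sided p-value

# 95% confidence interval: mean +/- t quantile times standard error
ci <- mean(intake) + c(-1, 1) * qt(0.975, df = n - 1) * se

print(round(c(t = t_stat, p = p_value), 4))
print(round(ci, 3))
```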

• Now perform a similar but one-sided test to answer whether the average energy

intake for the group is significantly lower than the recommended intake of 7725 kJ.

t.test(intake,mu=7725, alternative="less")

One Sample t-test

data: intake
t = -2.8208, df = 10, p-value = 0.009069
alternative hypothesis: true mean is less than 7725
95 percent confidence interval:
-Inf 7377.781
sample estimates:
mean of x
6753.636

• What happened to the significance of the test and why?

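The difference became more significant: the p-value is halved (0.009069 instead of 0.01814) because

in a one-sided test the whole 5% rejection region lies in the single tail of the distribution that is being tested.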
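
The relation between the two p-values can be checked directly. The sketch below extracts the p-values from the test results and compares the one-sided value with the lower-tail probability of the t statistic; the variable names are chosen for this illustration only.

```r
intake <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)

two_sided <- t.test(intake, mu = 7725)$p.value
one_sided <- t.test(intake, mu = 7725, alternative = "less")$p.value

# The one-sided p-value equals the lower-tail probability of the t statistic
t_stat     <- unname(t.test(intake, mu = 7725)$statistic)
lower_tail <- pt(t_stat, df = length(intake) - 1)

print(c(two_sided = two_sided, one_sided = one_sided, lower_tail = lower_tail))
```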

• Try the non-parametric version of the t-test on the same data using the

wilcox.test() function.

wilcox.test(intake,mu=7725)

Wilcoxon signed rank test with continuity correction

data: intake
V = 8, p-value = 0.0293
alternative hypothesis: true location is not equal to 7725

Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(intake, mu = 7725)

• How did this affect the p-values and what does it indicate about the sensitivity of the

non-parametric compared to the parametric tests?

The p-value increased (0.0293 compared to 0.01814), implying that the non-parametric test is slightly less sensitive.

• Did you notice something strange in the results?

A warning was given due to ties occurring in the data.

• When would it not be an ideal choice to use non-parametric tests?


1. when there are many ties in the data

2. when there are very few (fewer than 6) data points

4.2 Two-sample tests

Let's load one of the built-in data sets to illustrate the tests for difference between

two groups. There is a data set from the same report by Altman et al. that is included

in the ISwR package, which includes daily energy expenditure for a group of lean

and a group of obese women. To load the data into the working memory of R we

first need to install the ISwR package to the internal library of packages, as it is not a

part of the base distribution of R. This is done by following these steps:

1. Open the [Packages->Set CRAN mirror...] menu and select a download site

located close to Finland.

2. Open the [Packages->Install package(s)...] menu and select the ISwR package

from the pull-down menu.

3. Open the [Packages->Load package...] menu and select the ISwR package from

the pull-down menu.

4. Type data(energy) at the command line to prepare the data.

The data is now stored in the variable energy, but it is not quite ready to be analysed just yet.

If you display the contents of the energy variable you'll notice that it contains

two columns with headers, one for the energy expenditure and one for the stature. In

order to access the contents of the columns in a convenient way we need to make the

variable a part of the R search path. We do that by typing:

attach(energy)

We can now access the expenditure and stature data simply by their header names,

which finally allows us to move on to the actual statistical test.
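
Attaching a data set is convenient in a short session, but attached columns can be confused with other variables of the same name. As a hedged alternative, the sketch below reaches the columns through with(), the $ operator and the data argument of the formula interface. The data frame toy and its values are made up for illustration and are NOT the ISwR energy data.

```r
# Made-up values (hypothetical, NOT the ISwR energy data)
toy <- data.frame(
  expend  = c(7.5, 8.1, 9.2, 10.4, 11.0, 9.8),
  stature = factor(c("lean", "lean", "lean", "obese", "obese", "obese"))
)

# with() evaluates an expression inside the data frame without attaching it
group_means <- with(toy, tapply(expend, stature, mean))
print(group_means)

# A single column can also be addressed with the $ operator
print(mean(toy$expend))

# Functions with a formula interface accept the data frame directly
res <- t.test(expend ~ stature, data = toy)
print(res$p.value)
```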

• Use the formula interface in the argument list of t.test() to answer whether there is a statistically significant difference in the average energy expenditure between

lean and obese women.

t.test(expend~stature)

Welch Two Sample t-test

data: expend by stature
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.459167 -1.004081
sample estimates:
mean in group lean mean in group obese
8.066154 10.297778

• Repeat the test with the modification that the variances in the two groups are assumed to be equal.

t.test(expend~stature,var.equal=TRUE)

Two Sample t-test

data: expend by stature
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
-3.411451 -1.051796
sample estimates:
mean in group lean mean in group obese
8.066154 10.297778

• What effect can be observed on the sensitivity of the test and why is it so?

Sensitivity increases slightly (p-value 0.000799 instead of 0.001411), as the variances of the two groups are pooled into

a single estimate, which gives the test more degrees of freedom (20 instead of 15.919).

• Can you figure out an experimental design where it is reasonable to assume that

variances between groups are equal?

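A typical situation is when subjects drawn from the same population are randomly allocated to two

experimental conditions, like patients given either a treatment or a placebo. (If the same subjects are measured under both conditions, like a group of patients before and after treatment, a paired test is the natural choice.)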

• Now repeat the test using the non-parametric wilcox.test() instead.

wilcox.test(expend~stature)

Wilcoxon rank sum test with continuity correction

data: expend by stature
W = 12, p-value = 0.002122
alternative hypothesis: true location shift is not equal to 0

Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(x = c(7.53, 7.48, 8.08, 8.09, 10.15, 8.4,


4.3 Testing for equality of variances

If you are unsure whether it is appropriate to pick the more sensitive tests, which

assume equal variances between groups, you may wish to formally test whether the

variances of two groups being compared are significantly different using either the

var.test() or bartlett.test() functions.

• Try them out on the energy data set.

var.test(expend~stature)

F test to compare two variances

data: expend by stature

F = 0.7844, num df = 12, denom df = 8, p-value = 0.6797

alternative hypothesis: true ratio of variances is not

equal to 1

95 percent confidence interval:

0.1867876 2.7547991

sample estimates:

ratio of variances

0.784446

bartlett.test(expend~stature)

Bartlett test of homogeneity of variances

data: expend by stature
Bartlett's K-squared = 0.1362, df = 1, p-value = 0.712

• How do you interpret the results?

Both tests give similar results: there is no evidence that the variances differ.

• What limitations does the var.test() have compared to the

 bartlett.test() function?

Bartlett's test can be applied to experimental designs with more than two groups.
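
A small sketch of Bartlett's test applied to three groups. The data are simulated for the illustration (hypothetical groups A, B and C with equal standard deviations, arbitrary seed) and have nothing to do with the energy data set.

```r
set.seed(7)  # arbitrary seed, only for reproducibility

# Three simulated groups (hypothetical data) with equal standard deviations
values <- c(rnorm(10, mean = 8,  sd = 1.2),
            rnorm(10, mean = 10, sd = 1.2),
            rnorm(10, mean = 12, sd = 1.2))
group  <- factor(rep(c("A", "B", "C"), each = 10))

# Bartlett's test accepts any number of groups; df = number of groups - 1
res <- bartlett.test(values ~ group)
print(res)
```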

4.4 Power and sample size calculations

The power, or the ability of a statistical test to find a difference when there really is

one (i.e. the true positive rate) depends on four factors: the difference in means, the

variance, the false positive rate and the sample size. In R the power.t.test() function can

be used to calculate any one of these quantities when the others are given. The

same function can be used for one-sample and two-sample scenarios, for the latter in

both paired and non-paired versions, using the type argument.
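
A minimal sketch of how power.t.test() is called. The values of n, delta and sd below are assumptions chosen only for illustration; the argument that is left out is the one the function solves for.

```r
# Power of a two-sample t test with 10 subjects per group
# (n, delta and sd are assumed values, chosen only for illustration)
res <- power.t.test(n = 10, delta = 2, sd = 1.3, sig.level = 0.05)
print(res$power)

# Leaving out n instead makes the function solve for the sample size
# per group, here for a power of 90%
res_n <- power.t.test(delta = 2, sd = 1.3, sig.level = 0.05, power = 0.9)
print(ceiling(res_n$n))

# The type argument switches to a one-sample (or paired) design
res_one <- power.t.test(n = 10, delta = 2, sd = 1.3, sig.level = 0.05,
                        type = "one.sample")
print(res_one$power)
```

Non-integer sample sizes should be rounded up, as done with ceiling() above.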


Before we can insert the values into the function call, we need to estimate the standard
deviation of each stature group separately and then average the two estimates. The
standard deviations of the expenditure values for the lean and obese women can be
calculated with:

lean_sd <- sd (subset(expend, stature=="lean"))

obese_sd <- sd (subset(expend, stature=="obese"))

• Calculate the average of the standard deviations and find out how many samples

would be needed (in each group) to find a difference in means of 2, for a power of 

80% and a 5% level for statistical significance.

average_sd <- (lean_sd+obese_sd)/2

power.t.test (delta=2, sd=average_sd, sig.level=0.05, power=0.8)

Two-sample t test power calculation

n = 7.903191
delta = 2
sd = 1.317976
sig.level = 0.05
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group
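Since a sample size must be a whole number, the fractional n has to be rounded up in practice. A minimal sketch, with the sd value copied from the output above so that it runs without the energy data:

```r
# sd value copied from the output above (stands in for average_sd)
result <- power.t.test(delta = 2, sd = 1.317976, sig.level = 0.05, power = 0.8)

# Round up: rounding down would leave the power below the requested 80%.
n_per_group <- ceiling(result$n)
n_per_group

# Power actually achieved with the rounded-up sample size
achieved_power <- power.t.test(n = n_per_group, delta = 2, sd = 1.317976,
                               sig.level = 0.05)$power
achieved_power
```

With these inputs, 8 women would be needed in each group.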

• What is the smallest difference in means we can expect to identify assuming a
sample size of 10, the standard deviation we calculated above, a 5% significance
level and 80% power?

power.t.test (n=10, sd=average_sd, sig.level=0.05, power=0.8)

Two-sample t test power calculation

n = 10
delta = 1.746246
sd = 1.317976
sig.level = 0.05
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group

• How much larger would the sample size have to be in the previous example if we

want to use a more conservative significance level of 1%?

power.t.test (delta=1.746246, sd=average_sd, sig.level=0.01, power=0.8)

Two-sample t test power calculation

n = 15.04787
delta = 1.746246
sd = 1.317976
sig.level = 0.01
power = 0.8
alternative = two.sided

NOTE: n is number in *each* group

• Using the data in the last example, how much would the power improve if we could

use the paired version of the t test instead?

power.t.test (n=15.04787, delta=1.746246, sd=average_sd, sig.level=0.01, type="paired")

Paired t test power calculation

n = 15.04787
delta = 1.746246
sd = 1.317976
sig.level = 0.01
power = 0.973021
alternative = two.sided

NOTE: n is number of *pairs*, sd is std.dev. of *differences* within pairs
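As the note in the output says, the sd argument in the paired case is the standard deviation of the within-pair differences, not of the individual measurements. If both measurements in a pair have the same standard deviation s and correlation rho, the differences have standard deviation s*sqrt(2*(1-rho)), so passing the group value unchanged corresponds to assuming rho = 0.5. The sketch below shows how the paired power would vary with this correlation; the rho values are hypothetical and the other numbers are copied from the output above.

```r
group_sd <- 1.317976   # sd copied from the output above
rho <- c(0, 0.5, 0.8)  # hypothetical within-pair correlations

# sd of the within-pair differences for each assumed correlation
diff_sd <- group_sd * sqrt(2 * (1 - rho))

# power of the paired test for each case, other settings as in the exercise
paired_power <- sapply(diff_sd, function(s) {
  power.t.test(n = 15.04787, delta = 1.746246, sd = s,
               sig.level = 0.01, type = "paired")$power
})

data.frame(rho, diff_sd, paired_power)
```

The stronger the correlation within pairs, the smaller the standard deviation of the differences and the larger the gain from the paired design.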

Let's finish off with a longer piece of code, just to show what R is capable of with a
little effort. Copy the following lines and paste them into the console window to
run the code. If you are an avid programmer you may not find the code very pretty,
but the point is not to show off beautiful code. Rather, it shows that even somebody
without any real programming experience can produce a rather complex plot with some
relatively simple lines of code (see figure 4).

• Read the code and see if you understand how it works.

• Make changes to parts of the code and see how it affects the output.

• Please feel free to improve on the code!

# set up variables
x <- numeric (100)
power <- numeric (100)
n <- 2
delta <- 0
delta_step <- 0.1

# loop through delta values from 0 to 99*delta_step
for (loop in 1:100)
{
  result <- power.t.test (delta=delta, n=n, sd=average_sd, sig.level=0.05)
  power [loop] <- result$power
  x [loop] <- delta
  delta <- delta+delta_step
}

# make the plot for n=2, scale axis, set axis labels and title
plot (x, power, pch=20, ylim=c(0,1.1),
      main=paste("Power for sample sizes 2-8\nsd = ", average_sd),
      xlab="Difference in mean", ylab="Power")

# add the additional plots for sample sizes 3 to 8
for (n in 3:8)
{
  delta <- 0
  # loop through delta values from 0 to 99*delta_step
  for (loop in 1:100)
  {
    result <- power.t.test (delta=delta, n=n, sd=average_sd, sig.level=0.05)
    power [loop] <- result$power
    x [loop] <- delta
    delta <- delta+delta_step
  }
  # add points to original plot
  points (x, power, pch=20, col=n-1)
}

# add legend to plot (colours 1 to 7 match sample sizes 2 to 8)
legend (x=7, y=0.4, pch=20, col=1:7,
        legend=paste (c("2","3","4","5","6","7","8"), "samples/group"))

[Plot: "Power for sample sizes 2-8, sd = 1". x-axis: Difference in mean (0 to 10); y-axis: Power (0.0 to 1.0); legend: 2 to 8 samples/group.]

Figure 4: An example of the output of some more advanced R coding.
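As one possible response to the invitation to improve on the code, here is a sketch of a more compact version that replaces the explicit loops with sapply() and draws all curves in a single matplot() call. It is only one of many ways to tidy the script; the name sd_assumed and its value of 1 are placeholders so that the example runs on its own, and should be replaced by average_sd from the exercises.

```r
sd_assumed <- 1                      # placeholder; use average_sd from the exercises
deltas <- seq(0, 9.9, by = 0.1)      # same 100 differences as the loop version
sample_sizes <- 2:8

# One column of power values per sample size, one row per difference in means
power_matrix <- sapply(sample_sizes, function(n) {
  sapply(deltas, function(d) {
    power.t.test(n = n, delta = d, sd = sd_assumed, sig.level = 0.05)$power
  })
})

# Draw all curves at once; colours 1 to 7 correspond to sample sizes 2 to 8
matplot(deltas, power_matrix, type = "p", pch = 20, col = 1:7,
        ylim = c(0, 1.1), xlab = "Difference in mean", ylab = "Power",
        main = paste("Power for sample sizes 2-8\nsd =", sd_assumed))
legend(x = 7, y = 0.4, pch = 20, col = 1:7,
       legend = paste(sample_sizes, "samples/group"))
```

Keeping the differences and sample sizes in vectors means that changing the range of either only requires editing one line.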

THE END !