8/9/2019 Excercises in Basic Statistics Using R, TEACHER
http://slidepdf.com/reader/full/excercises-in-basic-statistics-using-r-teacher 1/20
Microarray data analysis, module IV Spring 2009
Exercises in basic statistics using R
(with suggestions to solutions)
1 Introduction
1.1 What is R?
R is not simply a program but actually a complete programming environment,
specifically tailored to suit the needs of statisticians with skills in computer
programming. Although this means that it may not be particularly user friendly, it
makes R extremely flexible, expandable and sharable. Users can write new functions
and distribute the code to other users, who can install these packages and run them on their
machines. As R is very popular among scientists who deal with
problems that can be solved computationally, many of the latest and most sophisticated algorithms are first made available as R code.
Bioconductor is a development project, involving many contributors, that aims at
developing extension packages for the analysis of genomic data, with special
emphasis on DNA microarray data analysis. The latest version of Bioconductor, 2.3,
was released in October 2008 and contains almost 300 different analysis packages,
more than 60 example data sets and almost 400 annotation packages.
1.2 Writing conventions used throughout practical exercises
Pull-down menus are referred to within brackets [ ] and in boldface. The highest
menu level is spelled with CAPITAL letters, whereas sub-levels will only have the
first letter capitalized. Here is an example of a top-level menu:
[FILE]
Here is one sub-level menu within the above menu:
[Source R code...]
R code is shown in red colour in Courier New font and is NOT preceded by
the > sign, which is the prompt used in R to show where the next input is going to be,
in order to facilitate pasting into the editor or console window for execution. If a
command call extends over several lines, subsequent lines will be indented by one tab.
Output from R is similarly displayed but coloured blue. Here is one example of how it
may look:
a <- 1
a
[1] 1
Note that throughout the text, the terms command and function will be used synonymously.
1.3 Downloading and installing R
Note: All the computers in the computer class of Biomedicum have R version 2.8.1,
as well as Bioconductor release 2.3, already installed so there is no need to take any
measures in this regard. The following information is given in case you'd like to
install R on your own computer.
The home page for R is http://www.r-project.org, where compiled
versions are freely available for Windows, Mac OS X and Linux. It is also possible to
obtain the source code for porting to other systems. The base level of installation
includes plenty, but not all, of the functionality that is available. Specialized packages,
like the Bioconductor suite, can be downloaded from within the R software once the
base installation is in place. Consult the R home page for detailed instructions
regarding the downloading and installation of the software for the particular platform
you intend to use. A useful note for Windows users is that the installation program
needs to be run with administrator privileges and the R folder has to be set up so that
all users have read/write/execute permission. This will ensure that new packages can be added by any user when they are needed.
1.4 Running R
R can be run in a command line mode (for special applications and/or advanced users)
or through a much more user-friendly graphical user interface (GUI). To start in GUI
mode simply double-click the R 2.8.1 icon on the desktop, which will bring up the R console window shown in figure 1.
Figure 1: Starting R in GUI mode.
The cursor by the > prompt is where R code is typed in. R output that can be
displayed as text will be shown in the same window as used for input, whereas graphs
and images will be displayed in a new window, a so-called graphics device (figure 2).
The contents of the graph window can be saved or copied to the clipboard as a Windows
metafile or in bitmap format using the [Save as] or [Copy to the clipboard] options found under the [FILE] menu. It is possible to have many graphical devices open
simultaneously. Creating new graphical devices and switching between multiple ones
is accomplished using the windows() and dev.set() functions.
All the R code that has been entered during an R session can also be saved, and later
be re-loaded into memory, using the [Save history...] and [Load history...] options of
the [FILE] menu. Similarly, any data that has been saved into a symbolic variable can
also be saved and re-loaded but using the [Save Workspace] and [Load Workspace]
options instead, again found under the [FILE] menu.
Figure 2: Both input and output of text take place in the console window, whereas graphs and images
are directed and shown in a graphics device window.
1.5 Installing additional packages
All the specialized packages that are contributed by the authors of R are stored in
so-called repositories on the Internet. At present there are four different repositories,
including one solely dedicated to the storage of Bioconductor packages. To select from which repository to download packages one can either use the
setRepositories() command or open the [PACKAGES] menu and choose
the [Select repositories...] option. The packages to be downloaded and installed can
then be specified either through the install.packages() command or the
[Install package(s)...] option of the [PACKAGES] menu. Before being shown the
list of packages that are available for download, the user is prompted to select the
mirror site from which to download the packages. Generally, it is advantageous to select a site as close to where you are as possible. Upon selecting a mirror site and the
desired packages, R will monitor the downloading and installation procedure through
status messages echoed to the terminal window.
1.6 Loading packages into memory
In order not to use up all the internal memory of your computer, R does not load
extension packages into its memory by default upon starting. It is hence up to the user
to load packages when needed, which is done either using the library() function or the [Load package...] option of the [PACKAGES] menu.
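As a small illustration of loading a package from the command line, the sketch below attaches the splines package and checks the search path before and after. The choice of splines is only an assumption made for the example: it ships with R but is not attached at start-up, so it stands in for any installed extension package.

```r
# 'splines' is used only as a stand-in for any installed extension package.
before <- "package:splines" %in% search()   # usually FALSE in a fresh session
library(splines)                            # load and attach the package
after <- "package:splines" %in% search()    # TRUE once the package is attached
print(c(before = before, after = after))
```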
1.7 Working directory
By default, R is set up to use the installation folder as its working directory.
This directory is where R stores setup preferences, looks for files to be imported and
saves files that are exported. Changing the working directory is done through the
[Change dir...] option of the [FILE] menu.
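The working directory can also be inspected and changed from the command line with getwd() and setwd(). The following is a minimal sketch; the use of R's temporary folder as the target is just an assumption for the example.

```r
old_dir <- getwd()        # remember the current working directory
setwd(tempdir())          # switch to another folder (here: R's temporary folder)
print(getwd())            # confirm the change
setwd(old_dir)            # switch back to where we started
```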
1.8 Getting help
Perhaps the most used function in R is the help() or ? command. If you do not
know the exact command name you can look for help using the apropos() function
instead, which lists all functions that include the specified text in their names. More
extensive help, including keyword search and links to R user and R reference manuals
can be found in the [HELP] pull-down menu. Within the help files there are usually
one or more examples of how the particular command can be used in practice. If you
want to run the examples yourself, without having to enter the R code by hand in the
console window, there is a very handy function called example() that will do it for
you. Simply give the name of the function, or package, you are interested in as an argument and all the examples associated with it will be run and the results displayed in
the console window (for textual output) or in a graphical device (for graphical
output). Setting the ask argument to TRUE will require the user to hit a key before
a new graphical output overwrites the previous one in the graphical device.
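The help functions described above could be tried out as follows. This is only a sketch: the search text "norm" and the topic mean are arbitrary choices, and the help() call is left as a comment because it opens a viewer in interactive use.

```r
# help("mean")                 # same as ?mean: opens the help page for mean()
hits <- apropos("norm")        # names of all visible objects containing "norm"
print(head(hits))
example("mean", ask = FALSE)   # run the examples from the help page of mean()
```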
2 Exercises - input and output of data
2.1 Symbolic variables
Data in the form of single values, as well as more complex data structures (see
section 2.2 below), can be saved in the internal memory of R for later use by assignment to symbolic variables. The "assignment" operator in R is spelled <-, the
arrowhead pointing towards the variable name and the tail towards the data to be
saved. To create a variable pi that contains the value 3.14 simply type:
pi <- 3.14
The value of a variable can be displayed by evaluating it, i.e. by typing in its name
and pressing Enter:
pi
[1] 3.14
Use the newly created variable to calculate the area of a circle with a diameter of 5
cm and store the results in the variable area. Print the result.
Hint: the area of a circle is pi*r^2
area <- (5/2)*(5/2)*pi
area
[1] 19.625
As you see R can be used simply as a calculator (a very powerful one) but as you'll
notice in subsequent exercises it can do so much more than that.
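The same calculation can be written with the power operator ^. In the sketch below the names pi_approx, radius and area_exact are made up for the illustration; pi_approx is used so that R's built-in constant pi is not masked, and base::pi refers to that built-in value even if a variable called pi has been created.

```r
pi_approx <- 3.14                  # hypothetical name, avoids masking R's built-in pi
radius <- 5 / 2                    # a diameter of 5 cm gives a radius of 2.5 cm
area <- pi_approx * radius^2       # ^ is the power operator
print(area)                        # 19.625
area_exact <- base::pi * radius^2  # the same area with R's built-in constant
print(area_exact)
```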
To manage symbolic variables R provides a number of functions, of which perhaps
the most widely used are the ones for listing all variables in memory and for removing
them. Listing of variables can be done either by the objects() command or, as
Unix/Linux users will know, by the ls() command. Most (if not all) of the
commands in R will take some input in the form of arguments, which are to be put
within the parentheses attached to the command. Note that even if you do not wish to
provide any arguments with the command you still need to add the parentheses, just
without contents. Displaying the current list of objects loaded in memory so far should yield:
objects()
[1] "area" "pi"
Objects can be removed using the rm() function, giving the name of the variable to be
removed as argument (or a list of names in a vector).
• Try to remove the object area from the memory.
Hint: Use the help() or ? function to find out the details on how to use the rm() function.
rm("area")
How would you remove all variables in memory in one go?
Hint: feed the output of the objects() function to the list argument of the rm() command.
rm(list=objects())
objects()
character(0)
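The list argument of rm() also accepts a hand-written vector of names. The sketch below shows both uses inside a separate scratch environment (a hypothetical helper created only for this example), so that nothing in your own workspace is deleted when you run it.

```r
scratch <- new.env()                        # hypothetical scratch environment
assign("a", 1, envir = scratch)
assign("b", 2, envir = scratch)
assign("c", 3, envir = scratch)
rm(list = c("a", "b"), envir = scratch)     # remove several objects by name
left <- ls(scratch)                         # only "c" remains
rm(list = ls(scratch), envir = scratch)     # remove everything that is left
print(left)
print(ls(scratch))                          # character(0)
```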
2.2 Loading more complex data into R
Series of data can be loaded into memory by input from the command line with the
c() or the scan() commands, read from an external text file using read.table(), input through the built-in data editor using edit() or loaded
from the internal data repositories by the data() command. Note: the data editor is
devised to work with two-dimensional data sets only. There are also a couple of very
useful functions, seq() and rep() for creating regular or repeated sequences of
numbers.
Try to create two variables, list1 and list2, using the first two methods mentioned
above, that contain the following values:
0 2 4 6 8 10 12 14 16 18 20
list1 <- c(0,2,4,6,8,10,12,14,16,18,20)
list2 <- scan()
1: 0 2 4 6 8 10 12 14 16 18 20
12:
Read 11 items
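When typing at the prompt is not practical, for instance in a script, scan() can read the numbers from a character string instead. This sketch assumes a version of R in which scan() has the text argument, which may not be the case for the 2.8.1 release installed in the computer class.

```r
list1 <- c(0, 2, 4, 6, 8, 10, 12, 14, 16, 18, 20)
list2 <- scan(text = "0 2 4 6 8 10 12 14 16 18 20")  # reads 11 numbers from the string
print(identical(list1, list2))                        # both methods give the same vector
```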
How would you use the seq() function to produce the same series of data in an
automated way?
list3 <- seq(0,20,2)
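A few further variations of seq() and rep() are sketched below. The variable names are made up for the illustration.

```r
by_step   <- seq(0, 20, by = 2)            # step size given
by_length <- seq(0, 20, length.out = 11)   # number of values given instead
alternate <- rep(c(1, 2), times = 3)       # 1 2 1 2 1 2
blocks    <- rep(c(1, 2), each = 3)        # 1 1 1 2 2 2
print(by_step); print(alternate); print(blocks)
```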
Now let's look at how the same data could be imported from an external data file.
In order to do that we need to first create that file, which can easily be done in a text
editor, such as Notepad, or a spreadsheet program, like Excel.
Use either Notepad or Excel to generate a file with the same data as above and use
the read.table() command to get the contents of the file into R.
Hint 1: make sure the file is saved as text
Hint 2: make sure the file is saved in the working directory of R, or change the
working directory according to where you save the file
list4 <- read.table(file="file.txt")
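Note that read.table() returns a table (a data frame) rather than a plain vector. The sketch below creates a temporary text file from within R, as a stand-in for the file made in Notepad or Excel, and shows how the single column can be extracted; the temporary file and the name values are assumptions for the example.

```r
tmp <- tempfile(fileext = ".txt")               # hypothetical stand-in for file.txt
writeLines(as.character(seq(0, 20, 2)), tmp)    # one value per line
list4 <- read.table(file = tmp)                 # no header line, so the column is named V1
str(list4)                                      # a data frame with 11 rows and 1 column
values <- list4$V1                              # extract the column as a plain vector
print(values)
```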
2.3 Exporting data from R
In order to output contents of variables, or analysis results, one can use either the write() or sink() commands.
Explore how they work by creating output files of the variables you just created.
write(list1, file="list1.txt")
etc.
sink(file="lists.txt")
list1
list2
list3
list4
sink()
What advantage does sink() provide over the write() command?
Allows the collection of the output from numerous variables, or even functions, into a
single file.
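The way sink() collects several outputs in one file can be checked by reading the file back into R. This is a sketch only: the temporary file name is an assumption, and print() is called explicitly because results are not printed automatically when the code is run from a script.

```r
out_file <- tempfile(fileext = ".txt")   # hypothetical output file
list1 <- seq(0, 20, 2)
sink(file = out_file)                    # start diverting console output to the file
print(list1)
print(summary(list1))
sink()                                   # stop diverting
captured <- readLines(out_file)          # read the file back to inspect it
print(captured)
```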
For more complex data structures, like tables, there are more specialized export
functions. write.table() is a general function for writing tabular data to files, with argument options to suit a wide variety of layout and delimiter formats.
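As an illustration, the sketch below writes a small made-up table as tab-delimited text and reads it back. The table contents and the file name are invented for the example.

```r
tab <- data.frame(id = 1:3, value = c(2.5, 3.1, 4.8))   # made-up example table
tab_file <- tempfile(fileext = ".txt")                  # hypothetical output file
write.table(tab, file = tab_file, sep = "\t",           # tab as column delimiter
            quote = FALSE, row.names = FALSE)           # no quotes, no row names
back <- read.table(tab_file, header = TRUE, sep = "\t") # read the file back in
print(back)
```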
3 Exercises - descriptive statistics
3.1 Checking the distribution of data - histograms
There are many functions available in the base version of R for displaying and
exploring the distribution or other characteristics of data. Let's have a closer look at some of the most useful by applying them to some data.
As most statistical tests that are commonly used assume the data to follow a normal
distribution it might be informative to see what normally distributed data looks like.
Using the rnorm() function one can create normally distributed series of data of any
desired size and properties.
• Create a data set x consisting of 20 values, with a mean of 100 and a standard
deviation of 25, and use the hist() command to plot a frequency distribution of the
data.
x <- rnorm(20,mean=100,sd=25)
hist(x)
• How can one achieve a plot that shows the density, or proportion, instead of the
absolute counts on the Y-axis?
hist(x,freq=FALSE)
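To see what freq=FALSE changes, the numbers behind the histogram can be inspected without drawing anything. In this sketch set.seed() is added only to make the random numbers reproducible; the seed value is arbitrary.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
x <- rnorm(20, mean = 100, sd = 25)
h <- hist(x, plot = FALSE)                 # compute the histogram without drawing it
print(h$counts)                            # bar heights used when freq = TRUE
print(h$density)                           # bar heights used when freq = FALSE
total_area <- sum(h$density * diff(h$breaks))
print(total_area)                          # the bar areas add up to 1
```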
In order to see how far from the theoretical normal distribution the empirical data is,
R provides the dnorm() function that generates probability densities for normally distributed data with user-determined properties. In combination with the curve() function it is possible to overlay these theoretical densities on the histogram. Try the
following code, which should achieve a graph similar to the one shown in figure 3:
hist(x, col="blue", density=4, angle=60, freq=FALSE,
	xlim=c(0,200), main="Example of a histogram\nwith a normal curve overlaid")
curve(dnorm(x,mean=100,sd=25),col="red", add=TRUE)
These simple lines of code are a good example of the power and flexibility of R,
letting us create and customize graphs to our liking. Using the main argument in the
hist() call has allowed the addition of a title on the top of the graph, the "\n"
escape sequence inside the text indicating a line break so that the title is split into two
lines. The col argument is used to specify the colour of the plotted object, whereas
setting the add argument to TRUE has forced the normal curve to be drawn on top of
the existing histogram, rather than generating a new plot as is done by default. Finally,
the density and angle arguments have been used to control the way that the
histogram bars are filled with angled lines. To get a more complete list of all options
to control the way things are plotted, please take a look at the help page for the
generic plot() function.
[Plot: histogram titled "Example of a histogram with a normal curve overlaid"; x-axis: x, 0 to 200; y-axis: dnorm(x, mean = 100, sd = 25), 0.000 to 0.015]
Figure 3: An example of how different plot elements can be
combined into the same graph using the add argument.
• Repeat the whole procedure for smaller and larger random series of normally
distributed data and see how it affects the resemblance to the theoretical distribution.
Also, experiment with the breaks argument of the hist() function to achieve the
most appropriate output for any given size of data series.
x <- rnorm(1000,mean=100,sd=25)
hist(x, freq=FALSE, breaks=20)
curve(dnorm(x,mean=100,sd=25),add=TRUE)
3.2 Checking the distribution of data - boxplots
A convenient way to summarise the properties of the data is to display it in a boxplot
using the boxplot() function. Apply the function to the largest of the series that you used for the histogram.
boxplot(x)
• What additional information does the boxplot provide over the histogram?
The median, the quartiles (and thus the interquartile range), the range of the non-outlying values and any outliers are clearly shown.
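The numbers that a boxplot is drawn from can be obtained with boxplot.stats(). The sketch below uses a reproducible random series; the seed value is an arbitrary choice.

```r
set.seed(1)                                # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 100, sd = 25)
bs <- boxplot.stats(x)
print(bs$stats)   # lower whisker, lower hinge, median, upper hinge, upper whisker
print(bs$out)     # values that would be drawn as individual outlier points
```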
If one wants to compare the distribution properties of two
different variables it can be useful to plot them side by side in the graph window. The
par() function gives the user total control over the graph window layout, including the option to partition it into virtually any arrangement of panels. Use the mfrow
argument to split the window into two panels. Create a data set with skewed
distribution by log-transforming the variable and plot boxplots of it and its
transformation side by side.
lx <- log(x)
par(mfrow=c(1,2))
boxplot(x, main="Normal")
boxplot(lx, main="Log-transformed")
par(mfrow=c(1,1))
The last line of code resets the plotting window so that future plots will again fill the
whole screen.
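An alternative way of resetting the layout is to store the settings that par() returns and hand them back afterwards. This is a sketch under two assumptions: a reproducible random series is generated first, and the plots are sent to a file-less PDF device so that the code also runs without a screen (leave out the pdf() and dev.off() lines to draw in the normal graph window).

```r
pdf(NULL)                                # file-less device, only so this runs without a screen
set.seed(1)                              # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 100, sd = 25)
old_par <- par(mfrow = c(1, 2))          # par() returns the previous settings
boxplot(x, main = "Normal")
boxplot(log(x), main = "Log-transformed")
par(old_par)                             # restore whatever layout was active before
layout_after <- par("mfrow")
dev.off()
```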
3.3 Checking the distribution of data - quantile-quantile plots
Perhaps the most revealing of the descriptive plots, when it comes to assessing deviations from normality, is the quantile-quantile plot. It is constructed so that the
quantiles of the empirical distribution are plotted against their theoretical counterparts.
The more the resulting curve deviates from the 45-degree diagonal the further the
empirical data is from the normal distribution.
• Use the qqnorm() function to create the plot for both the variables used in the
boxplot example.
qqnorm(x)
Notice that the theoretical distribution is standardised to a mean of 0 and a
standard deviation of 1.
• How can you scale the empirical data to match the theoretical, thus enabling a more
meaningful assessment of the degree of normality?
Hint: the centre of the empirical distribution is shifted horizontally by the mean and
its width is inflated by a factor equal to the standard deviation.
scaledx <- (x-mean(x))/sd(x)
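The effect of the scaling can be verified numerically, and the built-in scale() function gives the same result in one call. The sketch below also shows that qqnorm() can return the plotted coordinates without drawing; the seed and the names same and qq are assumptions for the example.

```r
set.seed(1)                                       # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 100, sd = 25)
scaledx <- (x - mean(x)) / sd(x)                  # centre on 0, rescale to unit sd
print(c(mean = mean(scaledx), sd = sd(scaledx)))
same <- all.equal(scaledx, as.vector(scale(x)))   # scale() does the same in one call
qq <- qqnorm(scaledx, plot.it = FALSE)            # qq$x: theoretical, qq$y: empirical quantiles
print(same)
```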
The diagonal representing perfect correlation between the two distributions can be
added using the qqline() command.
• Add the line to the plot using the qqline() command and try to specify
arguments that will cause it to be drawn 4 times thicker than normal in a dot-dash
fashion and in green colour.
qqline(x, lty="dotdash", lwd=4, col="green")
The line representing perfect agreement can also be added using the abline() function,
specifying a line with intercept=0 and a slope=1:
abline(0,1,col="blue", lwd=2)
• Compare this line with the one drawn by the qqline() function. Do you notice
something strange, and if so, why does it occur?
Hint: Check the help page of qqline().
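The qqline is forced through the first and third quartiles of the empirical data, paired with the corresponding
theoretical quartiles, so it does not generally coincide with the line with intercept 0 and slope 1.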
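Following the description on the help page of qqline(), the line can be reproduced by hand from the two quartiles. This is a sketch based on that description rather than on the function's source code, using a reproducible random series; for unscaled data the slope ends up near the standard deviation and the intercept near the mean.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
x <- rnorm(1000, mean = 100, sd = 25)
y_q <- quantile(x, c(0.25, 0.75))              # empirical first and third quartiles
x_q <- qnorm(c(0.25, 0.75))                    # matching standard normal quartiles
slope <- unname(diff(y_q) / diff(x_q))         # roughly the sd of x
intercept <- unname(y_q[1] - slope * x_q[1])   # roughly the mean of x
print(c(intercept = intercept, slope = slope))
```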
4 Exercises - statistical tests
The base version of R includes functions to perform both parametric and non-
parametric tests. It is possible to test for difference between a group mean and a fixed
value (one sample test) or test for difference between means of two (or more) separate
groups (two-sample test). To see how they work we first need some real-life experimental data, like these measurements of energy intake in kJ for 11 women
reported by Altman et al. (1991).
5260 5470 5640 6180 6390 6515 6805 7515 7515 8230 8770
• Store them in an appropriate variable and report some summary statistics using the
mean(), sd(), quantile() and finally the summary() functions.
intake <- c(5260,5470,5640,6180,6390,6515,6805,7515,7515,8230,8770)
mean(intake)
[1] 6753.636
sd(intake)
[1] 1142.123
quantile(intake)
  0%  25%  50%  75% 100%
5260 5910 6515 7515 8770
summary(intake)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
   5260    5910    6515    6754    7515    8770
4.1 One-sample tests
• Now use the t.test() function to answer the question whether the average
energy intake for these women deviates significantly (at a 5% significance level) from
the recommended value of 7725 kJ.
t.test(intake,mu=7725)
	One Sample t-test
data: intake
t = -2.8208, df = 10, p-value = 0.01814
alternative hypothesis: true mean is not equal to 7725
95 percent confidence interval:
 5986.348 7520.925
sample estimates:
mean of x
 6753.636
• How do you interpret the confidence interval in the reported results?
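With 95% confidence, the true value of the mean lies between the confidence limits. As the interval does not include 7725, the mean intake differs significantly from the recommended value.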
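The figures reported by t.test() can be reproduced step by step, which may help in interpreting them. The sketch below uses the standard formulas for the one-sample t-test; the variable names are made up for the illustration.

```r
intake <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
n  <- length(intake)
se <- sd(intake) / sqrt(n)                     # standard error of the mean
t_stat <- (mean(intake) - 7725) / se           # t statistic, about -2.8208
p_two  <- 2 * pt(-abs(t_stat), df = n - 1)     # two-sided p-value, about 0.01814
ci <- mean(intake) + c(-1, 1) * qt(0.975, df = n - 1) * se   # 95% confidence interval
print(c(t = t_stat, p = p_two))
print(ci)
```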
• Now perform a similar but one-sided test to answer whether the average energy
intake for the group is significantly lower than the recommended intake of 7725 kJ.
t.test(intake,mu=7725, alternative="less")
	One Sample t-test
data: intake
t = -2.8208, df = 10, p-value = 0.009069
alternative hypothesis: true mean is less than 7725
95 percent confidence interval:
     -Inf 7377.781
sample estimates:
mean of x
 6753.636
• What happened to the significance of the test and why?
The difference became more significant: the p-value was halved (from 0.01814 to 0.009069). In a one-sided test
the whole 5% rejection region lies in a single tail, so the same t value falls further into it.
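The relationship between the two p-values can be checked directly from the objects returned by t.test(). In this sketch the names two_sided, one_sided and ratio are made up for the illustration.

```r
intake <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
two_sided <- t.test(intake, mu = 7725)
one_sided <- t.test(intake, mu = 7725, alternative = "less")
print(c(two_sided = two_sided$p.value, one_sided = one_sided$p.value))
ratio <- two_sided$p.value / one_sided$p.value   # 2, because t lies in the tested tail
print(ratio)
```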
• Try the non-parametric version of the t-test on the same data using the
wilcox.test() function.
wilcox.test(intake,mu=7725)
	Wilcoxon signed rank test with continuity correction
data: intake
V = 8, p-value = 0.0293
alternative hypothesis: true location is not equal to 7725
Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(intake, mu = 7725)
• How did this affect the p-values and what does it indicate about the sensitivity of the
non-parametric compared to the parametric tests?
The p-value increased, implying that the non-parametric test is slightly less sensitive.
• Did you notice something strange in the results?
A warning was given due to ties occurring in the data.
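The ties behind the warning can be located by looking at the differences from the tested value, which are what the signed rank test ranks. The sketch below also asks for the normal approximation explicitly with exact=FALSE, which should avoid the warning since an exact p-value is then no longer requested; the names d and res are made up for the illustration.

```r
intake <- c(5260, 5470, 5640, 6180, 6390, 6515, 6805, 7515, 7515, 8230, 8770)
d <- intake - 7725                       # differences that the signed rank test ranks
print(table(abs(d)))                     # the value 210 occurs twice: a tie
res <- wilcox.test(intake, mu = 7725, exact = FALSE)   # normal approximation requested
print(res$p.value)
```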
• When would it not be an ideal choice to use non-parametric tests?
1. when there are many ties in the data
2. when there are very few (fewer than 6) data points
4.2 Two-sample tests
Let's load one of the built-in data sets to illustrate the tests for difference between
two groups. There is a data set from the same report by Altman et al. that is included
in the ISwR package, which includes daily energy expenditure for a group of lean
and a group of obese women. To load the data into the working memory of R we
first need to install the ISwR package into the internal library of packages, as it is not a
part of the base distribution of R. This is done by following these steps:
1. Open the [Packages->Set CRAN mirror...] menu and select a download site
located close to Finland.
2. Open the [Packages->Install package(s)...] menu and select the ISwR package
from the pull-down menu.
3. Open the [Packages->Load package...] menu and select the ISwR package from
the pull-down menu.
4. Type data(energy) at the command line to prepare the data.
The data is now stored in the variable energy, but it is not quite ready to be analysed just yet.
If you display the contents of the energy variable you'll notice that it contains
two columns with headers, one for the energy expenditure and one for the stature. In
order to access the contents of the columns in a convenient way we need to make the
variable a part of the R search path. We do that by typing:
attach(energy)
We can now access the expenditure and stature data simply by their header names,
which finally allows us to move on to the actual statistical test.
• Use the formula interface in the argument list of t.test() to answer whether there is a statistically significant difference in the average energy expenditure between
lean and obese women.
t.test(expend~stature)
	Welch Two Sample t-test
data: expend by stature
t = -3.8555, df = 15.919, p-value = 0.001411
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.459167 -1.004081
sample estimates:
 mean in group lean mean in group obese
           8.066154           10.297778
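The formula interface can also be pointed at a table through its data argument, which makes attach() unnecessary. The sketch below uses simulated values with the same column names and group sizes as the energy data; the numbers are NOT the real measurements, and the name sim is made up for the example.

```r
set.seed(1)                                    # arbitrary seed, for reproducibility only
# Simulated stand-in for the energy data (NOT the real ISwR values)
sim <- data.frame(
  expend  = c(rnorm(13, mean = 8, sd = 1.2), rnorm(9, mean = 10, sd = 1.4)),
  stature = factor(rep(c("lean", "obese"), times = c(13, 9)))
)
res_data <- t.test(expend ~ stature, data = sim)            # no attach() needed
res_with <- with(sim, t.test(expend[stature == "lean"],
                             expend[stature == "obese"]))   # same test, vectors given directly
print(c(formula = res_data$p.value, vectors = res_with$p.value))
```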
• Repeat the test with the modification that the variances in the two groups are assumed to be equal.
t.test(expend~stature,var.equal=TRUE)
	Two Sample t-test
data: expend by stature
t = -3.9456, df = 20, p-value = 0.000799
alternative hypothesis: true difference in means is not equal to 0
95 percent confidence interval:
 -3.411451 -1.051796
sample estimates:
 mean in group lean mean in group obese
           8.066154           10.297778
• What effect can be observed on the sensitivity of the test and why is it so?
Sensitivity increases slightly (the p-value drops from 0.001411 to 0.000799) as the two group variances are pooled into
a single estimate, which raises the degrees of freedom from 15.919 to 20.
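How the two versions of the test differ in their degrees of freedom can be seen on any pair of samples. The sketch below uses simulated values with the same group sizes as the energy data (13 and 9); they are not the real measurements.

```r
set.seed(1)                                       # arbitrary seed, for reproducibility only
lean  <- rnorm(13, mean = 8,  sd = 1.2)           # simulated, not the ISwR values
obese <- rnorm(9,  mean = 10, sd = 1.4)           # simulated, not the ISwR values
welch  <- t.test(lean, obese)                     # separate variance per group (default)
pooled <- t.test(lean, obese, var.equal = TRUE)   # one pooled variance estimate
print(c(welch_df = unname(welch$parameter), pooled_df = unname(pooled$parameter)))
```

The pooled test always has n1 + n2 - 2 degrees of freedom (here 20), whereas the Welch approximation gives a value that is at most that large.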
• Can you figure out an experimental design where it is reasonable to assume that
variances between groups are equal?
A typical situation is when the same subjects are sampled during two different
experimental conditions, like a group of patients before and after treatment.
• Now repeat the test using the non-parametric wilcox.test() instead.
wilcox.test(expend~stature)
	Wilcoxon rank sum test with continuity correction
data: expend by stature
W = 12, p-value = 0.002122
alternative hypothesis: true location shift is not equal to 0
Warning message:
cannot compute exact p-value with ties in: wilcox.test.default(x = c(7.53, 7.48, 8.08, 8.09, 10.15, 8.4,
4.3 Testing for equality of variances
If you are unsure whether it is appropriate to pick the more sensitive tests, which
assume equal variances between groups, you may wish to formally test whether the
variances of two groups being compared are significantly different using either the
var.test() or bartlett.test() functions.
• Try them out on the energy data set.
var.test(expend~stature)
F test to compare two variances
data: expend by stature
F = 0.7844, num df = 12, denom df = 8, p-value = 0.6797
alternative hypothesis: true ratio of variances is not
equal to 1
95 percent confidence interval:
0.1867876 2.7547991
sample estimates:
ratio of variances
0.784446
bartlett.test(expend~stature)
	Bartlett test of homogeneity of variances
data: expend by stature
Bartlett's K-squared = 0.1362, df = 1, p-value = 0.712
• How do you interpret the results?
Both tests give similar results: there is no evidence that the variances differ.
• What limitations does the var.test() have compared to the
bartlett.test() function?
Bartlett's test can be applied to experimental designs with more than two groups.
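The difference can be tried out on a data set with three groups. The sketch below uses PlantGrowth, a data set that ships with R, purely as a stand-in; it is not part of these exercises.

```r
# PlantGrowth: plant weights in three groups (ctrl, trt1, trt2)
data(PlantGrowth)

# Bartlett's test accepts any number of groups
bart <- bartlett.test(weight ~ group, data = PlantGrowth)
print(bart)

# var.test() requires exactly two groups, so the same formula fails;
# try() catches the error so that the script keeps running
vt <- try(var.test(weight ~ group, data = PlantGrowth), silent = TRUE)
print(inherits(vt, "try-error"))
```

With three groups, Bartlett's test has two degrees of freedom, whereas var.test() stops with an error about the number of levels in the grouping factor.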
4.4 Power and sample size calculations
The power of a statistical test is its ability to find a difference when there really
is one (i.e. the true positive rate). It depends on four factors: the difference in
means, the variance, the false positive rate and the sample size. In R the
power.t.test() function calculates any one of these quantities, including the power
itself, when the others are given. The same function can be used for one-sample and
two-sample scenarios, for the latter in both paired and non-paired versions, using
the type argument.
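As a rough illustration of this "leave one quantity out" behaviour, the sketch below calls power.t.test() in a few ways. The input values (a difference of 1, a standard deviation of 1, 20 samples) are arbitrary and chosen only for illustration.

```r
# Leave out 'power': it is computed from n, delta, sd and sig.level
p_two <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05)

# Leave out 'n': the sample size needed for 90% power is computed
n_req <- power.t.test(delta = 1, sd = 1, sig.level = 0.05, power = 0.9)

# Same inputs as the first call, but for other designs via 'type'
p_pair <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05,
                       type = "paired")
p_one  <- power.t.test(n = 20, delta = 1, sd = 1, sig.level = 0.05,
                       type = "one.sample")

print(p_two$power)
print(n_req$n)
print(p_pair$power)
print(p_one$power)
```

The default is type = "two.sample". For the paired version, sd refers to the standard deviation of the within-pair differences.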
Before we can insert the values into the function call we need to estimate the standard
deviation for the two stature groups individually and then calculate their average.
The standard deviations of the expenditure values for the lean and obese women can be
calculated through:
lean_sd <- sd (subset(expend, stature=="lean"))
obese_sd <- sd (subset(expend, stature=="obese"))
• Calculate the average of the standard deviations and find out how many samples
would be needed (in each group) to find a difference in means of 2, for a power of
80% and a 5% level for statistical significance.
average_sd <- (lean_sd+obese_sd)/2
power.t.test (delta=2, sd=average_sd, sig.level=0.05, power=0.8)
Two-sample t test power calculation
n = 7.903191
delta = 2
sd = 1.317976
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
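Since only whole subjects can be recruited, the fractional n is rounded up in practice. A minimal sketch: the standard deviation is typed in from the output above (as avg_sd, a name introduced here) so that the snippet runs without the energy data.

```r
avg_sd <- 1.317976  # value of average_sd copied from the output above

# Required sample size per group for delta = 2, 80% power, 5% level
res <- power.t.test(delta = 2, sd = avg_sd, sig.level = 0.05, power = 0.8)

# Round up to the next whole number of samples per group
n_per_group <- ceiling(res$n)
print(n_per_group)

# Power actually achieved with the rounded-up sample size
achieved <- power.t.test(n = n_per_group, delta = 2, sd = avg_sd,
                         sig.level = 0.05)$power
print(achieved)
```

Rounding up rather than to the nearest integer ensures that the achieved power does not fall below the requested 80%.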
• What is the smallest difference in means we can expect to identify assuming a
sample size of 10, the standard deviation we calculated above, 5% significance level
and 80% power?
power.t.test (n=10, sd=average_sd, sig.level=0.05, power=0.8)
Two-sample t test power calculation
n = 10
delta = 1.746246
sd = 1.317976
sig.level = 0.05
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
• How much larger would the sample size have to be in the previous example if we
want to use a more conservative significance level of 1%?
power.t.test (delta=1.746246, sd=average_sd,
sig.level=0.01, power=0.8)
Two-sample t test power calculation
n = 15.04787
delta = 1.746246
sd = 1.317976
sig.level = 0.01
power = 0.8
alternative = two.sided
NOTE: n is number in *each* group
• Using the data in the last example, how much would the power improve if we could
use the paired version of the t test instead?
power.t.test (n=15.04787, delta=1.746246, sd=average_sd, sig.level=0.01, type="paired")
Paired t test power calculation
n = 15.04787
delta = 1.746246
sd = 1.317976
sig.level = 0.01
power = 0.973021
alternative = two.sided
NOTE: n is number of *pairs*, sd is std.dev. of
*differences* within pairs
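The NOTE in the output matters here: for the paired test, sd is the standard deviation of the within-pair differences, whereas the call above reuses the per-group value. If both measurements have the same standard deviation and correlation rho, the standard deviation of their difference is sd * sqrt(2 * (1 - rho)), which equals sd only when rho = 0.5. The sketch below explores how the power depends on this correlation. The rho values and helper function names are assumptions for illustration, and the numeric inputs are copied from the output above.

```r
sd_single <- 1.317976   # average_sd from the output above
n_pairs   <- 15.04787
delta     <- 1.746246

# Standard deviation of within-pair differences for correlation rho,
# assuming both measurements share the same standard deviation
sd_of_diff <- function(sd, rho) sd * sqrt(2 * (1 - rho))

# Power of the paired t test at the 1% level for a given correlation
paired_power <- function(rho) {
  power.t.test(n = n_pairs, delta = delta,
               sd = sd_of_diff(sd_single, rho),
               sig.level = 0.01, type = "paired")$power
}

p_rho_zero <- paired_power(0)    # uncorrelated measurements
p_rho_half <- paired_power(0.5)  # reproduces the call in the text
p_rho_high <- paired_power(0.8)  # strongly correlated measurements

print(c(p_rho_zero, p_rho_half, p_rho_high))
```

The gain from pairing therefore depends on how strongly the two measurements on the same subject are correlated.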
Let's finish off with a longer piece of code just to show what R is capable of with a
little effort. Just copy the following lines and paste them into the console window to
run the code. If you happen to be an avid programmer you may not find the code very
pretty, but the point is not to show off beautiful code. Rather, it shows that even
somebody without any real programming experience through some relatively simple
lines of code can achieve a rather complex plot (see figure 4).
• Read the code and see if you understand how it works.
• Make changes to parts of the code and see how it affects the output.
• Please feel free to improve on the code!
# setup variables
x <- numeric (100)
power <- numeric (100)
n <- 2
delta <- 0
delta_step <- 0.1
# loop through 100 delta values, starting at 0 in steps of delta_step
for (loop in 1:100) {
  result <- power.t.test (delta=delta, n=n, sd=average_sd, sig.level=0.05)
  power [loop] <- result$power
  x[loop] <- delta
  delta <- delta+delta_step
}
# make the plot for n=2, scale axis, set axis labels
# and title
plot (x, power, pch=20, ylim=c(0,1.1),
      main=paste("Power for sample sizes 2-8\nsd = ", average_sd),
      xlab="Difference in mean", ylab="Power")
# add the additional plots for sample sizes 3 to 8
for (n in 3:8) {
  delta <- 0
  # loop through 100 delta values, starting at 0 in steps of delta_step
  for (loop in 1:100) {
    result <- power.t.test (delta=delta, n=n, sd=average_sd, sig.level=0.05)
    power [loop] <- result$power
    x[loop] <- delta
    delta <- delta+delta_step
  }
  # add points to original plot (colour 1 is used for n=2)
  points (x, power, pch=20, col=n-1)
}
# add legend to plot (seven sample sizes, colours 1 to 7)
legend (pch=20, legend=paste (c("2","3","4","5","6","7","8"), "samples/group"),
        col=1:7, x=7, y=0.4)
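As one possible response to the invitation to improve the code, the sketch below computes the same power values without explicit loop counters and draws all series in one call. It is only one of many ways to do this. The standard deviation is typed in as avg_sd (copied from the earlier output) so that the snippet runs on its own; avg_sd, deltas, sizes and power_for are names introduced here.

```r
avg_sd <- 1.317976                            # average_sd from the earlier output
deltas <- seq(0, by = 0.1, length.out = 100)  # 0, 0.1, ..., 9.9
sizes  <- 2:8                                 # samples per group

# Power of the two-sample t test for one sample size and one difference
power_for <- function(n, delta) {
  power.t.test(n = n, delta = delta, sd = avg_sd, sig.level = 0.05)$power
}

# Matrix with one row per difference and one column per sample size
power_mat <- sapply(sizes, function(n) {
  sapply(deltas, function(d) power_for(n, d))
})

# Draw all seven series at once, one colour per sample size
matplot(deltas, power_mat, type = "p", pch = 20, col = seq_along(sizes),
        ylim = c(0, 1.1), xlab = "Difference in mean", ylab = "Power",
        main = paste("Power for sample sizes 2-8\nsd =", avg_sd))
legend(x = 7, y = 0.4, pch = 20, col = seq_along(sizes),
       legend = paste(sizes, "samples/group"))
```

Keeping the values in a matrix also makes it easy to inspect individual numbers, for example power_mat[21, ] for a difference in means of 2.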
[Plot titled "Power for sample sizes 2-8, sd = 1": power (y-axis, 0.0 to 1.0) against difference in mean (x-axis, 0 to 10), with one series for each of 2 to 8 samples/group.]
Figure 4: An example of the output of some more advanced R coding.
THE END !