

Statistics Toolbox

in

A Review of Analysis Techniques for Scientific

Research

Professional Development Opportunity For Flow Cytometry Core Facility

University of Alberta, Faculty of Medicine

Fall 2018/Winter 2019

Participant Workbook

LKG Consulting

Email: [email protected]

Website: www.consultinglkg.com


Instructor: Laura Gray-Steinhauer, Ph.D. P.Stat.

Email: [email protected]

A little about your instructor:

Laura Gray-Steinhauer has been a designated Professional Statistician with the Statistical Society of Canada since 2014. Over the last 7 years she has taught R as well as introductory, multivariate, and spatial statistics courses at the University of Alberta, the University of Victoria, and for her LKG Consulting clients. Laura wants to help researchers conquer their R programming and statistics challenges and is happy to follow up with and assist all attendees of her workshops. From Laura: “If you have an R or statistics challenge after this workshop, please send me an email and we can work together to figure out your next steps.”

All publication and copy rights for this document are held by LKG Consulting. Neither partial nor complete versions of this document may be reproduced without the written permission of Laura Gray-Steinhauer c/o LKG Consulting.

Copyright © 2018 Laura Gray-Steinhauer. Please contact Laura Gray-Steinhauer, Ph.D., P.Stat. to obtain permission to redistribute the content and examples in this workbook to participants not registered in the University of Alberta, Faculty of Medicine, Flow Cytometry Core Facility Professional Development workshop, “Statistics Toolbox in R: A Review of Analysis Techniques for Scientific Research”, held in 2018/2019 in Edmonton, Alberta.


Welcome to Statistics with R!

Why R for your statistical analysis? R is an open-source software programming language for data manipulation, simulation, statistics, and graphics. It has become the lingua franca among statisticians, and is increasingly being used for data analysis among researchers.

R Advantages:

• State of the art: Researchers provide their methods as R packages. SPSS, S-Plus, and SAS are years behind R!

• Second only to MATLAB for graphics (if you can draw it on paper, you can craft it in R).

• With a little practice, you can do ANYTHING with this software.

And the best part… R is FREE, open-source software! That means you can easily download it onto any computer, and the software, including all the corresponding specialty packages, is updated as the developers make improvements and as new statistical techniques gain popularity.

In this course

In this course we will focus on how to select the appropriate technique from your statistical toolbox for your analysis needs as well as cover a few of the more commonly used techniques in greater depth. All of this while using R!

For some of you this course will feel very fast, intense, and information filled. Don’t worry: no one masters the software or the elements in the toolbox overnight, but with practice it will all become easier, I promise!

So let’s get started!

Details about this Workbook

This workbook is yours to keep. In this course I will meticulously go through examples with you. This way you will see how to enter, execute, and debug code, and how to interpret the results of the techniques we cover. Most of the examples in this workbook are environmentally based (it’s my area of interest), but the techniques can easily be extended to other kinds of data. Keep this book and the example datasets as a resource and try all of these techniques on data from your own project.

A few things to notice/remember:

• To make it easier for you to identify R commands, throughout this course I highlight R commands in Century Gothic font (all other text is in Arial).

E.g: read.table()

mean()

• To further distinguish arbitrary variable names and numbers, I highlight them in bold. You will have to modify those bolded names if you name your objects differently or when you use these techniques with your own data.

E.g. read.table(“filepath”)

mean(mydata)

• For most of the examples in this workbook you are able to execute the R code provided using the data files provided. In some cases, however, the example code is provided for reference


using dataName, filepath, dataTable, columnName, etc. placeholders. In these cases, the

provided code will not execute as is, but will need to be modified to fit your data examples.

• In examples you will see both <- and = used to define objects. In coding languages, the assignment operator (<- in R) says “take whatever is on the right and push it into the object named on the left”, while the double equals sign (==) asks whether two things are equal. In R, either <- or = will work for assignment (with a few exceptions), but to be safe it is best to use <- when defining an object (see the short sketch after this list).

• R is case sensitive (although it ignores extra spaces), so make sure you are keeping careful track of your variable names.

• Brackets are important! R distinguishes between ( ), [ ], and { }. Each of them has a different function so be mindful of which one you are using.
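To see these last three points in action, here is a short optional sketch you can paste into the console (the object names are arbitrary):

x <- 5 #push the value 5 into the object x (preferred assignment syntax)
y = 5 #also assigns, and works in most situations
x == y #the double equals sign tests equality and returns TRUE
X <- 10 #R is case sensitive: X and x are different objects
mean(c(2, 4, 6)) #( ) wraps function arguments; returns 4
c(2, 4, 6)[2] #[ ] extracts elements; returns 4
if (x < X) { print("x is smaller") } #{ } groups lines of code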

Additional R and Statistics Resources

Supplemental R reference material is widely available on the web. A good resource is The R Project website (http://www.r-project.org/), where a list of manuals is available for free download (check out “An Introduction to R: Notes on R, A Programming Environment for Data Analysis and Graphics”).

Additional recommended online resources for R:

• Quick-R (http://www.statmethods.net/) – a great google-style resource for almost anything with clear examples.

• Cookbook for R (http://www.cookbook-r.com/) – a good resource for simple reminders.

There are also MANY statistics references available online and in print. Google is the best resource to find things online (surprise, surprise) – reference books are typically the best resources for statistics reminders.

Additional online courses/tutorials:

• Coursera (http://www.coursera.org/) – offers 2000+ courses developed by partnering academic institutions, including introductory and advanced statistics (many with R). Some are free, but most require enrollment.

• I also offer labs on my personal website for two courses I have taught at the UofA, RENR480 (Introduction to Statistics) and RENR690 (Multivariate and Spatial Statistics). Feel free to pillage these resources.

Additional literature that might be of interest (there are MANY more out there):

• “R in Action, Data Analysis and Graphics with R: Second Edition” by Robert I. Kabacoff, published by Manning 2015.

• “Introductory Statistics with R: Second Edition” by Peter Dalgaard published by Springer 2008.

• “The R Book: Second Edition” by Michael J. Crawley, published by Wiley Press 2013.

• “Statistical Computing with R” by Maria L. Rizzo, published by Chapman & Hall/CRC, 2008.

• “Exploratory Multivariate Analysis by Example Using R” by Francois Husson, Sébastien Lê, and Jérôme Pagès, published by Chapman & Hall/CRC, 2010

• “The Basic Practice of Statistics: Fifth Edition” by David S. Moore published by W. H. Freeman and Company 2010 (Not R related, but a good reference book).

• “Modern Statistics for the Social and Behavioral Sciences” by Rand Wilcox, published by CRC Press, 2012.

• “Statistics Explained: An Introductory Guide for Life Sciences: Second Edition” by Steve McKillup published by Cambridge University Press 2012 (Not R related, but a simple to read statistics reference).

These are popular references, so there may be more updated versions of these texts available.


Unit 1 – Statistics Toolboxes

Simply speaking, the field of statistics exists because it is usually impossible to collect data from all individuals of interest (we call this our population). The only solution is to collect data from a subset of the individuals of interest (we call this a sample), and use those individuals to, hopefully, help us to learn the “truth” about the population.

After a sample is collected, the real challenge for most people is determining what to do next. Often researchers will ask, “I want to learn ____ about my data; what analysis do I need to do?” The answer to this common question depends on a few factors:

(1) What is the goal of your analysis? (What do you want to do?)

(2) What kind of data are you working with (e.g. continuous, categorical, binary, etc.)?

(3) How many variables are included in your analysis?

(4) Do you meet the assumptions for a particular test?

Once you have the basic answers to these questions the decision of what statistical analysis to do generally becomes a lot simpler - you can figuratively reach into your statistical toolbox and pull out the best analysis for the job.

Analysis goals and the corresponding parametric, non-parametric, and binomial tools:

Describe data characteristics
• Parametric (assumptions met): mean, standard deviation, standard error, etc.
• Non-parametric (alternative if assumptions fail): median, quartiles, percentiles
• Binomial (binary data/event likelihood): proportions
(Probability distributions and graphics are always appropriate for describing data.)

Compare 2 distinct/independent groups
• Parametric: t-test, paired t-test
• Non-parametric: Wilcoxon rank-sum test, Kolmogorov-Smirnov test, permutational t-test
• Binomial: Z-test for proportions

Compare > 2 distinct/independent groups
• Parametric: ANOVA, multi-way ANOVA, ANCOVA, blocking
• Non-parametric: Kruskal-Wallis test, Friedman rank test, permutational ANOVA
• Binomial: chi-squared test, binomial ANOVA

Estimate the degree of association between 2 variables
• Parametric: Pearson’s correlation
• Non-parametric: Spearman rank correlation, Kendall’s rank correlation
• Binomial: logistic regression

Predict outcome based on relationship
• Parametric: linear regression, multiple linear regression
• Non-parametric: non-linear regression, logistic regression
• Binomial: odds ratio

[Decision-tree diagrams summarizing the statistics toolbox appear on these pages in the original workbook.]

The table and diagrams above summarize a series of basic options in an effort to help simplify your decision-making process based on your analysis objectives and the characteristics of your data. This toolbox is in no way complete (there are MANY more statistical techniques out there), but it does provide a good list of commonly used tests and tools that are useful for most analyses. Further, learning these fundamental techniques will help you decipher the more complicated statistics you may require for individual projects. In the following units we will explore these techniques in more detail.


Unit 2 – Descriptive Statistics

Descriptive statistics are generally pretty simple – as the name indicates they help describe, show or summarize data in a meaningful way such that, for example, patterns might emerge from the data. Descriptive statistics do not, however, allow us to make any inferences about a population (that is where statistical tests come in – Units 3, 4, 5, 6 and 7). They are simply a way to describe data.

So, you might be saying, “I want to know about my population, why bother with descriptive statistics?” Believe me, descriptive statistics are very important: if we simply presented our raw data it would be hard to visualize what the data were showing, especially if there were a lot of them. Descriptive statistics therefore enable us to present the data in a more meaningful way, which improves our interpretation of the data and contributes to our ultimate inferences about the population.

Example Data

Ok I know that the majority of you (if not all of you) are involved in medical research in some form or another, but for fun let’s get away from medical research for a bit – I know it’s your comfort zone, but sometimes it’s good to branch out!

On my personal website (www.ualberta.ca/lkgray) I have provided a number of datasets we will use to run through examples in the workshop. All of the files are in CSV format (Comma Separated Values). There are several ways to import data to R, but I highly recommend using Excel-generated CSV files because (1) they are plain text files, which are good for long-term data archiving, (2) almost any software package can import them error-free, and (3) you can double-click CSV files to quickly open them in Excel for editing.

For this unit I am going to make you all farmers! We will consider an experiment where three lentil varieties (A, B, C) are tested at two farms and the total yield (YIELD) and average plant height (HEIGHT) within individual plots are recorded. The total protein content (PROTEIN) is also recorded, but only for specific plots (the remaining plots have missing values).

Note that all the data provided on my personal website are 100% hypothetical – please do not use them for purposes other than the examples in this workbook.

• If you have not already done so, import the lentils.csv dataset into R as data1:

data1<-read.csv("lentils.csv", header=T, na.strings="")

head(data1) #view the first 6 rows of data

tail(data1) #view the last 6 rows of data

str(data1) #view data structure

2.1. Basic descriptive statistics functions in R

• There are a number of functions available in the standard R base package that will allow you to calculate basic summary statistics (mean(), median(), etc. – note that R’s mode() function returns an object’s storage mode, not the statistical mode), but to really take advantage of summary statistics it is best to utilize the additional functions available in the tidyr and dplyr packages. To get access to all of these benefits, we first have to install and load the packages:

install.packages(c("dplyr","tidyr"))

library(dplyr)

library(tidyr)


• So now there are a number of descriptive statistics and identification functions that you can take advantage of. Look at the help files for more information on the following functions.

o mean() which is the data average

o median() which is the 50th percentile of the data.

o sqrt() which calculates the square-root of the data.

o sd() which is the standard deviation.

o range() which returns the data range.

o IQR() which is the interquartile range (the difference between the 25th and 75th percentiles).

o mad() which is the median absolute deviation, a robust measure of spread around the median.

o min() which is the data minimum.

o max() which is the data maximum.

o quantile() which returns the data quartiles.

o first() which returns the first value of a variable (or of each group).

o last() which returns the last value of a variable (or of each group).

o nth() which returns the nth value of a variable (or of each group).

o n() which returns the number of data points in the current group – it takes no arguments (nothing goes in the brackets) and only works inside dplyr functions such as summarize().

o length() which returns the number of data points – here you do need to include the data in the brackets.

o n_distinct() which returns the number of unique data points.

o count() which counts the number of rows for each unique value of a variable.

o any() which asks R “given a set of logical vectors, is at least one of the values true?”.

o all() which asks R “given a set of logical vectors, are all of the values true?”.

• Try out the functions above to see some basic summary statistics by inserting the name of the variable you are interested in (e.g. data1$YIELD, data1$HEIGHT) between the brackets – remembering that you can only calculate these statistics for numerical variables. A few of the dplyr helpers are also sketched after the examples below.

# Descriptive statistics examples

mean(data1$YIELD) #returns the mean of YIELD

IQR(data1$YIELD) #returns the interquartile range of YIELD
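A few of the dplyr helpers listed above do not appear elsewhere in this workbook, so here is a brief illustrative sketch of how they behave (assuming data1 has been imported and dplyr is loaded):

n_distinct(data1$VARIETY) #number of unique varieties (3 in this dataset)
count(data1, VARIETY) #number of rows for each variety
any(is.na(data1$PROTEIN)) #is at least one PROTEIN value missing?
all(data1$YIELD > 0) #are all yields positive?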

• Remember, if your variable includes missing values, the result of these functions will be a missing value unless you add the argument that removes missing values before the calculation (na.rm=TRUE). This effectively tells R that you want to “skip” over the NAs; otherwise R will use the default (na.rm=FALSE).

# Skip the missing values

mean(data1$PROTEIN, na.rm=TRUE)

• Percentiles can be calculated with the quantile() function by adding a vector of the percentiles that you want. For example, we can calculate the 2.5th, 5th, 95th and 97.5th percentiles of all lentil yields with the following code:

# Calculate quantiles

quantile(data1$YIELD, c(0.025, 0.05, 0.95, 0.975))

• There are also these handy functions to calculate multiple descriptive statistics at once:

# Returns Tukey's five number summary

# (minimum, lower hinge, median, upper hinge, maximum) for HEIGHT

fivenum(data1$HEIGHT)

#Returns minimum, 25th percentile (1st quartile), median (50th percentile, 2nd quartile),

#mean, 75th percentile (3rd quartile), and maximum

#for all numeric variables in dataset


#Character values will be listed

summary(data1)

2.2. Multi-level summaries in R

• Now, simple descriptive statistics are useful, but often you will want to do more complex data summaries. For example, we may want to know the mean and standard deviation of yield and protein content for each lentil variety at each farm. With the functions above, this would require a lot of programming to subset the data and calculate the statistics of interest. Thankfully, we can take advantage of the summarize() function to simplify the coding. The function first requires you to group your data (so it knows how to summarize it), which we can do with the group_by() function. Next we tell summarize() to use the grouped dataset and then provide the function or formula that should be applied to the data. Try:

#Group data1 by VARIETY

data1<-group_by(data1,VARIETY)

#Calculate means across multiple variables

summarize(data1,

YIELD.avg=mean(YIELD,na.rm=T),

HEIGHT.avg=mean(HEIGHT, na.rm=T),

PROTEIN.avg=mean(PROTEIN, na.rm=T))

• To demonstrate that you can use summarize() with any formula of interest, let’s calculate the mean and standard error of yield and protein content. Unfortunately, R does not have a built-in function to calculate the standard error, even though this is the statistic we most often want. We can, however, calculate it by dividing the standard deviation by the square root of the number of observations. We can get the number of observations with the function length(). For protein content, we need the number of non-missing observations, which we can get with length(PROTEIN[!is.na(PROTEIN)]). We’ll write the results into a table (data1.sbg) and then export the result as a CSV. Note that this time we will put the group_by() function inside the summarize() call when we specify the dataset (this saves us a few keystrokes):

data1.sbg <- summarize(group_by(data1, VARIETY),

YIELD.avg = mean(YIELD, na.rm=T),

YIELD.se = sd(YIELD)/sqrt(length(YIELD)),

PROTEIN.avg = mean(PROTEIN, na.rm=T),

PROTEIN.se = sd(PROTEIN, na.rm=T)

/sqrt( length( PROTEIN [!is.na(PROTEIN)])))

data1.sbg #view the summary output

#Export output

write.csv(data1.sbg,"lentil_summary.csv")

• You can also assign the groups by 2 variables to get a multi-way summary:

summarize(group_by(data1, FARM, VARIETY),

YIELD.avg = mean(YIELD, na.rm=T),

YIELD.q1 = quantile(YIELD, 0.25),

YIELD.q3 = quantile(YIELD, 0.75),

YIELD.min = min(YIELD, na.rm=T),

YIELD.max = max(YIELD, na.rm=T))


• This may be a little complicated if you are not used to programming, but you can see that it is a powerful way to quickly summarize your data. It’s the numerical equivalent of a multi-factor boxplot.

# Simple multi-factor boxplot

boxplot(YIELD~VARIETY*FARM, data=data1)

2.3. Confidence intervals

• The R functions qnorm() and pnorm() convert between units of standard deviations (or standard errors, in this case) and percentiles (i.e. probabilities) for a normal distribution.

• We know that for a normal distribution, one standard error of the mean for large sample sizes (or one standard deviation of a normally distributed population) is equivalent to the ~68% confidence interval of the mean, because ~34% of values fall within 1 SE on either side of the mean (~68% in total). We can confirm this with the pnorm() command, which gives you the total area under the curve to the LEFT of the number that you specify.

pnorm(1)

pnorm(-1)

qnorm(0.16)

qnorm(0.84)

• We can also calculate the 90% and 95% confidence intervals using qnorm(), which returns a value in standard deviations. Remember also that the confidence interval is spread around the mean, which means that we must deduct HALF of the unwanted area from each side:

qnorm(0.95) #90% confidence interval, right side

qnorm(0.05) #90% confidence interval, left side

qnorm(0.975) #95% confidence interval, right side

qnorm(0.025) #95% confidence interval, left side

• If the mean and standard deviation of our sampling distribution are not 0 and 1 respectively (which is almost always the case), we can still use the pnorm() and qnorm() commands to obtain our confidence intervals simply by entering our mean and standard deviation. For a normally distributed dataset with a mean of 10 and a standard deviation of 4, we can see that a value of 6 sits one standard deviation below the mean (i.e. roughly the 16th percentile):

pnorm(6,mean=10,sd=4) #Can be written more simply as pnorm(6,10,4)

• When we have a very large number of samples, our sampling distribution approaches the shape of a normal distribution. However, when our sample size is smaller, the areas under the distribution curve change. Because of this, we will normally use a Student’s t-distribution to calculate confidence intervals from samples. The commands are basically identical, except that we use pt() instead of pnorm() and qt() instead of qnorm(). However, because the area under the t-distribution is sensitive to sample size, we must also specify our degrees of freedom (n-1). If our sample size is 10, we would use qt() to determine the 95% confidence intervals and pt() to determine the percentile for, say, a standard error of 1.5:

qt(0.025,df=9) #Can be written more simply as qt(0.025,9)

qt(0.975,df=9) #Can be written more simply as qt(0.975,9)

pt(1.5,df=9) #Can be written more simply as pt(1.5,9)

• The t-distribution approaches the normal distribution as the sample size increases.
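You can check this convergence yourself with a quick optional comparison: as the degrees of freedom increase, the t-multiplier for a 95% confidence interval approaches the normal value of about 1.96.

qt(0.975, df=9) #roughly 2.26 with 10 observations
qt(0.975, df=29) #roughly 2.05 with 30 observations
qt(0.975, df=999) #roughly 1.96 with 1000 observations
qnorm(0.975) #about 1.96 for the normal distribution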

• Now, let’s try to calculate the 95% confidence interval of the mean for lentil variety A on Farm1 (in this case we have 4 observations, so n-1 = 3).


VarAF1<- filter(data1, VARIETY=="A"& FARM=="Farm1") #subset data1

mean(VarAF1$YIELD,na.rm=T) #calculate the mean

sd(VarAF1$YIELD)/sqrt(4) #calculate the standard error

qt(0.975,3) #calculate the +/- SDs of the 95% CI

• Then we can use the formula to calculate the confidence intervals around the mean:

#Confidence intervals

mean(VarAF1$YIELD,na.rm=T) + sd(VarAF1$YIELD)/sqrt(4) * qt(0.975,3) #upper

mean(VarAF1$YIELD,na.rm=T) - sd(VarAF1$YIELD)/sqrt(4) * qt(0.975,3) #lower

• I hear you saying “Wow, that seems like a lot of work. I mean, six lines of code? You have to be kidding me.” Well, thankfully there is a shortcut in R. The t.test() function (which we will work with more in Unit 3) also returns confidence intervals for a sample. You can try it out now:

t.test(VarAF1$YIELD, conf.level=0.95) #returns the 95% confidence interval

t.test(VarAF1$YIELD, conf.level=0.90) #returns the 90% confidence interval

So what do you do if you want to calculate a confidence interval and your data are not normal? Well, this is a bit more complicated. Bootstrapped values are the best alternative for obtaining confidence intervals for the mean. The bootstrap follows 3 steps: (1) resample the data with replacement a large number of times (e.g. 1000); (2) calculate the sample mean for each of these resamples; and (3) calculate an appropriate bootstrap confidence interval (BCI). For the last step there are several types of BCI, and it is good practice to calculate several of them and try to understand any discrepancies between them. In R, you can easily implement this idea using the R package boot and some simple function building.

install.packages("boot")

library(boot)

# Create function to obtain the mean

Bmean <- function(data, indices) {

d <- data[indices] # allows boot to select sample

return(mean(d))

}

# Run bootstrapping with 1000 replications

results <- boot(data=data1$YIELD, statistic=Bmean, R=1000)

results # view results

plot(results) # plot results

# get 95% confidence intervals using 4 different BCI methods

boot.ci(results, type=c("norm","basic","perc","bca"))

2.4. Data distributions

Data distributions, a.k.a. probability distributions, are the simplest way of looking at probability in data because the table or equation (illustrated by a curve) links each outcome of a statistical experiment with its probability of occurrence. The mean indicates the most likely observation, illustrated by the apex of the curve. Rare observations are found in the tails of the distribution – i.e. they do not occur very often.

Often statistical tests or functions (like ANOVA or linear regression) require data to follow a normal distribution. However, biological or medical data rarely appear perfectly normal. In the next unit (Unit 3) you will learn how to test whether your data are normally distributed.
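To visualize these ideas (apex at the mean, rare values in the tails), here is a small optional sketch that plots a normal density curve; the mean of 10 and standard deviation of 4 are the same hypothetical values used in the confidence interval examples above:

#Plot a normal density curve with mean 10 and standard deviation 4
curve(dnorm(x, mean=10, sd=4), from=-2, to=22, xlab="Observation value", ylab="Probability density")
abline(v=10, lty=2) #dashed line at the mean (the apex of the curve)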


• If your data are normal (or you want to work with the normal distribution) you can use the following functions:

# Generates n random numbers following the normal distribution
# with a specified mean and standard deviation

rnorm(n, mean, sd)

# Generates density values from the normal distribution at the specified
# quantiles (a single point or a series of points), with a specified mean
# and standard deviation

dnorm(x, mean, sd)

# Generates cumulative distribution (probability) values from the normal
# distribution at the specified quantiles, with a specified mean and
# standard deviation

pnorm(q, mean, sd)

# Generates quantile values from the normal distribution for the specified
# probabilities, with a specified mean and standard deviation

qnorm(p, mean, sd)

• Note that x, q and p can be either a single value or a vector of values
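For example (an illustrative sketch), passing a vector returns one result per value:

dnorm(c(6, 10, 14), mean=10, sd=4) #density at three values at once
pnorm(c(6, 10, 14), mean=10, sd=4) #three cumulative probabilities
qnorm(c(0.025, 0.5, 0.975), mean=10, sd=4) #three quantiles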

• R can also generate values for many alternative distributions. The list below shows the density function syntax for the alternative distribution types available in base R. The same kinds of values and functions that we generated for the normal distribution (above) can also be generated for all of these distributions. The functions for the density/mass function, cumulative distribution function, quantile function and random number generation are named in the form dxxx, pxxx, qxxx and rxxx, respectively, where xxx is the distribution name in R (as listed in the dxxx syntax below). Consult the distributions help file (?Distributions) or the individual distribution help files (e.g. ?dbinom) for further details on these alternative distributions; a short example of the naming pattern follows the list:

o dbeta() which is the beta distribution

o dbinom() which is the binomial (including Bernoulli) distribution

o dchisq() which is the chi-squared distribution

o dexp() which is the exponential distribution

o df() which is the F distribution

o dgamma() which is the gamma distribution

o dhyper() which is the hypergeometric distribution

o dlnorm() which is the log-normal distribution

o dmultinom() which is the multinomial distribution

o dnbinom() which is the negative binomial distribution

o dpois() which is the Poisson distribution

o dt() which is the Student’s T distribution

o dunif() which is the uniform distribution

o dweibull() which is the Weibull distribution
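As a brief example of this naming pattern (the parameter values here are arbitrary and purely illustrative), the binomial, Poisson, and chi-squared families follow the same d/p/q/r convention as the normal distribution:

dbinom(3, size=10, prob=0.5) #probability of exactly 3 successes in 10 trials
pbinom(3, size=10, prob=0.5) #probability of 3 or fewer successes
rpois(5, lambda=2) #5 random counts from a Poisson distribution with mean 2
qchisq(0.95, df=4) #95th percentile of a chi-squared distribution with 4 df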


Unit 3 – Parametric Statistics

Complementary to descriptive statistics, inferential statistics are techniques that allow us to use samples to generalize about the populations from which the samples were drawn. These are the

true analyses in your statistics toolbox.

Parametric statistics are a collection of statistical tests that rely on assumptions about the shape of the distribution in the underlying population (i.e. a normal distribution) and about the parameters (i.e. means and standard deviations) of the assumed distribution. Bottom line: we need the sample data to be normally distributed with homogeneous variance (a.k.a. equal variances) because this collection of statistical tests expects certain characteristics of your data to hold. If these characteristics do not hold, the probability of making observations changes, which can make the results you find with these tests completely incorrect.

Note that parametric statistics are widely considered to be the most powerful at detecting differences between distinct/independent groups or estimating the degree of association between variables, not only because they reference the normal distribution, which has set characteristics, but also because they use the raw data (non-parametric statistics typically use ranked data; see Unit 4). So, if you can analyze your data with a parametric option you should, but if you cannot, don’t worry – see Unit 4.

Example Data

For this unit we will again consider the experiment, where three lentil varieties (A, B, C) are tested at two farms and the total yield (YIELD) and protein content (PROTEIN) within individual plots are

recorded.

We will also use a second dataset that contains soil nitrogen levels, taken at a number of random planting locations (“PLOT”) throughout a harvested lentil stand, prior to (“BEFORE”) and after (“AFTER”) a harvesting treatment.

Note that all the data provided on my personal website are 100% hypothetical – please do not use them for purposes other than the examples in this workbook.

• If you have not already done so, import the lentils.csv dataset into R as data1:

data1<-read.csv("lentils.csv", header=T, na.strings="")

head(data1) #view the first 6 rows of data

tail(data1) #view the last 6 rows of data

str(data1) #view data structure

• If you have not already done so, import the lentils_nitrogen.csv dataset into R as data2:

data2<-read.csv("lentils_nitrogen.csv", header=T, na.strings="")

head(data2) #view the first 6 rows of data

tail(data2) #view the last 6 rows of data

str(data2) #view data structure

3.1. Checking assumptions for parametric statistics

Independence and Randomisation

• We assume, as with all statistical tests, that the experimental units were selected randomly from the population (i.e. that all members of the population had an equal chance of being selected).


• We also assume that experimental units are free to respond independently to treatments in the analysis (i.e. no unit should affect how any other unit responds).

Normality and Homogeneity of Variances (Equal Variances)

All parametric statistical analyses by definition require (or “assume”) the following; therefore, you should always check both of these assumptions before running a t-test, ANOVA, or any other parametric test.

(1) Your data is normally distributed (i.e. it follows the normal distribution)

(2) Your data has homogeneity of variances, also referred to as equal variances (i.e. the data spread of each group is the same)

• Although it is not best practice, you can simply check data normality and homogeneity of variances visually with a boxplot. Note that in this workbook we will use the tidyr, dplyr, and ggplot2 packages for data management and graphics (they are superior to the options in the base package), so you first need to install and load the packages to build the plot, as the example below shows.

#Install graphics package

install.packages(c("ggplot2","tidyr","dplyr"))

library(ggplot2)

library(tidyr)

library(dplyr)

#Single Treatment Analysis (consider only Farm1)

graphics.off()

#Build boxplot of YIELD by VARIETY

ggplot(filter(data1, FARM=="Farm1"), aes(x=VARIETY, y=YIELD)) + geom_boxplot() +

scale_x_discrete(name="Lentil Variety")

#Multiple Treatment Analysis

#Build boxplot of YIELD by FARM and VARIETY to investigate data normality

graphics.off()

ggplot(data1, aes(x=paste(FARM,VARIETY), y=YIELD)) + geom_boxplot()+

scale_x_discrete(name="Farm and Lentil Variety")

• In the boxplots you are looking for roughly symmetrical boxes and whiskers (i.e. normality) and roughly similar-sized boxes among treatments (i.e. homogeneity). Remember, this is a visual check, not a formal statistical check, so it is not as reliable. Here, my concern would be that the assumption of homogeneity of variances is likely violated in both the single and multiple treatment cases.

• Technically, you can also use histograms for each treatment combination to inspect your data. The histograms should show a symmetrical distribution. However, this does not work very well for small sample sizes and can be a bit misleading, so I do not recommend relying on histograms to determine whether you meet parametric assumptions.
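If you do want to look at histograms anyway, here is a small optional ggplot2 sketch (assuming data1 is loaded and ggplot2 is attached); each panel shows one farm-by-variety combination, which also makes the small-sample problem obvious:

#Histograms of YIELD for each farm and variety combination
ggplot(data1, aes(x=YIELD)) + geom_histogram(bins=10) + facet_wrap(~FARM+VARIETY)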

• A better visual check for normality and equal variances is to plot the data residuals. When you insert the linear model lm() function into the plot() function, R will return multiple graphs that you visualize sequentially by repeatedly hitting return in the console window. Here you will see the Residuals vs. Fitted, Normal Q-Q, Scale-Location, and Constant Leverage plots. The first 2 plots are the most important; for now, you can ignore the 4th plot:

#Single Treatment Analysis (consider only Farm1)

graphics.off()

#Plot the residuals


plot(lm(YIELD~VARIETY, data=filter(data1, FARM=="Farm1")))

#Multiple Treatment Analysis

#Plot the residuals

plot(lm(YIELD~FARM*VARIETY, data=data1))

• In the “Residuals vs. Fitted” plot (Plot 1), for a normal distribution we want:

(1) The residuals to “bounce randomly” around the 0 line. This suggests that the assumption that the relationship is linear is reasonable.

(2) The residuals to roughly form a “horizontal band” around the 0 line. This suggests that the variances of the error terms are equal.

(3) No one residual to “stand out” from the basic random pattern of residuals. This suggests that there are no outliers.

• In the “Normal Q-Q” plot (normal quantile plot, Plot 2), for a normal distribution we want:

(1) The theoretical quantiles (x-axis) to closely match the standardized residuals (y-axis). The more this relationship resembles a 1:1 increasing line (starting bottom left and ending top right), the more the data resemble a normal distribution.

• In the “Scale-Location” plot (Plot 3), for homoscedasticity we want:

(1) To clearly see the three treatments (which sit at three different fitted means) showing a similar spread of residuals. A lack of consistency here is a good indication that the variances are not equal.

• Now, if you think relying on “well, I think that plot looks normal?” or “I think I see a horizontal band around the 0 line” is a bit wishy-washy, there are statistical tests you can use to test for significant deviations from normality and homogeneity of variances, although there is a lot of controversy as to how useful these tests really are. The Shapiro-Wilk test (for normality) and the Bartlett test (for equal variances), among others, test the null hypothesis that your data meet the respective assumption. Loosely speaking, this means you want to see a LARGE p-value, which is very different from typical hypothesis testing. But if you get a p-value of 0.2, what does that mean? Is that strong enough evidence that your data are normal? Do you want to accept that? Should you use parametric statistics or not? This is where you need to use good judgement and not rely solely on the numbers!

• If you have multiple treatments (like VARIETY and FARM) you would have to check all possible combinations of treatments for normality:

#Test for Normality

F1VarA<-filter(data1, FARM=="Farm1"&VARIETY=="A")

shapiro.test(F1VarA$YIELD)

F1VarB<-filter(data1, FARM=="Farm1"&VARIETY=="B")

shapiro.test(F1VarB$YIELD)

#...continue with the remaining treatment combinations

F2VarC<-filter(data1, FARM=="Farm2"&VARIETY=="C")

shapiro.test(F2VarC$YIELD)

• INTERPRETATION: The Shapiro-Wilk test suggests that Variety A (p-value = 0.9523) and Variety B (p-value = 0.7594) on Farm 1, as well as Variety A (p-value = 0.2075), Variety B (p-value = 0.795) and Variety C (p-value = 0.6225) on Farm 2, are likely normally distributed. However, the p-value for Variety C on Farm 1 (p-value = 0.08883) is low enough to raise concern that its yields may not be normally distributed.

• Fortunately, the Bartlett test is easier to execute because you don’t have to split the data by every treatment combination – only by the first treatment. If you have more treatments you will still have to test all combinations of the remaining treatments, using code like the following:

#Test for homogeneity of variances


Farm1 <- filter(data1, FARM=="Farm1")

Farm2 <- filter(data1, FARM =="Farm2")

bartlett.test(Farm1$YIELD~Farm1$VARIETY)

bartlett.test(Farm2$YIELD~Farm2$VARIETY)

• INTERPRETATION: The Bartlett test reveals that VARIETY likely has equal variances on both farms (p-values = 0.6689 and 0.5464).

• You can run the tests following the code above, but evaluate the resulting p-values with a critical eye. For example, do you accept the test result for VARIETY A on Farm 2, where the p-value is only about 0.2? If you are unsure, you have a couple of options:

(1) Run your analysis using both parametric and non-parametric techniques. Do you get the same results (i.e. a significant difference between your distinct/independent groups)? If yes, then the parametric versus non-parametric choice doesn’t matter – you have the same outcome. If not, take a critical look at why you cannot simply accept the parametric analysis – what are you willing to accept? You can always switch to non-parametric tests.

(2) Try a moderate data transformation (or a more aggressive one, to really change the shape of your data). Then recalculate your p-values. Did it make a difference? You will likely find that ANOVA is quite robust against violations of the normality assumption. However, you may come across cases where your p-values are off by a degree that you personally cannot tolerate. That’s where you switch to non-parametric tests.

• There are a variety of popular and useful data transformations (arithmetic equations) you can use to change values consistently across the data. This changes the distribution of the data while maintaining its integrity for our analyses. There are 3 main ways to transform data, in order of least to most extreme: (1) take the square root of the values, (2) log-transform the values, and (3) take the inverse of the values. As an example, try transforming PROTEIN and then testing it for normality. Note that in this example we do not care about FARM or VARIETY; we are just practicing how to transform data.

sqrt_PROTEIN <- sqrt(data1$PROTEIN) #Square root transformation

log_PROTEIN <- log(data1$PROTEIN) #Log transformation

inv_PROTEIN <- 1/data1$PROTEIN #Inverse transformation

shapiro.test(sqrt_PROTEIN) #Repeat for other transformations

• Remember, due to the Central Limit Theorem it is MORE IMPORTANT that your data meet the homogeneity of variances assumption than the normality assumption when carrying out parametric statistical tests. F-tests (i.e. ANOVA) are very robust to deviations from normality, especially with large sample sizes (>30), although unequal sample sizes between groups can magnify departures from normality.

3.2. T-tests

T-tests allow us to compare the mean of a group against a known value (one-sample), compare the means of two distinct/independent groups against each other (two-sample), or compare measurements taken on the same individuals before and after an event (paired).

For the following examples, let’s assume our data meet the parametric assumptions of normality and homogeneity of variances (because we need to for t-tests) – well, we are using good judgement and are accepting that the assumptions are met 😊.

Remember that t-tests can be either one-tailed, where the test returns the probability that the mean of a sample is above or below a known threshold value (one-sample) or the mean of another group (two-sample), or two-tailed, which tests whether the mean differs in either direction from a known value (one-sample) or from the mean of another group (two-sample).


One sample T-test

A one-sample t-test compares the mean of a single sample to some known value (usually a hypothesized mean of the population).

• Now let’s determine the probability that the true population mean yield for lentil variety A on Farm 1 is larger than 650. Therefore, H0: the true mean is not larger than 650, and Ha: the true mean is larger than 650. We can test this hypothesis with the t.test() function, specifying our threshold value (mu=) and whether we are testing if the mean is higher (greater), lower (less), or just different (two.sided) than this value with the alternative argument (this is how we distinguish one-tailed and two-tailed tests).

t.test(F1VarA$YIELD, mu=650, alternative="greater")

• INTERPRETATION: In this case our actual t-value is 5.1908 with 3 degrees of freedom, equating to a p-value = 0.006943. This p-value is below our assumed alpha value of 0.05 (you can change this), so this means we can reject our null hypothesis and accept the alternative hypothesis that the true population mean is greater than 650.

Two sample T-test

Now, let’s expand the T-test to two samples where we ask if the true means of two populations are significantly different. Remember that a simple comparison of two means isn’t that meaningful (of course they’re at least slightly different!) unless we also have some idea of the variation (spread) of each sample. Again, we likely don’t know the population parameters, so we are using statistics from

samples to infer about the populations.

• Let’s test if lentil yields for varieties A and B are significant different on Farm 1. Again, we will

use the R objects we made earlier for each of the varieties. In this case we set the alternative

argument to alternative=two.sided as we are testing the difference between two groups.

t.test(F1VarA$YIELD, F1VarB$YIELD, alternative="two.sided")

• INTERPRETATION: In this case the mean of variety A is 727.5 and the mean of variety B is 508. Our actual t-value is 10.613 with 5.9893 degrees of freedom, equating to a p-value = 4.17e-05. This p-value is below our assumed alpha value of 0.05 (you can change this), so we can reject our null hypothesis and accept the alternative hypothesis that variety A and variety B do not come from the same true population, at least when we consider yields at Farm 1.

Paired T-test

Sometimes, two-sample data can come from paired observations, such as measurements taken before and after a treatment application on the same individuals or at the same locations. In a paired t-test, you are controlling some of the variation between individuals or locations by using the SAME individuals or locations for both treatments (usually a control and a treatment). This makes a paired test more powerful when detecting differences.

• Now let’s say we want to know whether the harvesting treatment changed the nitrogen levels in a statistically significant way. You can see how your power to detect differences changes by comparing the standard t-test with the paired version. This is because the paired test accounts for the variance within the BEFORE and AFTER samples (effectively removing “noise” from your analysis):

t.test(data2$BEFORE, data2$AFTER) # Normal t-test

t.test(data2$BEFORE, data2$AFTER, paired=T) # Paired t-test


• INTERPRETATION: When we look at the results of the normal t-test there is no significant difference between the nitrogen levels before and after harvesting took place (p-value = 0.4188). However, when we consider a paired t-test that controls for the variation between individual plots, we can see that harvesting does indeed have a significant impact on the resulting nitrogen levels observed in the plots (p-value = 0.019). Without controlling for the variation explained by individual plots we would have missed the effect of harvesting – which is what we were interested in!
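One way to see why the paired test is more powerful: it is mathematically equivalent to a one-sample t-test on the plot-by-plot differences, so the plot-to-plot variation drops out of the comparison. A quick optional check (assuming data2 is loaded):

#The paired t-test is equivalent to a one-sample t-test on the differences
t.test(data2$BEFORE - data2$AFTER, mu=0) #same t, df, and p-value as the paired test above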

3.3. Analysis of Variance (ANOVA)

Analysis of Variance (ANOVA) is the next level up from a t-test: now we want to compare the means of 3 or more distinct/independent groups. The type of ANOVA selected (one-way, multi-way) depends on how many treatments are included. Note that there can be multiple treatment levels within a treatment; for example, the VARIETY treatment has 3 levels: A, B, and C.

Again, for the following examples, let’s assume our data meets the parametric assumptions of normality and homogeneity of variances – well we are using good judgement and are accepting

that the assumptions are met 😊.

Here the formula definition indicates what kind of ANOVA you want to calculate in R:

Syntax       Meaning
y~A          one-way ANOVA
y~A+B        two-way ANOVA, testing main effects only
y~A*B        two-way ANOVA, testing main effects and all possible interactions
y~A+B+A:B    two-way ANOVA, testing main effects A and B and the interaction A:B

One-way ANOVA

The real strength of ANOVA is its ability to compare more than two samples (or more than two treatment levels). A one-way ANOVA is used when there is only one treatment whose effect is being investigated. The basic idea is to test whether the samples from each treatment level are alike or not. If the means are the same, then there is no difference between the treatment levels and therefore no effect of the treatment. If there is a difference, then the treatment is doing something – i.e. there is an effect!

• So, let’s run the ANOVA to look at the effect of VARIETY (including all three lentil varieties, A, B and C) on lentil YIELD. In this case let’s ignore the effect of FARM; we are simply interested in the effect of VARIETY (one treatment = one-way ANOVA).

#One-way ANOVA to look at effect of variety on yield

output1.lm<-anova(lm(YIELD~VARIETY, data=data1)) #ANOVA option #1

output1.lm #view ANOVA output

output1.aov<-aov(YIELD~VARIETY,data=data1) #ANOVA option #2 (same result)

summary(output1.aov) #view ANOVA output

• INTERPRETATION: With both syntax options, the resulting p-value of 0.9997 is above the accepted threshold of 0.05 (you may want to set it differently); therefore we fail to reject the null hypothesis and we can say there is no significant effect of lentil variety on yield.


Multi-way ANOVA

One major advantage of ANOVA is that it allows us to compare the effect of multiple treatments (multiple independent variables) AND their associated treatment levels. For each treatment we need two or more treatment levels (e.g. for FARM we have Farm1 and Farm2). When we look at the effects, we are interested in the effects of the treatments individually as well as any interactions between the treatments. The assumptions for multi-way ANOVA are, of course, the same as for one-way ANOVA.

In a multi-way ANOVA, the two types of effect we are looking for are:

o Main effect: the effect of an independent variable on the dependent variable.

o Interaction: when the effect of one independent variable on the dependent variable depends on the level of another independent variable.

Interactions fundamentally change the relationship between the independent and dependent variables. For this reason, when we find a significant interaction, we ignore the main effects in our model/output.

• Now let’s consider our larger experiment to determine the effect of both VARIETY and FARM on the resulting lentil yields. We will use the same functions as for the one-way ANOVA, but this time we change the formula syntax to tell R we want a multi-way ANOVA (two-way in this case) with interactions included. Note that the treatment order in the formula does not matter – FARM*VARIETY and VARIETY*FARM give the same result.

#Multi-way ANOVA

output2.lm<-anova(lm(YIELD~VARIETY*FARM,data=data1))

output2.lm #view ANOVA output

output2.aov<-aov(YIELD~FARM*VARIETY, data=data1)

summary(output2.aov) #view ANOVA output

• INTERPRETATION: The output for a multi-way ANOVA is the same as for a one-way ANOVA, but now we have more rows – one for each of the main effects (VARIETY and FARM) and one for the interaction term (VARIETY:FARM). From the output we can see again that lentil variety does not have a significant effect on yield (p-value = 0.9741) and that farm does have a significant effect (p-value < 2.2e-16), BUT – it is not that simple! The interaction between VARIETY and FARM is also significant (p-value = 6.928e-15), and this tells us that the effect of farm is different depending on the variety. In this case we must ignore the results for the main effects (VARIETY and FARM) and continue with pairwise comparisons to learn the true effects of these treatments.

Pairwise comparisons

Something to note about the ANOVA output: it only tells you which treatments are significant. It doesn’t tell you which treatment levels differ (e.g. maybe varieties A and B affect yield, but C does not). So, when you find a significant treatment in your ANOVA, you want to look further to see what’s going on.

The follow-up is a pair-wise comparison (or “contrast”), which is essentially a t-test between different treatment level combinations (e.g. A vs. B, B vs. C, A vs. C). Obviously, when you only have two treatment levels, you don’t need to run a contrast between them (as with the FARM treatment above). Don’t forget that you need to adjust your alpha for every additional comparison that you do (because you are running multiple inferential tests); a small sketch of adjusting p-values directly follows below.
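If you do run the individual t-tests yourself, the base R function p.adjust() can correct a set of p-values for multiple comparisons. A minimal sketch (the p-values below are made up purely for illustration):

#Adjust a set of raw p-values for multiple comparisons
raw.p <- c(0.04, 0.01, 0.20) #hypothetical p-values from three contrasts
p.adjust(raw.p, method="bonferroni") #conservative Bonferroni correction
p.adjust(raw.p, method="holm") #Holm correction (the default method)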

• From the multi-way ANOVA, we learned that FARM is a significant main effect and there is a significant interaction effect for FARM*VARIETY. So, the next step would be to use a pairwise comparison to see what is going on. We could do this by running multiple t-tests for all the possible combinations, but remember we would also have to adjust the resulting p-values (or alpha levels) for multiple comparisons. As you may have suspected, there is a faster way to do this in R. Instead of running multiple t-tests, you can use the TukeyHSD() function, which executes Tukey’s test for Honest Significant Differences (HSD). It will run all the comparisons for every treatment level within the variables that you specify, and it will also automatically adjust your p-values for multiple comparisons:

#Pairwise comparisons

TukeyHSD(aov(YIELD~FARM*VARIETY, data=data1))

• INTERPRETATION: Note that every possible comparison will be output by this test; you only need to consider the comparisons that make logical sense. For example, Farm2:A-Farm1:A compares the effect of FARM on VARIETY A. We can see there is a significant p-value and a difference of -560, meaning that the mean YIELD of VARIETY A is 560 units larger on Farm 1 than on Farm 2. In contrast, the comparison Farm1:B-Farm2:A is not a logical or relevant comparison. In this example all of the pairwise comparisons show a significant difference, with the exception of VARIETY B versus C on Farm 1 and on Farm 2.

• We can also look at the interaction effect visually with an interaction plot with the simple command:

#Interaction plot of varieties across farms

interaction.plot(data1$FARM,data1$VARIETY,data1$YIELD)

#Interaction plot of varieties at farms

interaction.plot(data1$VARIETY,data1$FARM,data1$YIELD)

• INTERPRETATION: When the lines in an interaction plot cross, that is a strong sign of an interaction. Looking at the first plot, we can see that Farm 1 has greater yields across all varieties, but the effect is magnified for VARIETY A. In fact, in the second plot we can see that VARIETY A generates the highest yield on Farm 1 but the lowest yield on Farm 2.
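If you prefer ggplot2 graphics, a roughly equivalent interaction plot can be sketched as follows (assuming data1 is loaded and dplyr and ggplot2 are attached; this is optional and not part of the exercises):

#ggplot2 version of the interaction plot: mean YIELD by FARM and VARIETY
yield.means <- summarize(group_by(data1, FARM, VARIETY), YIELD.avg=mean(YIELD, na.rm=T))
ggplot(yield.means, aes(x=FARM, y=YIELD.avg, group=VARIETY, colour=VARIETY)) + geom_point() + geom_line()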

3.4. Analysis of Covariance (ANCOVA) & Blocking

Analysis of covariance combines analysis of variance and regression. You use it when you have a covariate that influences your dependent variable and that you cannot control experimentally. For example, in the lentil field trial you may have also measured the natural variation in soil nitrogen levels. You can measure it but you can’t do much about it, so it is a “covariate”: we expect it to vary with our dependent variable (even if the relationship is very weak).

This covariate accounts for a small amount of “noise” in your data. If you remove this error by identifying it, the signal-to-noise ratio in a statistical test, such as the F-test or t-test, becomes larger, making your test more powerful. We will try this by introducing a continuous variable as a covariate into our ANOVA model.

• First, let’s make a new dataset from the lentil data so that we’re only working on Farm1. Then add the NITROGEN and BLOCK columns:

# Remove Farm2 data

data.F1 <- filter(data1, FARM=="Farm1")

# Add the NITROGEN data

data.F1$NITROGEN = c(2.2,3.4,1.9,4.3,3.8,1.1,4.9,2.5,5.7,1.0,3.2,2.0)

# Add the BLOCK data

data.F1$BLOCK =

c("B2","B3","B1","B4","B3","B1","B4","B2","B4","B1","B3","B2")


• Explore the effect of adding a covariate, NITROGEN in this case. If your covariate accounts for any portion of the variance (significant or not), your p-values for the treatment effect should become smaller (i.e. your test becomes more powerful at detecting a difference). Because we don’t want to include the covariate in any interactions, we add NITROGEN with a plus sign (+), not an asterisk (*).

# Determining the effect of a covariate

summary(aov(YIELD~VARIETY, data=data.F1))

summary(aov(YIELD~VARIETY+NITROGEN, data=data.F1))

• INTERPRETATION: Have a look at how the p-value for VARIETY changes after the inclusion of the covariate. Also take note of the values of the residual sums of squares and mean squares. Notice that, while the SS and MS for VARIETY do not change (the “signal”), the inclusion of NITROGEN accounts for part of the error, lowering the residual SS and MS and thus increasing the f-ratio and decreasing the corresponding p-value.

• Note that we don't have to tell the aov() procedure which variables are treatments and which are covariates. It distinguishes between them by means of the variable type. A numeric variable will be treated as a covariate, whereas treatments must be factors (i.e. categories). If you entered your treatment levels as numbers (e.g. 1 and 2 rather than Farm1 and Farm2), you may have to convert them to a factor with the as.factor() command. You can also use as.numeric(), as.character(), or many others to convert between variable types. If you ever want to see the format of your variables in R, you can use the structure command, str().
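For example, here is a minimal sketch (using a hypothetical numeric coding of VARIETY, not a column in the real dataset) of converting a numeric treatment code into a factor and checking the result with str():

# Hypothetical numeric coding of the VARIETY treatment
data.F1$VARIETY.CODE <- as.numeric(factor(data.F1$VARIETY)) # e.g. 1, 2, 3 instead of A, B, C
data.F1$VARIETY.CODE <- as.factor(data.F1$VARIETY.CODE) # convert to a factor before passing to aov()
str(data.F1) # confirm the type of every column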

Blocks are typically generated by subdividing your experimental site into spatial units with similar conditions, particularly if you know there is an environmental gradient across your experimental site. However, blocks can also be imposed as a “time” factor or as an “observer” factor, if you can’t do all measurements/observations by yourself or at one time.

• Explore the effect of adding a block structure. If your blocking accounts for any portion of the variance, your p-values for the treatment effect should become smaller (i.e. your test becomes more powerful to detect a difference). Generally, the stronger the environmental gradient, the more variance you would expect your blocks to account for. Remember, we don’t want to include the BLOCK term in any interactions, so we always add it with a plus sign (+), not an asterisk (*).

# Determining the effect of a block

summary(aov(YIELD~VARIETY, data=data.F1))

summary(aov(YIELD~VARIETY+BLOCK, data=data.F1))

• INTERPRETATION: Notice that the difference between the outputs here is just like the difference between the ANOVA and ANCOVA outputs: the added "random" variable (the covariate, or BLOCK) accounts for some of the error. But take note of the degrees of freedom for BLOCK compared to NITROGEN above. A covariate has only one degree of freedom whereas a block (just like any other "treatment") has k-1 degrees of freedom, where k is the number of levels (number of different blocks, in this case). In this case, like NITROGEN, the inclusion of BLOCK accounts for part of the error, lowering the residual SS and MS and thus increasing the F-ratio and decreasing the corresponding p-value.


Unit 4 – Non-Parametric Statistics

Non-parametric statistics are a collection of statistical tests that rely on no or few assumptions about the shape or parameters of the population distribution from which the sample was drawn. These tests serve as our fall back when our data simply cannot meet the assumptions of their parametric counterparts.

So, you may be asking yourself: if there are no assumptions, why don't we always use non-parametric tests? Although non-parametric tests have the very desirable property of making fewer assumptions about the distribution from which the sample was drawn, they do have two main drawbacks.

First, they are generally considered to be less statistically powerful at detecting differences between distinct/independent groups or at estimating the degree of association between variables, because they are based on ranked data (parametric statistics use the raw data). Therefore, if you are planning a study and trying to determine how many patients to include, a non-parametric test will require a slightly larger sample size to have the same power as the corresponding parametric test.

Second, because non-parametric tests use rankings of values in the data rather than the raw data itself, interpreting non-parametric results can be more difficult. For example, knowing that the difference in mean ranks between two groups is ten does not really help our intuitive understanding of the data.

Bottom line – non-parametric statistics are not a perfect solution (remember, nothing in statistics is perfect), but they do offer a good alternative when parametric statistics simply cannot be used.

Example Data

For this unit we will consider the lentil experiment in a new way – with two categorical independent variables: three varieties of lentils (VARIETY) and two different site moisture conditions (SITE). Here we are going to consider the average height (HEIGHT) of the lentil plants in each plot, measured at the end of the season.

Note that all the data provided on my personal website is 100% hypothetical – please do not use it for purposes other than the examples in this workbook.

• If you have not already done so, import the lentils.csv dataset into R as data1:

data1<-read.csv("lentils.csv", header=T, na.strings="")

head(data1) #view the first 6 rows of data

tail(data1) #view the last 6 rows of data

str(data1) #view data structure

• We can use boxplots to see that the data is not normal for many treatment levels. We will assume for this lab that we cannot or do not want to transform the data (perhaps we want to maintain the original units). You can also use a Shapiro test to verify quantitatively that the data is not normal in many cases. You will need to have the ggplot2 package installed to run the following code.

library(ggplot2)

#Build boxplot of HEIGHT by VARIETY

ggplot(data1, aes(x=VARIETY, y=HEIGHT)) + geom_boxplot() +

scale_x_discrete(name="Lentil Variety")

#Build boxplot of HEIGHT by SITE

ggplot(data1, aes(x=SITE, y=HEIGHT)) + geom_boxplot() +


scale_x_discrete(name="Planting Site")

#Shapiro Test to investigate normality

shapiro.test(data1$HEIGHT[data1$VARIETY=="A"])

shapiro.test(data1$HEIGHT[data1$VARIETY=="B"])

shapiro.test(data1$HEIGHT[data1$VARIETY=="C"])

4.1. Wilcoxon Rank Sum Test

Alternative to t-test when distributions are similarly shaped.

The Wilcoxon test (aka. Mann-Whitney test) is very common and is the non-parametric equivalent of a t-test. However, you should only use this test when your distributions are similarly shaped (e.g. when your samples are skewed in the same direction). To start, we will compare just the two levels of the SITE treatment: xeric (dry) and mesic (moist). We will do this first with a t-test for comparison, then with a Wilcoxon test.

• First, we will run a one-sample t-test equivalent using both the t-test and the Wilcoxon test in R. The wilcox.test() function works the same as the t.test() function, where you include the alternative argument as "greater" or "less" for a one-tailed test and "two.sided" for a two-tailed test. We will compare a height measurement of 0.46 to the rest of the xeric heights. How do the results of the two tests compare?

#Subset data by site type

data.x<-filter(data1, SITE=="xeric")

#Compare results from a one-sample t-test and wilcox test

t.test(data.x$HEIGHT, mu=0.46, alternative="two.sided")

wilcox.test(data.x$HEIGHT, mu=0.46, alternative="two.sided")

• INTERPRETATION: The p-value for the Wilcoxon test (p-value = 0.0004883) is slightly higher than for the t-test (p-value = 4.197e-05) because the non-parametric test relies on ranked data. More importantly, however, the non-parametric p-value is below our assumed alpha value of 0.05 (you can change this), so we can reject our null hypothesis and accept the alternative hypothesis that the true population mean is not equal to 0.46.

• Now, run the non-parametric equivalent of a two-sample t-test using the Wilcoxon test. (This is often referred to also as a Mann-Whitney test.) Compare the lentil heights between the xeric and mesic sites with a t-test (which assumes normality for both samples) and with a Wilcoxon test. What do you notice?

#Subset data by site type

data.m<-filter(data1,SITE=="mesic")

#Compare results from a two sample t-test and wilcox test

t.test(data.x$HEIGHT,data.m$HEIGHT)

wilcox.test(data.x$HEIGHT,data.m$HEIGHT)

• INTERPRETATION: In this case the non-parametric test indicates there is no significant difference (p-value = 0.9539) between the height of lentil plants on xeric and mesic sites at alpha value of 0.05 (you can change this).

• You may notice R gives you a warning "cannot compute exact p-value with ties". This means you have two identical values in your data; these are called ties. When the data is ranked, identical values effectively "share" a rank, so the ranks are no longer unique and exact p-values cannot be calculated. In most cases this is not a problem; R just likes to warn you.


• We can also run a paired test with a Wilcoxon test (but not with these data), just as we did with a t-test. To do this, we would simply add the option paired=TRUE to the command (just like a t-test).

• Note that there is an exact option you can specify. By default, exact p-values are computed when your sample sizes are not large (and there are no ties). If you disable it (set exact=FALSE), the procedure will use a normal approximation as a computational shortcut. You may need to do this for large samples; otherwise the test may run for hours or days! A short sketch of both options follows.
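Here is a minimal sketch of both options, using a small hypothetical set of paired before/after height measurements (not part of the lentil data):

# Hypothetical paired measurements
before <- c(0.42, 0.51, 0.47, 0.55, 0.60, 0.48)
after <- c(0.45, 0.55, 0.46, 0.61, 0.66, 0.50)
wilcox.test(before, after, paired=TRUE) # paired Wilcoxon (signed-rank) test
wilcox.test(before, after, exact=FALSE) # force the normal approximation instead of exact p-values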

4.2. Kolmogorov-Smirnov Test

Alternative to t-test when distributions are of different shape.

The Wilcoxon rank sum test is not suitable if your distributions/samples do not look roughly similar in a histogram (i.e. they don't have similar skew). In that case, you can use the (again, less powerful) Kolmogorov-Smirnov test.

• Coding in R for the Kolmogorov-Smirnov test, using the ks.test() command, uses basically the same syntax as the t.test() and wilcox.test() commands. We'll try using it to compare the same samples (xeric and mesic) as above. As far as I know, there is no way to run a one-sample test against a single value with the K-S test (but this is no problem, as you can always use a Wilcoxon test for that).

# Two-sample, two-tailed test

ks.test(data.x$HEIGHT,data.m$HEIGHT)

# Two-sample, one-tailed test

# Note that the order of the variables must be reversed (or use "less")

ks.test(data.x$HEIGHT, data.m$HEIGHT, alternative="greater")

• INTERPRETATION: As in the previous case, the non-parametric p-values indicate there is no significant difference between the heights of lentil plants on xeric and mesic sites, for either the two-tailed test (p-value = 0.9963) or the one-tailed test (p-value = 0.7165).

• Note again that R gives you a warning “cannot compute exact p-value with ties”, which you can ignore in this case (see previous explanation).

4.3. Kruskal-Wallis Test

Alternative to one-way ANOVA for non-normal distributions.

The Kruskal-Wallis test is an extension of the Wilcoxon rank sum test to more than two treatment levels, just as a parametric ANOVA is a kind of extension of the parametric t-test. However, the K-W test may only be used as an equivalent of a one-way ANOVA. It is not capable of comparing more than one treatment (more than one factor).

• We can use the K-W test to compare the heights of the lentil varieties and compare those results to a parametric ANOVA output. How do the outputs from these tests compare?

# K-W test

kruskal.test(HEIGHT~VARIETY,data=data1)

# Compare the output to a parametric ANOVA

summary(aov(HEIGHT~VARIETY, data=data1))

• INTERPRETATION: As in ANOVA, a significant treatment effect indicates that at least one population median differs from another (we compare medians, as means are somewhat meaningless in skewed data). In this case, there is a significant difference in lentil height among the three varieties (p-value = 0.01043). So, it must be followed up by pairwise Wilcoxon tests comparing each of the treatment levels, and because of that you will have to manually adjust your p-values when you run multiple comparisons.

#Subset data for pairwise comparisons

VarA<-filter(data1, VARIETY=="A")

VarB<-filter(data1, VARIETY=="B")

VarC<-filter(data1, VARIETY=="C")

#Pairwise wilcox tests to determine height difference among lentil varieties

wilcox.test(VarA$HEIGHT, VarB$HEIGHT, alternative="two.sided")

wilcox.test(VarA$HEIGHT, VarC$HEIGHT, alternative="two.sided")

wilcox.test(VarB$HEIGHT, VarC$HEIGHT, alternative="two.sided")

#Adjust p-values for multiple comparisons

p.adjust(c(1, 0.09265, 0.0001554), method="bonferroni",n=3 )

• INTERPRETATION: The adjusted p-values from the multiple comparisons indicate that there is no difference in height between VARIETY A and B (p-value = 1) or VARIETY A and C (p-value = 0.27795), but there is a significant difference in height between VARIETY B and C (p-value = 0.0004662).

4.4. Permutational Tests

Alternative to parametric statistics – without the loss of statistical power!

There is a clever alternative to sums-of-squares based ANOVA that compares groups with distance measures and does not require any assumptions about distributions. Conceptually, this is very similar to ANOVA. In ANOVA, the F-value is the ratio of variance between groups (signal) to variance within groups (noise). The bigger the signal-to-noise ratio, the larger the F-value, which can be converted to a p-value.

This also works with distances: instead of an F-value, you can calculate a Delta-value (or an equivalent statistic that goes by some other name), which is the ratio of distances between observations in different groups (signal) to distances within groups (noise). This gives you a signal-to-noise ratio – the Delta-value (equivalent to an F-value). Next, you need to compare this to an expected distribution of that value (like an F-distribution).

The problem is that we don't know what that distribution looks like for non-normal data. So, the permutational ANOVA generates this distribution empirically, by randomly shuffling the class variable within the dataset. Now that the class variable has been randomized relative to the actual observations, you calculate a new randomly expected Delta-value. You randomize (permute) your class variable again, and calculate another Delta-value.

Repeat 10,000 times and you get a Delta distribution specific to your type of data, which you can use to get the p-value that indicates significance between groups. This is simply the position (percentile) of your original Delta-value on the distribution of 10,000 values that you generated. Hence, you assume nothing about the shape of the distribution. You effectively generate a known distribution from your data, which you can test against – this means you can still use raw data rather than ranked data like in other non-parametric tests. You keep your statistical power! A conceptual sketch of the shuffling idea is shown below.
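To make the idea concrete, here is a rough sketch of the shuffling procedure done "by hand" for a single factor (it assumes data1 is loaded; the lmPerm package used below handles all of this for you):

# Manual permutation sketch: shuffle the FARM labels and rebuild the distribution of the test statistic
obs.F <- summary(aov(YIELD~FARM, data=data1))[[1]][1, "F value"] # observed signal-to-noise ratio
perm.F <- replicate(1000, { # increase to 10,000 for a smoother distribution
  shuffled <- data1
  shuffled$FARM <- sample(shuffled$FARM) # randomly re-assign the class labels
  summary(aov(YIELD~FARM, data=shuffled))[[1]][1, "F value"]
})
mean(perm.F >= obs.F) # empirical p-value: position of the observed value in the permuted distribution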

Permutational Analysis of Variance

• Let's go back to the original lentil dataset with two farms (data1). To start, let's run a regular ANOVA for reference:

anova(lm(YIELD~FARM*VARIETY, data=data1))


• Next, let's run the permutational ANOVA as an alternative, which requires that you install and load the R package lmPerm. We can use the aovp() function to run a permutational ANOVA with the same formatting as we would use with the parametric aov() or anova() functions. Note that we include the argument seqs=T, which calculates sequential sums of squares, just like under the default setting of a regular ANOVA implemented with the lm() function – a good choice for balanced designs.

# Permutational ANOVA

install.packages("lmPerm")

library(lmPerm)

summary(aovp(YIELD~FARM*VARIETY, data=data1,seqs=T))

• INTERPRETATION: The output from the permutational ANOVA looks similar to the parametric ANOVA output; the key difference is that rather than F values we now have Iter values, which indicate the number of iterations ("swapping" and recalculating) that R had to do until the stopping criterion was met. In this case R had to go through 5000 iterations to determine that FARM and the interaction FARM:VARIETY were significant, but far fewer to determine that VARIETY was not significant (p-value = 0.9412). Notice that the sums of squares, mean sums of squares, and p-values are close if not identical between the parametric ANOVA and the permutational ANOVA – this is likely because a parametric test was appropriate for this dataset, but it also illustrates the point that you do NOT lose statistical power when you use a permutational ANOVA.

• One of the best parts of using a permutational ANOVA is that you can follow it up with pairwise comparisons, just like in a regular ANOVA:

output3.p<-aovp(YIELD~FARM*VARIETY, data=data1,seqs=T)

TukeyHSD(output3.p)

• INTERPRETATION: Like we found with the parametric pairwise comparisons, all combinations are significantly different with the exception of VARIETY B versus C on Farm 1 and on Farm 2.

Permutational T-Test

• By sub-setting the data to just two treatments, we can of course implement a permutational t-test as well. The first two lines below subset the data. Then we carry out a regular t-test for reference, followed by the permutational version, which does not make any assumptions about normality or homogeneity of variances.

#Subset data

VarAB.F1=filter(data1,FARM=="Farm1"&VARIETY!="C")

head(VarAB.F1) #view first 6 rows of new data

#Regular t-test for reference, then the permutational equivalent

t.test(VarAB.F1$YIELD~VarAB.F1$VARIETY)

TukeyHSD(aovp(VarAB.F1$YIELD~VarAB.F1$VARIETY, seqs=T))


Unit 5 – Binomial Statistics

Binomial data occur when your data has two mutually exclusive classes. Some common examples of binomial responses include presence/absence of a species, survival/mortality, simple yes/no responses, pass/fail, infected/uninfected with disease, etc.

Tests for binomial data also represent a useful alternative to non-parametric tests if your data is hopelessly skewed or non-normal in some way. For example, data of species frequencies that contain a very high number of zeros can be impossible to normalise, but can easily be converted into presence/absence data for analysis (see the short sketch below). Tests for binomial data are relatively powerful, so even reducing data to a binomial format can yield good results (vs. using non-parametric tests).
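As a minimal sketch of that conversion (using a hypothetical vector of species counts, not one of the workbook datasets):

# Hypothetical frequency counts with many zeros
counts <- c(0, 0, 3, 0, 12, 1, 0, 7)
presence <- as.numeric(counts > 0) # 1 = present, 0 = absent
presence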

Example Data

For this unit we will consider a new dataset that includes the survival (SURVIVAL) – recorded as 1 (YES) and 0 (NO) – of the three different lentil varieties (VARIETY). The fertilizer combination that was applied to each plot was also recorded by component (NITROGEN and PHOSPHORUS).

Note that all the data provided on my personal website is 100% hypothetical – please do not use it for purposes other than the examples in this workbook.

• If you have not already done so, import the lentils_survival.csv dataset into R as data3:

data3<-read.csv("lentils_survival.csv", header=T, na.strings="")

head(data3) #view the first 6 rows of data

tail(data3) #view the last 6 rows of data

str(data3) #view data structure

5.1. Setting Up the Data

Your data typically needs to be in a particular format to use binomial tests. This format is somewhat different from the format we have been using so far in class. For this lab, we will rework the lentil survival data (data3), which records survival (1) or death (0) for each observation of the three lentil varieties.

• Re-arrange the data so that we have one column with the variety levels (VARIETY), one with the total number that survived (ALIVE), and one with the total number that did not survive (DEAD). You can do this manually in R, but it takes a fair amount of code. Thankfully, we can use the summarize() function in the dplyr package to break the data down by the VARIETY variable, then summarise it by counting the SURVIVE values, excluding the 0s for the ALIVE column and the 1s for the DEAD column. The table() function nested in the summarize() function performs the counting. You will need to install and load the tidyr and dplyr packages to execute the following code.

# Install and load the required packages

install.packages(c("dplyr", "tidyr"))

library(dplyr)

library(tidyr)

#Assign groups to data

data3<- group_by(data3, VARIETY)

# Create a data table of the three species with counts of ALIVE & DEAD

#as well as total observations

data.AD <- summarize(data3, ALIVE=table(SURVIVE, exclude=0),

DEAD=table(SURVIVE, exclude=1), N=n())


data.AD # view new data table

5.2. Z-test for Proportions

This test is not to be confused with the calculation of a z-score, which tells us how a single value fits into a normal distribution. The Z-test for proportions is the binomial equivalent to a one-sample t-test, as it compares a sample of binomial counts (the observed proportion) to an expected proportion.

• You can execute a Z-test for proportions with the z.score.pval() function in the corpora package – you will have to install and load this package. This function requires that we provide three numbers: the frequency (count) of a given response, the total number of observations (the sum of both responses in a binomial response), and the expected proportion (the null hypothesis). Let's use the test to determine if the survival rate for each variety is different from 50% (i.e. p = 0.5). The test works like the t.test() function, so in this case "different" means the alternative is "two.sided".

# Install and load required package

install.packages("corpora")

library(corpora)

#Subset varieties

data.A <- filter(data.AD, VARIETY=="A")

data.B <- filter(data.AD, VARIETY=="B")

data.C <- filter(data.AD, VARIETY=="C")

# Calculate if survival of each species is different than 0.5

z.score.pval(data.A$ALIVE, data.A$N, p=0.5, alternative="two.sided")

z.score.pval(data.B$ALIVE, data.B$N, p=0.5, alternative="two.sided")

z.score.pval(data.C$ALIVE, data.C$N, p=0.5, alternative="two.sided")

• INTERPRETATION: Based on the p-values from each test, variety A (p-value = 0.502335) and variety C (p-value = 0.2683816) are not significantly different from 50% survival; however, variety B is significantly different (p-value = 0.0007962302) at an alpha level of 0.05 (you can change this).

• Just like we did with t.test() and the non-parametric alternatives, we can also run a one-tailed test with this function. If we suspect that the survival rate for variety B is greater than 50%, we can specify this as our alternative hypothesis.

# One-tailed z-test for proportions for variety B

z.score.pval(data.B$ALIVE, data.B$N, p=0.5, alternative="greater")

• INTERPRETATION: In this case our suspicion was correct; the test says we can reject the null hypothesis and conclude that the survival rate for variety B is greater than 50% (p-value = 0.006353148).

5.3. Chi-Squared Test

First, this test is pronounced "kai" as in the Greek letter χ (chi), not "chai" like the tea, nor "chee" like the Zen life force.

This test applies if you have two or more treatments and is equivalent to an ANOVA but with binomial response data.

• The chi-squared test can be executed with the chisq.test() function; however, note that the function uses a matrix as its input data, which is simply a data table that contains only numbers. Let's see if there is a significant difference in survival among the lentil varieties.

# Assign row names to data table – matrix will lose the VARIETY column


rownames(data.AD)<- data.AD$VARIETY

#Make a matrix object for analysis

data.AD.mat <- as.matrix(data.AD[,c("ALIVE", "DEAD", "N")])

# Chi-squared test for difference in survival among varieties

output.chi<-chisq.test(data.AD.mat)

output.chi # view test output

• INTERPRETATION: In this case we are able to reject the null hypothesis that the probability of survival among all varieties is the same (p-value = 0.02795).

• From the test output object, you can then ask for a bit more information about the test, such as the table of observed values, the table of expected values (expected), the p-value (p.value), etc. Have a look at the help file for the chisq.test() function under the "Value" heading.

# View the elements of the chi-squared test

output.chi # view the test output as normal

output.chi$p.value # returns only the p-value

output.chi$statistic # returns only the chi-squared value

output.chi$observed # returns the table of observed counts

output.chi$expected # returns the table of expected counts

• Like with ANOVA or non-parametric analyses, if you find a significant difference you should go ahead and compare the individual varieties (A vs. B, A vs. C, B vs. C) to see which differ. You can do this with the prop.test() function, which tests for equal or given proportions between two groups. Like the chi-squared test, this test again requires data in matrix form. Also remember that if you are doing multiple comparisons you need to adjust your p-values accordingly.

# Convert data into 2-variety matrices – selected by row and column values

ABtab=as.matrix(data.AD[1:2,2:3])

ACtab=as.matrix(data.AD[c(1,3),2:3])

BCtab=as.matrix(data.AD[2:3,2:3])

#Proportion test for each 2 variety matrix

prop.test(ABtab)

prop.test(ACtab)

prop.test(BCtab)

#Adjust p-values for multiple comparisons

p.adjust(c(0.00285, 0.2343, 0.03647), method="bonferroni", n=3)

#Alternative syntax (proportional test and p-value adjustment at once)

p.adjust(c(prop.test(ABtab)$p.value, prop.test(ACtab)$p.value,

prop.test(BCtab)$p.value), method="bonferroni", n=3)

• INTERPRETATION: The adjusted p-values from the multiple comparisons indicate that there is no difference in survival between VARIETY A and C (p-value = 0.7029) or VARIETY B and C (p-value = 0.10941), but there is a significant difference in survival between VARIETY A and B (p-value = 0.00855).

• It should also be noted that chi-squared tests are considered somewhat unreliable when you have expected counts smaller than 5 (in any group). If this is the case, you should use Fisher's Exact Test with the fisher.test() function, which works exactly like chisq.test(). If you are interested, look at the help file ?fisher.test for more information. I would also encourage you to have a look through the help files for the chisq.test(), prop.test(), and z.score.pval() functions, as there are many options to work with. A short sketch of fisher.test() on these data follows.
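As a minimal sketch of that alternative (assuming the data.AD table built in Section 5.1 is available):

# Fisher's Exact Test on the ALIVE/DEAD counts only
surv.tab <- as.matrix(data.AD[, c("ALIVE", "DEAD")])
fisher.test(surv.tab) # exact test of independence between VARIETY and survival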


Unit 6 – Basic Regression in R

Regression and correlation are not "statistical tests", but they do deserve a spot in your statistical toolbox, as these analyses allow the associations between predictor and response variables to be quantified and mapped.

Example Data

For this unit we will consider a few more variables recorded in the lentil experiment along with the variables we have previously looked at: plot density (DENSITY), the number of plants in each plot, and the amount of fertilizer applied to the plot (FERTILIZER). We will again consider the average height (HEIGHT) of the lentil plants in each plot, measured at the end of the season.

We will also consider the binomial dataset that includes the survival (SURVIVAL) – recorded as 1 (YES) and 0 (NO) – of the three different lentil varieties (VARIETY). The fertilizer combination that was applied to each plot was also recorded by component (NITROGEN and PHOSPHORUS).

Last, we will also use a dataset that contains multiple columns of non-linear data: Y1 and Y2 are the growth of a bacterial culture over time, Y3 is the timber volume of a forest stand over time, Y4 is photosynthetic rate as a function of light, Y5 is the nitrogen fixation rate of symbiotic bacteria as a function of fertilizer concentration, and Y6 and Y7 are survival as a function of temperature for two species.

Note that all the data provided on my personal website is 100% hypothetical – please do not use it for purposes other than the examples in this workbook.

• If you have not already done so, import the lentils.csv dataset into R as data1:

data1<-read.csv("lentils.csv", header=T, na.strings="")

head(data1) #view the first 6 rows of data

tail(data1) #view the last 6 rows of data

str(data1) #view data structure

• If you have not already done so, import the lentils_survival.csv dataset into R as data3:

data3<-read.csv("lentils_survival.csv", header=T, na.strings="")

head(data3) #view the first 6 rows of data

tail(data3) #view the last 6 rows of data

str(data3) #view data structure

• If you have not already done so, import the non-linear.csv dataset into R as data4:

data4<-read.csv("non_linear.csv", header=T, na.strings="")

head(data4) #view the first 6 rows of data

tail(data4) #view the last 6 rows of data

str(data4) #view data structure

6.1. Outliers

Outliers in data are important to consider because they can distort predictions and affect the accuracy of your analysis and interpretations. While they are important to consider for all analyses, outliers can have an especially big impact on correlations and regressions because these analyses depend heavily on the central tendency of the data. Outliers can therefore easily nullify an otherwise strong association.


Note that "outlier" is not actually a statistical result; it is a descriptive statistic – you are simply describing certain data points as outliers. Typically, for a given continuous variable, outliers are those observations that lie more than 1.5 * IQR below the 25th (first) or above the 75th (third) quartile, where the IQR, the 'Inter Quartile Range', is the difference between the two quartiles (remember the IQR() function in Unit 2). Sometimes you will hear the terms "moderate outlier" and "far outlier"; these just quantify how big an outlier the point is. A "moderate outlier" meets the 1.5*IQR definition, whereas a "far outlier" lies beyond 3*IQR. One classification is not better than the other – you still have an outlier to deal with – but far outliers are generally harder to transform and deal with. A short sketch of both sets of fences is shown below.
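As a minimal sketch of those two fences (assuming data1 is loaded; HEIGHT is used as the example variable):

# Moderate (1.5*IQR) and far (3*IQR) outlier fences for HEIGHT
q1 <- quantile(data1$HEIGHT, 0.25, names=FALSE)
q3 <- quantile(data1$HEIGHT, 0.75, names=FALSE)
iqr.val <- IQR(data1$HEIGHT)
c(q1 - 1.5*iqr.val, q3 + 1.5*iqr.val) # moderate outlier fences
c(q1 - 3*iqr.val, q3 + 3*iqr.val) # far outlier fences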

• The easiest way to detect whether your continuous variable has an outlier is to look at a boxplot. By definition, the box represents the interquartile range, and R will identify the outliers in your variable as dots on your boxplot. Note that it will not tell you if they are "moderate outliers" or "far outliers", just that they meet the 1.5*IQR minimum criterion. You will have to install and load the ggplot2 package (along with the tidyr and dplyr packages – we will use them later) to execute the following code.

# Install packages for graphics

install.packages(c("ggplot2", "tidyr", "dplyr"))

library(ggplot2)

library(tidyr)

library(dplyr)

# Boxplot to identify outliers

graphics.off()

ggplot(data=data1, aes(x=SITE,y=HEIGHT)) + geom_boxplot()

• INTERPRETATION: In this case the boxplot does not show us any outliers (no dots). To illustrate what we want to see, let's make a new object and add an outlier.

#Subset data and add outliers

data.m.out<-filter(data1, SITE=="mesic")

data.m.out$HEIGHT[1] <- 10 #create an upper height outlier

#Calculate the outlier bounds for mesic

IQR.fence <- 1.5*IQR(data.m.out$HEIGHT) #outlier criteria

lower.fence <- quantile(data.m.out$HEIGHT, 0.25) - IQR.fence

upper.fence <- quantile(data.m.out$HEIGHT, 0.75) + IQR.fence

paste("Outlier fence", lower.fence, upper.fence, sep=",") #view outlier fence

# Boxplot to identify outliers

graphics.off()

ggplot(data=data.m.out, aes(x=SITE,y=HEIGHT)) + geom_boxplot()

• INTERPRETATION: We can see the outlier (10) that we added lies outside of the outlier range of mesic height (0.44625 to 0.61625). When we use a boxplot on this new data we clearly see the outlier identified as a black dot. Test yourself to determine if this outlier can be considered a “moderate” outlier or a “far outlier”.

• Now, if you are more inclined to use a statistical test than visuals to identify data outliers, you can use a specific chi-squared test for outliers, which tests the null hypothesis that the most extreme value is not an outlier (the alternative hypothesis is that it is an outlier). The chi-squared test can be considered better than some alternatives because it can handle larger sample sizes and lets you specify the variance in your data (it does not assume normality), which increases its sensitivity to mild outliers. You will need to install and load the outliers package to access the chisq.out.test() function. By default the function tests the value furthest from the mean (the maximum here, since we added a large outlier); if you set the argument opposite=TRUE, it tests the extreme value on the opposite side of the distribution instead (the minimum here). Remember that outliers can happen on both sides of your data set.

#Install required package

install.packages("outliers")

library(outliers)

#Chi-squared test for outliers

chisq.out.test(data.m.out$HEIGHT, variance=var(data.m.out$HEIGHT), opposite=TRUE)

chisq.out.test(data.m.out$HEIGHT, variance=var(data.m.out$HEIGHT), opposite=FALSE)

• INTERPRETATION: The first chi-squared test for outliers indicates we cannot reject the null hypothesis (p-value = 0.4011), meaning the minimum height on the mesic sites is not an outlier. However, when we consider the maximum height on the mesic sites (the outlier we added) we are able to reject the null hypothesis (p-value = 0.004697) confirming our findings from the boxplot.

• Note that there are MANY other tests to determine data outliers (e.g. Grubbs' Test, Dixon's Test, Cochran's Test, etc.), but they all have their caveats. If you are interested, explore the outliers package – there are functions available for each – and read up on each test to determine which is best for your particular purposes and data characteristics. Remember that you can always look at a simple boxplot.

• Ok, so you have identified an outlier – now what do you do? There are options (there is no perfect solution), but whatever you choose to do, you MUST document your actions and the rationale you used for any peer-reviewed publication or your thesis:

(1) Leave it and choose an appropriate analysis – If this data point is a true data point, you may not be able to justify transforming your data to remove it, or simply dropping it from your analysis; so consider including it and choosing an analysis that is not (or is less) sensitive to outliers (e.g. non-parametric, non-linear regression, etc.).

(2) Transform your data – If you can transform the data and thereby pull in the outlier, this is a good approach. See Unit 3 for more on transforming data. Note that you will likely have to use a transformation that has a bigger impact.

(3) Remove the outlier – If you can justify your reasons, you can remove an outlier. Consider situations like it was a data entry error, or false observation. Just because you don’t like the fact you have an outlier is NOT a reason to remove it – even if it makes your analysis harder (data is data). If you do remove the point, consider the impact on your analysis, for example is your design now unbalanced or did you by chance remove another important factor? Also consider the impact that removing the outlier will have on your interpretations of the final results.

(4) Impute the data point – If you think the outlier is an error, in some case it makes sense to impute the data point using the variable mean, median, or mode. This approach is more commonly used to address missing values, but it can also be applied to outliers.

(5) Cap the outlier – For values that lie outside the 1.5 * IQR limits, you could cap those observations below the lower limit at the value of the 5th percentile and those above the upper limit at the value of the 95th percentile (a short sketch of this follows the list).

(6) Predict the outliers with a modelled relationship – Like the imputing option, we can simply replace data outliers with modelled values predicted from an established relationship between the response and predictor variables.
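As a minimal sketch of the capping option (5), reusing the data.m.out example created earlier in this unit:

# Cap values outside the 1.5*IQR fences at the 5th/95th percentiles
x <- data.m.out$HEIGHT
fence <- 1.5*IQR(x)
lower <- quantile(x, 0.25, names=FALSE) - fence
upper <- quantile(x, 0.75, names=FALSE) + fence
cap.lo <- quantile(x, 0.05, names=FALSE) # 5th percentile
cap.hi <- quantile(x, 0.95, names=FALSE) # 95th percentile
x[x < lower] <- cap.lo # cap low outliers
x[x > upper] <- cap.hi # cap high outliers
summary(x)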


6.2. Correlations

Correlation analysis investigates the relationship between two continuous variables. The type of correlation you deploy should again depend on the characteristics of your data – distribution and homogeneity of variances (i.e. parametric versus non-parametric).

Pearson's Correlation (Parametric option)

Pearson's correlation is the most commonly used correlation to measure the degree of the relationship between linearly related variables. However, the assumptions for this correlation statistic (e.g. absence of outliers, normality of variables, linearity, and homoscedasticity) are often overlooked.

• Homogeneity of variances (homoscedasticity) and normality in Y for a given X value can best be explored with residual plots. Testing for homoscedasticity usually involves an examination of the "residuals", which are simply the "leftovers" after you subtract the fitted (predicted) values from your observed data values. Refer to Unit 3 for how to interpret a residual plot for normality and homogeneity of variances.

plot(lm(YIELD~HEIGHT, data=data1))

• INTERPRETATION: In this case the residuals appear to be normally distributed and the variances appear to be equal. Note that this is a loose acceptance that the data meets the assumptions (the plots are not perfect), but I am comfortable moving forward with a parametric Pearson's correlation.

• A Pearson's (parametric) correlation will tell you both the magnitude and direction of the association between two variables, because the raw data is used. We can easily calculate a Pearson's correlation coefficient, and test the significance of the correlation, using the default method in the cor() and cor.test() functions.

cor(data1$HEIGHT, data1$YIELD) ## returns correlation coefficient

cor.test(data1$HEIGHT, data1$YIELD) ## tests significance

• INTERPRETATION: We can see that there is a significant positive correlation between YIELD and HEIGHT (r=0.5955, p-value = 0.002138), meaning the plots with taller plants have higher yields – makes logical sense.

Kendall & Spearman Rank Correlations (Non-Parametric options)

If the assumptions of normality and/or homogeneity of variances are violated, both the Kendall and Spearman rank correlations are well-regarded, robust test statistics for non-parametric correlations.

In particular, the Spearman correlation coefficient is essentially the Pearson correlation between the ranked values of the variables (see the quick check below). It is useful if you are only interested in the direction of a relationship and not the magnitude. While you may get a higher correlation coefficient with a Spearman test, you must be careful with this. Remember that we are asking a fundamentally different question with a Pearson correlation (the relationship between the order and magnitude of the data values) than with a Spearman correlation (the relationship between the order of the data values only).

The Kendall correlation coefficient is best used to measure ordinal association (think of the association between how you and your friend rank ice cream flavours). This approach is further from Pearson's correlation but, like Spearman, is useful if you are only interested in the direction of a relationship and not the magnitude.
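A quick check of the Spearman claim above (assuming data1 is loaded):

# Spearman correlation equals the Pearson correlation of the ranks
cor(data1$DENSITY, data1$HEIGHT, method="spearman")
cor(rank(data1$DENSITY), rank(data1$HEIGHT)) # same value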


• Running non-parametric correlations is the same as parametric; we just need to specify the method this time. Note that if you tested the parametric assumptions for DENSITY and HEIGHT you would meet normality and homogeneity of variances (pretty well) – the code below is just for example.

# Kendall correlation

cor(data1$DENSITY,data1$HEIGHT,method="kendall")

cor.test(data1$DENSITY,data1$HEIGHT,method="kendall")

#Spearman correlation

cor(data1$DENSITY, data1$HEIGHT, method="spearman")

cor.test(data1$DENSITY, data1$HEIGHT, method="spearman")

• INTERPRETATION: We can see that both the Spearman and Kendall test found a significant negative correlation between DENSITY and HEIGHT (Kendall: -0.4462, p-value = 0.002762 and Spearman: -0.5783, p-value = 0.00307), meaning planting at a higher density results in shorter lentil plants – again makes logical sense.

• You may notice that for the Kendall and Spearman correlations R gives you a warning "cannot compute exact p-value with ties". This means you have two identical values in your data; these are called ties. When the data is ranked, identical values effectively "share" a rank, so the ranks are no longer unique and exact p-values cannot be calculated. In most cases this is not a problem; R just likes to warn you.

6.3. Simple Linear Regression

To assess the relationship between two variables, you need to derive an equation of the form y = b*x + a, or "dependent variable" = "slope" * "independent variable" + "intercept". The statistical test for the significance of a regression is actually a test of whether the slope of the regression is significantly different from zero. Remember that for linear regression your data needs to meet the assumptions of (1) normality and (2) homogeneity of variance (homoscedasticity) – let's assume our data meets these assumptions 😊.

• In R, the general linear model is implemented by the lm() function. In the following code, the first line runs the linear regression and stores the fitted model, and the second line displays the full details.

# Simple linear regression

output.lm <- lm(YIELD~HEIGHT, data=data1) # fit and store the linear model

summary(output.lm) #view summary

• INTERPRETATION: In this case the resulting linear model is YIELD = 208.46 + 88.09*HEIGHT, and HEIGHT is a significant predictor in the linear model (p-value = 0.00214) at an alpha value of 0.05 (you can change this). However, the R-squared value is 0.3547, which indicates a poor model fit – so the average height of plants in the plots is not a great predictor of total plot yield.

6.4. Multiple Linear Regression

Multiple linear regression is like simple linear regression – just with more variables considered. Model outputs show you the intercept as well as the coefficient, t-value and p-value for each term, which indicate whether that term accounts for a significant amount of variance in the model (note that, for the three predictor variables, as the SE increases the t decreases and thus the p-value increases). The output also provides some noteworthy statistics at the bottom, including the total residual SE, the df, the r² of the model, and the F-value (and corresponding p-value) of the model.

Typically, you will find that the r² value ("Multiple R-squared") increases as you add additional variables to your model. However, this does not necessarily mean that your model is "better". It only means that it accounts for more of the variance in the response variable. It is possible to have a "perfect" fit of r²=1 when your "model" passes through every data point (you would need as many model components as data points). This is the equivalent of a simple regression through two data points. So remember, as you increase the number of independent variables, your degrees of freedom decrease, so we want to keep models as simple as possible. (Think of it this way: each variable you purchase for your model costs you a degree of freedom, which weakens your analysis.) R takes this into account in its model summary, providing you with an adjusted r-squared: a statistic essentially identical in meaning to the regular r², but taking into account the sample size AND the number of predictor variables in the model. When we compare models with different predictors, we should always look at the adjusted r².

• Now let’s see if we can get a better model to predict yield by adding some additional factors. We will use the plus symbol “+” between each predictor in the linear model.

# Build multiple linear regression model

summary(output.lm) # review simple linear model output

output.lm2<-lm(YIELD~HEIGHT+DENSITY, data=data1) # MLR with 2 variables

summary(output.lm2) #view summary

output.lm3<-lm(YIELD~ HEIGHT+DENSITY+FERTILIZER, data=data1) #MLR with 3 variables

summary(output.lm3) #view summary

• INTERPRETATION: The 2-variable multiple linear regression (MLR) model YIELD = 1155.30 + 24.89*HEIGHT – 30.62*DENSITY indicates that only DENSITY significantly affects YIELD (p-value = 7.37e-07; the p-value for HEIGHT is 0.157), and the overall fit is better than the simple linear model previously tested (Adjusted R-squared = 0.7856). The 3-variable MLR model YIELD = 1136.1856 + 25.0805*HEIGHT – 30.5170*DENSITY + 0.1*FERTILIZER indicates that only DENSITY significantly affects YIELD (p-value = 1.56e-06); however, we lose model fit compared to the 2-variable model (Adjusted R-squared = 0.7753). Overall these results suggest that maybe the best model for lentil yield is simply density alone. We can confirm this with a correlation and a simple linear model:

# Correlation of density and yield

cor(data1$DENSITY, data1$YIELD) ## returns correlation coefficient

cor.test(data1$DENSITY, data1$YIELD) ## tests significance

#Simple linear model to quantify relationship

output.lm4 <- lm(YIELD~DENSITY, data=data1) # fit and store the linear model

summary(output.lm4) #view summary

• INTERPRETATION: We can see that there is a significant negative correlation between YIELD and DENSITY (r = -0.885487, p-value = 8.929e-09), meaning the plots with fewer plants have higher yields – makes logical sense. The resulting linear model is YIELD = 1311.224 – 34.108*DENSITY, and DENSITY is a significant predictor in the linear model (p-value = 8.93e-09). Notice that the p-values for the correlation and the linear model are the same (they are doing the same thing). Although the model fit is about as good as the 2-variable MLR model we previously tested (R-squared = 0.7841), all variables in this model are significant, meaning adding HEIGHT to the model does not have a big impact on yield – so it is not necessary to include it.

6.5. Akaike’s Information Criterion (AIC)

AIC considers both the fit of the model and the number of parameters, where more parameters result in a "penalty". The model fit is measured as the likelihood of the parameters being correct for the population, based on the observed sample. The number of parameters is related to the degrees of freedom that are left. Put simply: AIC is roughly twice the number of parameters minus twice the log-likelihood of the overall model; therefore the lower the AIC value, the better the model (and large negatives are low!). A short sketch of this calculation follows.
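As a minimal sketch of that calculation (assuming data1 is loaded; the simple YIELD~DENSITY model is used as an example):

# AIC from the log-likelihood and the number of estimated parameters
m <- lm(YIELD~DENSITY, data=data1)
k <- attr(logLik(m), "df") # parameters estimated (intercept, slope, sigma)
2*k - 2*as.numeric(logLik(m)) # matches the value reported by AIC()
AIC(m)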


AIC is used by the stepwise function step() in R. Stepwise model comparison is an iterative model evaluation that will either:

(1) Start with a single variable, then add variables one at a time ("forward")

(2) Start with all variables, iteratively removing those of low importance ("backward")

(3) Run in both directions ("both")

• Akaike's Information Criterion (AIC) is a good way to find the best balance between model complexity and model fit. AIC is the criterion used in the "step" procedure of R, which selects the best equation through a search procedure that tries out many combinations of variables and keeps the model terms that contribute meaningfully to the overall variance explained:

#Stepwise comparison to identify model fit

step(lm(YIELD~ HEIGHT+DENSITY+FERTILIZER, data=data1), direction="backward")

step(lm(YIELD~ HEIGHT+DENSITY+FERTILIZER, data=data1), direction="forward")

step(lm(YIELD~ HEIGHT+DENSITY+FERTILIZER, data=data1), direction="both")

• INTERPRETATION: In this particular case, the "forward" and "backward" selection procedures come to different conclusions. This is relatively unusual for a dataset with few variables, but may happen if your independent variables are highly correlated (think about our correlation results for DENSITY and HEIGHT). In this case, you may simply pick the equation that has the lowest AIC value (the backward-selected model, in this case). The "both" procedure agrees with the backward selection and with our previous analysis that the multiple linear model YIELD~DENSITY+HEIGHT is best (AIC = 213.46).

• There are a variety of options in the step() function. Notably, you may specify a threshold for keeping/dropping variables from a model (e.g. a critical p-value). You may also specify a maximum number of steps to consider (i.e. if you want a final model with a certain number of variables). Have a look at ?step for more information on these arguments.

6.6. Non-linear Regression

Sometimes, the relationship between your predictor and response variables may not be linear. The right type of non-linear model (be it exponential, power, logarithmic, polynomial, etc.) is usually, and preferably, determined from subject-matter considerations. Non-linear regression, therefore, does the same job as linear regression, but with a relationship that is assumed to deviate from linearity (in that sense it offers a flexible alternative when the linear form does not fit).

In the formulae, y is always your dependent (response) variable and x is always your independent (predictor) variable. The other terms (a, b, and c) are parameters of the model. For all formulae you can add an additional parameter (e.g. +d) if your curve does not pass through the origin but intercepts the y-axis at y=d instead. Below are some examples of non-linear equations (there are MANY more options).

o Hyperbolic: y=a+b/(x+c)

o Exponential: y=a+b*c^x

o Logarithmic: y=a+b*x^c

o Sigmoidal increase: y=a/(1+(b^(x-c))) or decrease: y=a/(1+(b^(-x+c)))

o Michaelis Menten increase: y =(a*x)/(b+x) or decrease: y =a+(-a*x)/(b+x)

o Parabolic upward: y=a-b*(x-c)^2 or downward: y=a+b*(x-c)^2

• Curve fitting for non-linear regression is a trial-and-error process: the program tries a value of a parameter, then increases or decreases that value and evaluates whether the fit gets better or worse. The best function I have found for non-linear regression in R is the nlsLM() function from the minpack.lm package, which minimizes the sum of squares of the residuals using a modification of the very robust Levenberg-Marquardt algorithm. This function can sometimes figure out start values for the iterative process by itself, but often we have to specify the start values to give the function a starting point. Note that R may not find a function that fits, and will end with an error message. Here is an example for a simplified log function that has a zero intercept and where we give a rough guess of start values based on the data:

# Install required package

install.packages("minpack.lm")

library(minpack.lm)

# Non-linear regression – log curve

nlsLM(Y1~a*PV^b, data=data4) #try without providing starting values

nlsLM(Y1~a* PV^b, data=data4, start=list(a=1,b=2)) #include starting values

• INTERPRETATION: In this case R was able to estimate the non-linear curve without us having to specify the start values (we did provide them in the second line, which is commonly what you will need to do). Overall, R was able to fit the model Y1 = 0.8896*PV^1.9539.

• It is a very good idea to check visually whether your curve actually fits the data. With the stat_function() function you can add the fitted curve (built from the coefficients returned by nlsLM()) to an established plot (in the example below that is the fitted function Y1 = 0.8896*PV^1.9539).

# Create function for plotting

log.curve<-function(x)0.8896*x^1.9539

# Plot data and curve

graphics.off()

ggplot(data4,aes(x=PV, y=Y1)) + geom_point() +

stat_function(fun=log.curve, color="blue", size=1)

• Now you are probably wondering "how do we determine if the non-linear regression model is a good fit?" Unfortunately, calculating an R² is NOT APPROPRIATE for non-linear regression. For linear models, the sums of squares always add up in a specific manner: SS_Regression + SS_Error = SS_Total, therefore R² = SS_Regression / SS_Total must mathematically produce a value between 0 and 100%. This additivity does not hold for non-linear models, but we can use AIC to evaluate the fit of the non-linear model to the data. Remember, in AIC, each additional parameter in the model (a, b, c, etc.) results in a "penalty" while "credit" is given for more variance explained by the model, and lower AIC values are better (and large negatives are low).

#Use AIC to determine which non-linear model best fits the data

AIC(nlsLM(Y1~a*PV^b, data=data4, start=list(a=1,b=2))) #logarithmic model

AIC(nlsLM(Y1~a+b*PV^c, data=data4, start=list(a=0,b=1,c=2))) #exponential model

AIC(nlsLM(Y1~a+b*PV, data=data4, start=list(a=0,b=1))) #linear model

• INTERPRETATION: Comparing these three non-linear models we can see that the logarithmic model has the lowest AIC value (AIC = 25.38196) and is therefore the best fit to our data compared to the exponential and linear models.

6.7. Logistic Regression

When your response variable is binomial (two alternate categorical responses) you cannot use a traditional regression to explain it. This is common with response variables such as "yes" vs. "no", "presence" vs. "absence", "success" vs. "failure", etc. To get around the categorical nature of the response, we can treat the two alternate responses as probabilities instead of discrete classes. This is really just one probability, as the probability of one class is simply one minus the probability of the other class. These regression models are based on a "logistic model" or "logit model", which is essentially a curve between the two alternative y responses along an x variable (a binomial distribution).

• A logistic regression works just like a simple or multiple linear regression. It is capable of handling continuous or categorical independent variables (predictors); however, it is generally harder to make inferences from regression outputs that use discrete or categorical predictors. While the lm() function could fit a linear probability model, the glm() function is what we use here, as it fits the logistic model properly. To indicate that you have a binomial response, we must set the family argument to binomial. Let's build a model to look at the effect of the fertilizer components on survival.

#Build logistic regression model

lr.model <- glm(SURVIVE~NITROGEN+PHOSPHORUS, data=data3, family="binomial")

summary(lr.model) #regression output

• INTERPRETATION: The output from the logistic regression shows that NITROGEN is likely a meaningful addition to the model because changes in this fertilizer amount are related to changes in plant survival (p-value = 0.00711).

• We can also use the same stepwise tools that we used with multiple regression to choose the best model (based on AIC).

# Stepwise comparison to identify the best model fit

step(glm(SURVIVE~NITROGEN+PHOSPHORUS, data=data3, family="binomial"))

• INTERPRETATION: As the regression output suggests, the stepwise AIC comparison identified that the best logistic model includes only NITROGEN (AIC = 97.43).

• Like with non-linear regression, calculating an R2 is NOT APPROPRIATE for logistic regression. There are "pseudo R2" statistics out there (e.g. McFadden's pseudo R2), but there is MUCH debate about how effective they are at measuring model goodness of fit. It is best to use AIC to evaluate the fit of a logistic model compared to other options. Remember, in AIC each additional predictor results in a "penalty" while "credit" is given for more variance explained by the model, and lower AIC values are better (large negative values are low).

• Also, to get a better indication of which predictor variables to include in your logistic model, you can use a binomial ANOVA, which takes the output from the generalized linear model (the logistic regression with the binomial distribution) and acts like a chi-squared test (Unit 5).

Once you build the logistic model, the next step is to use the anova() function with the specification that we want to calculate p-values using a chi-squared test and the chi-squared distribution, rather than the F-distribution (the anova() default), which is reserved for parametric statistics. The output from this binomial ANOVA is similar to the parametric option; however, rather than the sum-of-squares and mean sum-of-squares, it displays the deviance contributed by each of the parameters. Remember, deviance is a measure of the lack of fit between the model and the data, with larger values indicating a poorer fit. The p-values are calculated using the chi-squared distribution, but like the parametric alternative they indicate whether or not each of the predictors has a significant effect on the probability of achieving a "success" (value of 1).

# Binomial ANOVA

bi.anova <- anova(lr.model, test="Chisq")

bi.anova #ANOVA output

• INTERPRETATION: The binomial ANOVA confirms the logistic regression results that NITROGEN has a significant effect on the probability of achieving a "success" (plant survival, a value of "1") (p-value = 0.0004239) and therefore should be included in the logistic model.


• If you have a logical reason to do so, you can also determine if there are interactions in your data by including * in the model statement. Keep in mind that an interaction term adds a parameter to the model, so it is only worth keeping if it meaningfully improves the fit. If you find there are significant interactions, this could indicate collinearity among your predictors, or it could indicate that the effect of the predictor on the response is more complex. You could then follow this up with pairwise z-tests for proportions, with adjusted p-values for multiple comparisons, to determine the effects.

# Logistic model with interactions

lr.model2 <- glm(SURVIVE~ NITROGEN*PHOSPHORUS, data=data3, family="binomial")

summary(lr.model2)

bi.anova2 <- anova(lr.model2, test="Chisq")

bi.anova2 #ANOVA output

• INTERPRETATION: In this case the model summary does not indicate that the interaction model is a good fit (no significant p-values), and the binomial ANOVA output confirms that there is no significant interaction between the nitrogen and phosphorus treatments.

• Acknowledging that a logistic regression model with nitrogen alone seems to be the best for the survival data, let's revise the model to include only this variable.

#Update logistic regression model

lr.model.n <- glm(SURVIVE~NITROGEN, data=data3, family="binomial")

summary(lr.model.n) #regression output

• Last, we can use the logistic regression results to determine the odds ratio, which describes the odds of a success (in this case survival) based on the predictors in the modelled relationship. The odds ratio for a predictor is the multiplicative change in the odds when you increase that predictor by one unit. We can find this by using the exp() and coef() functions. In this case we can look at the odds of survival based on the nitrogen added to the plot.

# Odds ratio of the logistic model

exp(coef(lr.model.n))

• INTERPRETATION: In this case the odds ratio is 0.222819, which implies that a 1-unit increase in NITROGEN changes the odds of survival by a factor of about 0.22.
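• It can also be useful to report an approximate confidence interval around the odds ratio. A minimal sketch using Wald-type intervals from confint.default() (profile-likelihood intervals via confint() are an alternative):

# Odds ratios with approximate 95% confidence intervals
exp(cbind(OR = coef(lr.model.n), confint.default(lr.model.n)))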

• You can then convert the odds ratio into an estimated probability of survival for a given value of nitrogen using the logit model equation y = e^(B0 + B1x1 + ... + Bnxn) / (1 + e^(B0 + B1x1 + ... + Bnxn)), which is the basis of logistic regression. For our logistic model, we only need the coefficients for the intercept (B0) and the nitrogen term (B1).

# Extract model intercept and beta coefficient from logistic model

lr.intercept<-lr.model.n$coefficients[1]

lr.n.coeff<-lr.model.n$coefficients[2]

# Estimate probability of survival

exp(lr.intercept + lr.n.coeff*200)/(1+(exp(lr.intercept+lr.n.coeff*200))) # 200 nitrogen

exp(lr.intercept + lr.n.coeff*500)/(1+(exp(lr.intercept+lr.n.coeff*500))) # 500 nitrogen

exp(lr.intercept + lr.n.coeff*750)/(1+(exp(lr.intercept+lr.n.coeff*750))) # 750 nitrogen

• INTERPRETATION: Based on our logistic model, if 200 units of nitrogen are applied to a plot there is a 40% chance of survival, but if the nitrogen is increased to 500 or 750 units the probability of survival increases to 77% and 93%, respectively.
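• The same probabilities can be obtained more directly with the predict() function by setting type="response", which applies the logit back-transformation for you. A short sketch, assuming the lr.model.n model from above (new.n is an illustrative name):

# Predicted survival probabilities for chosen nitrogen values
new.n <- data.frame(NITROGEN = c(200, 500, 750))
predict(lr.model.n, newdata = new.n, type = "response")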


Unit 7 – Multivariate Orientation (Bonus)

All of the techniques now available in your toolbox (from this workshop) are univariate – this simply means they work when you have a single response variable. But what happens if you want to consider more variables at the same time to see how they all interact? This is where your toolbox dramatically expands! To take a VERY shallow dive into multivariate tools, let's look at some classical multivariate techniques that rely on rotating a dataset in multiple dimensions and then looking at the results through a 2-dimensional "window".

Example Data

For this unit we will consider the sample datasets available in R (see the datasets package for a list of all the datasets available).

First, we will use the USArrests dataset that contains statistics, in arrests per 100,000 residents for assault, murder, and rape in each of the 50 US states. Also given is the percent of the population living in urban areas.

Second, we will use the iris dataset that contains 150 flower measurements (petal and sepal length and width) of three iris species (50 specimens from each species).

• If you have not already done so, load the USArrests dataset into R:

data("USArrests") #load R data into workspace

head(USArrests) #view the first 6 rows of data

tail(USArrests) #view the last 6 rows of data

str(USArrests) #view data structure

pairs(USArrests) #graph variables in pairs

• If you have not already done so, load the iris dataset into R:

data("iris") #load R data into workspace

head(iris) #view the first 6 rows of data

tail(iris) #view the last 6 rows of data

str(iris) #view data structure

pairs(iris) #graph variables in pairs

7.1. Principal Component Analysis

Principal component analysis (or PCA) uses an orthogonal transformation to convert a set of observations of possibly correlated response variables into a set of values of linearly uncorrelated variables called principal components – essentially, we reduce the dimensionality of the data and end up measuring the data in terms of its principal components rather than on the original x-y axes. Dimension reduction strips the data down to its basic components, removing any unnecessary parts. Principal components are the directions where there is the most variance in your data – in other words, where the data is most spread out.

Data points are organized by eigenvectors and eigenvalues, which come in pairs. Simply put, an eigenvector is a direction, and an eigenvalue tells you how much variance in your data is associated with that direction. The data is effectively "rotated" onto a first axis to maximize the variation explained (PC1), then rotated onto a second axis to maximize the remaining variation explained (PC2) – this continues for the number of variables you include in your analysis.


• To run a PCA, use the princomp() function, which is available in the stats package that ships with base R (there are other packages that do PCA as well). The main argument we need to specify is whether we want to work with a correlation or covariance matrix in our analysis, using the cor argument. You tend to use the covariance matrix (cor=F) when the variable scales are similar, and the correlation matrix (cor=T) when variables are on different scales, so that you are not biasing the results simply due to the different data scales. It is common to use the correlation matrix because it standardises the data before calculating the principal components and generally gives better results, especially when the scales are different. Let's use PCA on the USArrests dataset available in R – there is a fair amount of information in this dataset and it is not trivial to visualize effectively.

# Principle component analysis

out1<-princomp(USArrests, cor=T)

• For this analysis, I find it best to extract the information you are interested in individually from the output object. Below we request the principal component loadings (correlations with the original variables), the component scores (the new coordinates of the points after rotation), and the variance explained by the principal components (with the summary statement). The princomp() function does not report eigenvalues as extractable information, but you can calculate them manually, and you can convert them to variance explained by dividing them by the number of variables.

# Output from PCA

out1$loadings #correlations with original variables

out1$scores #new coordinates of points after rotation

summary(out1) #variance explained by components

eigen(cor(USArrests)) #eigenvalues

eigen(cor(USArrests))$values/4 #variance explained by eigenvalues

• INTERPRETATION:
o First, the loadings tell us which of the original variables is driving each principal component – in this case PC1 is fairly equally related to the murder, assault and rape data, PC2 is driven by the urban population data (negative correlation = -0.873), PC3 is driven by the rape data (negative correlation = -0.818), and PC4 is driven by the assault data (negative correlation = -0.743).

o Second, we can see the coordinates of each of the new PCs – we use these for plotting.

o Next, we can see that PC1, PC2, PC3, and PC4 explain 62%, 25%, 9% and 4% of the variation in the data respectively (proportion of variance in the table). This means that if I plotted PC1 and PC2 I would be able to explain 87% of the variation in my data with 2 variables (cumulative proportion in table) – the more variation we can explain in 2 variables (2-dimensional space) the better.

o Last, the eigenvectors match the loadings, and the variance explained calculated from the eigenvalues matches the summary results – just another way of seeing the same information.
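• A scree plot is a quick way to see how much variance each component captures before deciding how many components to keep. A minimal sketch using the base screeplot() function on the out1 object from above:

# Scree plot of the principal components
screeplot(out1, type="lines", main="USArrests PCA")

# Proportion of variance explained, calculated from the component standard deviations
out1$sdev^2/sum(out1$sdev^2)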

• Finally, we can make a quick biplot of the rotated points, choosing the first and second principal components as axes with the choices argument; as we saw above, these two components explain most of the total variance. The biplot also adds the component loadings as vectors.

# PCA biplot to visualize the rotated points

biplot(out1, choices=c(1,2))

• INTERPRETATION: The direction of the arrows represents the eigenvectors and the size of the arrow represents the eigenvalues (longer arrows are bigger values). Points (in this case states) that are in the direction of the arrow have higher values of the variable the arrow represents. For example, Alaska, Maryland, and Georgia (a few examples) have higher murder rates, and California has a higher urban population. The opposite direction of the arrows is true as well.


For example, South Dakota and West Virginia have fewer rapes. The position of each point represents the cumulative "pull" from all the arrows. For example, Mississippi has higher murder, assault and rape occurrences (all of those arrows point, for the most part, in the same direction), but a lower urban population (opposite direction to the urban population arrow). You can make inferences about crime in all the states from this simple PCA output – which, again, explains 87% of the variation in the data.
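• If you would rather build the plot yourself with ggplot2 (used earlier in this workbook), you can extract the rotated scores from the princomp() object (stored in columns named Comp.1, Comp.2, ...). A sketch, assuming out1 from above:

# Plot PC1 vs PC2 scores with state labels using ggplot2
pca.scores <- as.data.frame(out1$scores)
pca.scores$State <- rownames(pca.scores)
ggplot(pca.scores, aes(x=Comp.1, y=Comp.2, label=State)) +
geom_point() + geom_text(size=2, vjust=-0.8)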

7.2. Discriminant Analysis

Just like principal component analysis, discriminant analysis is a rotation-based technique and can simply be used to visualize your data. However, rather than maximizing the total variance explained, as is done with the principal components in PCA, discriminant analysis aims to maximize the variance explained between groups.

Discriminant analysis can answer a number of questions, including "are there significant differences between groups?" or "to which group does a new observation belong?" These questions are often asked in fields such as taxonomy, paleontology, or anthropology (e.g. did we find a new species? To which ancestor does that jaw bone belong?). Discriminant analysis is a popular analytical tool in these fields.

One important note when considering discriminant analysis: You need an a priori classification system to use discriminant analysis – this analysis does not help you to define classes in the first place (you have to use another technique like cluster analysis or nonmetric multidimensional scaling (NMDS) which are not covered in this workshop).

• To run a discriminant analysis you will need to use the candisc() function available in the candisc package – you will need to install and load this package. To use this function you will first need to create a linear model with the lm() function where your response columns are bound together (use the cbind() function), then input that linear model into the candisc() function. Let's run a discriminant analysis on the iris sepal and petal measurements to look at species traits.

# Install required package

install.packages("candisc")

library(candisc)

x=lm(cbind(Petal.Length,Sepal.Length,Petal.Width,Sepal.Width)~Species,data=iris)

out2=candisc(x, term="Species")

• Again, for this analysis, I find it best to extract the information you are interested in individually from the output object. Below we first look at the results of the discriminant analysis including the variance explained by each linear discriminant. We then extract the discriminant function loadings (correlations with original variables) and finally plot a nice biplot with group centroids and labels, as well as custom coloring and scaling of vectors.

# Output from discriminant analysis

summary(out2)

out2$structure # discriminant function loadings

plot(out2, which=c(1,2), scale=8, var.col="#777777", var.lwd=1,

col=c("red","green","blue"))

• INTERPRETATION:
o First, the discriminant summary tells us that the first discriminant function explains 99% of the variation in the data and the second function explains 1% (percent column in the table). Note that discriminant analysis gives at most (number of groups − 1) discriminant functions – here 2, because there are three species – and together they always account for 100% of the between-group variation explained.
o Second, the structure of the discriminant functions shows that the first discriminant function is primarily driven by a negative correlation with the petal attributes (petal length


= -0.985, petal width = -0.973) – this means that these 2 variables likely explain the majority of the variation between species. The second discriminant function is driven by a negative correlation with sepal width (sepal width = -0.758).

o The plot lets us visualize these results simply – the high variation explained by the first discriminant function is seen in the clear distinction between the iris species along the x-axis of the plot. The arrows work the same way as they did with PCA. In this case the species virginica is associated with larger petal attributes (both length and width) as well as larger sepal length. Alternatively, setosa has smaller petal attributes.

• The output from the discriminant analysis represents a relationship we can use to classify new observations with the predict() function. However, to be able to do this we have to run the discriminant analysis using the lda() function in the MASS package, as the candisc() function does not produce an output that predict() can use. The syntax for the linear discriminant analysis is lda(classvariable~., dataset). The dot means "all other variables", but you could list them individually as well. Let's say we have 6 new iris plants and we want to classify them by species using their observed measurements. Note that the biplot available from the MASS approach is not as nice as the one from candisc – which is why I recommend only using the MASS option if you want to predict. In the plot, the argument asp=1 forces the two discriminant axes to have the same scale, which better illustrates which function is most effective in separating the groups.

# Install and load required package

install.packages("MASS")

library(MASS)

# Discriminant analysis

out2.v2=lda(Species~., iris)

# Output from discriminant analysis

out2.v2 #view analysis results

lda.scores <- predict(out2.v2, iris)$x #new coordinate of points after rotation

# Build data table with new observations

Sepal.Length <- c(4.7, 4.8, 6.2, 5.1, 5.6, 6.3)

Sepal.Width <- c(3.2, 3.2, 2.9, 2.5, 2.8, 3.5)

Petal.Length <- c(1.7, 1.6, 4.2, 3.1, 4.1, 6)

Petal.Width <- c(0.2, 0.2, 1.3, 1.1, 1.3, 2.5)

newObs <- as.data.frame(cbind(Sepal.Length,Sepal.Width,Petal.Length,Petal.Width))

# Predict species of new observations with discriminant analysis

out2p <- predict(out2.v2, newObs)

scores_unknown <- out2p$x

plot(lda.scores, col=rainbow(3)[iris$Species], asp=1)

points(scores_unknown, pch=19)

• INTERPRETATION: We can clearly see on the plot that 3 of the points seem to belong to species versicolor (green), 2 of the unknown points seem to belong to species setosa (red), and a single point seems to belong to species virginica.
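• The predict() output also contains the predicted class labels and the posterior probability of membership in each species, which is often easier to report than reading group membership off the plot. A short sketch using the out2p object created above:

# Predicted species and posterior probabilities for the new observations
out2p$class #most likely species for each new plant
round(out2p$posterior, 3) #posterior probability of membership in each species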

7.3. Multivariate Analysis of Variance (MANOVA)

To address the question: “do we really have different species?” we can carry out a Multivariate Analysis of Variance (MANOVA). This is in fact mathematically exactly the same thing as discriminant analysis, above. We are just asking a different question.


• First, we have to do some data preparation for the manova() function to work properly. This involves splitting the dataset into the class variable and the measurements, defining the measurements as a numeric matrix and the species variable as a factor (or class) variable.

# Separate dataset for MANOVA

species=as.factor(iris[,5]) #species as a factor (class) variable

measurements=as.matrix(iris[,1:4]) #numeric matrix of the four measurements

• Just as with a standard ANOVA, the MANOVA analysis assumes both normality and equal variances of residuals. For MANOVA, we also must test each response variable within each level of each treatment. To simplify, let's consider the residual plots for each response across species.

# Residual plots to look at MANOVA assumptions

plot(lm(iris$Petal.Length~iris$Species))

plot(lm(iris$Petal.Width~iris$Species))

plot(lm(iris$Sepal.Length~iris$Species))

plot(lm(iris$Sepal.Width~iris$Species))

• INTERPRETATION: In this case the residuals for each variable appear to be normally distributed and the variances appear to be equal – I am comfortable moving forward with a MANOVA.
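• If you want a more formal check of these assumptions, you can pair a Shapiro-Wilk test of the residuals with Bartlett's test for equal variances. A sketch for one response variable (the same idea applies to the other three):

# Formal assumption checks for one response variable
shapiro.test(residuals(lm(Petal.Length~Species, data=iris))) #normality of residuals
bartlett.test(Petal.Length~Species, data=iris) #homogeneity of variances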

• To run the MANOVA, you can use the manova() function that is available in the stats package and build a model in which the measurements (the multivariate response) are predicted by species. Let's run the MANOVA to look at the differences in measurements among species. In the MANOVA summary, we can see how each response variable relates to the treatment.

# MANOVA

out3 <- manova(measurements~species)

summary.aov(out3) #view MANOVA output

• INTERPRETATION: In this case we can see that there is a significant difference among species in all 4 measurements (all p-values < 2.2e-16). This means there really is a distinction in physical characteristics among these species.

• Note that you can run a MANOVA with multiple predictors by including * in the model statement (as we did with ANOVA). Unfortunately, as far as I know, there is not a good way to run multiple comparisons, aside from looking at each effect individually (in separate MANOVAs). If you're interested in a particular effect or interaction, run a univariate ANOVA instead.

• Typically, when we run MANOVA, we want to know if the arrangement of responses (ALL responses together) is significant as a whole. To obtain this, we need to ask for specific statistics. MANOVA can be interpreted with one of several multivariate tests, including Pillai's trace (R default), Wilks' lambda, and the Hotelling-Lawley trace. The most common MANOVA test is Wilks' lambda, which can be useful because it is often interpreted as the proportion of variance NOT explained by the model (so smaller values indicate a better-fitting model). Before you choose a test, however, you should read up on them to determine which is the best metric for your purposes.

# Results considering all measurements together

summary(out3, test="Wilks")

• INTERPRETATION: When all measurements are considered together there is a difference among species (p-value < 2.2e-16), which is not surprising given the individual results. Also note the small Wilks' lambda (about 0.02), indicating that only about 2% of the variance is left unexplained when all measurements are considered together – although for describing these differences the discriminant analysis results were more informative.
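• If you want to compare the different multivariate test statistics, the test argument of summary() for a manova object also accepts "Pillai", "Hotelling-Lawley", and "Roy". A short sketch using the out3 model from above:

# Other multivariate test statistics for the same MANOVA
summary(out3, test="Pillai")
summary(out3, test="Hotelling-Lawley")
summary(out3, test="Roy")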


Appendix 1 – Statistics Terminology

It is always good to have a basic understanding of the vocabulary we will be using. Use the list below as a reference for the topics we will cover in this workshop, as well as an ongoing reference for the analyses you encounter in publications and your own work.

Accuracy – a measure of how close an estimator is expected to be to the true value of a parameter

Actual t-value - the t-value calculated from the raw data being tested (the signal-to-noise ratio), which is compared to the critical t-value at the chosen alpha level

Alpha level - predetermined probability where we make some sort of decision

Alternative hypothesis (H1) – states a value of a variable or relationship between variables that is different from the null hypothesis. Tests may have one or more alternative hypotheses which may

be rejected or confirmed.

Bias – how far the average statistic lies from the parameter it is estimating

Central limit theorem – states: "Sample means tend to cluster around the central population value. Therefore, when the sample size is large, you can assume that the sample mean is close to the value of the true population mean; with a small sample size, you have a better chance of getting a mean that is far off the true population mean."

Confidence interval - a range of values we are fairly sure our true value lies within. We are so sure we use a % to define it (e.g. 95% Confidence Interval)

Continuous variable – values can fall anywhere on an unbroken scale of measurements with real limits (e.g. temperature, height, volume of fertilizer, etc.)

Critical t-value - the t-value that corresponds to the chosen alpha level

Datum (data) – value of variables (e.g. what you record)

Degrees of freedom – the number of values in the final calculation of a statistic that are free to vary

Dependent variable – properties of things (e.g. what you measure on your research subjects)

Descriptive statistics – numerical and/or graphical summary of data

Discrete variable – values may only fall at particular points on the scale of measurement and cannot exist between points (e.g. number of patients, number of plants, etc.)

Error - difference between an observed value (or calculated) value and its true (or expected) value

Experiment – any controlled process of study which results in data collection, and which the outcome is unknown

Independent variable – environment of things (e.g. what you measure because you think it influences your research subjects)

Inferential statistics – predict or control the values of variables (make conclusions with)

Interval data – quantitative measurement that indicates BOTH the order of magnitude AND implies equal intervals between the measurements. Note, these measurements have ARBITRARY ZEROS (e.g. temperature in °C). All statistics allowed, but no × or ÷ (alternative: % change)

Noise – a measure of the distribution of the data


Nominal data – qualitative measurement where categories or numbers ONLY label the object being measured or identify the object as belonging to a category (e.g. sample plots identified by 1-10 or by location, qualitative categories: Low-Medium-High or Male/Female, etc.) Don’t calculate statistics on this data type – for example how do you take a mean of male/female?

Null hypothesis (H0) – states the expected value of a variable or relationship between variables. This usually reflects the “normal state” of the variables: what you would expect to find without applying any treatments. Each test may only have one null hypothesis and it may be rejected, but may not be confirmed or proven.

Ordinal data – quantitative measurement that indicates a relative amount, arranged in rank order, but DOES NOT imply an equal distance between points (e.g. ranking of patient growth performance, where 1 is worst and 10 is best). Percentiles or non-parametric statistics ONLY

Parameter – an unknown value (needs to be estimated) used to represent a population

characteristic (e.g. population mean)

Percentile - the value below which a given percentage of observations within a group fall

Precision – a measure of how close measured/estimated values are to each other

Population – general class of things (e.g. what you want to learn something about)

Power – the ability to reject the null hypothesis when it is false (i.e. your ability to detect an effect when there is one)

P-value - the probability of obtaining the observed value, or one more extreme, by random chance alone (i.e. if the null hypothesis were true)

Quartile - (1st, 2nd, 3rd, 4th) points that divide the data set into 4 equal groups, each group

comprising a quarter of the data

R – the best statistical software on the market – and it’s free!

Ratio data – quantitative measurement where numbers indicate a measure with EQUAL intervals and a TRUE ZERO (e.g. Precipitation (156mm), frequencies (counts of just about anything))

All statistics allowed

Sample – group of things representing the population (e.g. what you are actually studying)

Sampling distribution (a.k.a. Probability distribution or Probability density function) – probability associated with each possible value of a variable

Signal - the difference between the test and mean values

Standard error – the standard deviation of a statistic which measures confidence in our sample to calculate the population statistic. Small values indicate that the sample is more likely to be representative of the overall population. Large values indicate that it is less likely the sample

adequately represents the overall population

Standard error of the mean – reflects the overall distribution of the means you would get from repeatedly resampling

Statistic – estimation of parameter (e.g. mean of a sample)

Statistical inference – to make use of information from a sample to draw conclusions (inferences) about the population from which the sample was taken

Type I Error – Error you make when you reject the null hypothesis when it is in fact true

Type II Error – Error you make when you fail to reject the null hypothesis when it is not true

Unit – thing, location, entity (e.g. an individual research subject)


Appendix 2 – Flow Cytometry Specific Packages

Now I realize that you probably don't want to be a farmer for the rest of the time you are using R – you're here to learn basic techniques that can help you navigate your own analyses. I must admit Flow Cytometry is not my area of expertise, but to give you a head start on using these techniques in your own field I have tracked down the following list of packages that are applicable and commonly cited in the peer-reviewed literature.

• flowAI()

Monaco G and Hao C (2016). flowAI: Automatic and interactive quality control for flow cytometry data. R package version 1.5.0.

More Information available at: http://bioconductor.org/packages/devel/bioc/html/flowAI.html

• flowClean()

Fletez-Brant K, Spidlen J, Brinkman RR, Roederer M and Chattopadhyay PK (2016). “flowClean: Automated identification and removal of fluorescence anomalies in flow cytometry data.” Cytometry Part A.

More Information available at: http://bioconductor.org/packages/devel/bioc/html/flowClean.html

• FlowSOM()

Van Gassen S, Callebaut B and Saeys Y (2017). FlowSOM: Using self-organizing maps for visualization and interpretation of cytometry data. http://www.r-project.org, http://dambi.ugent.be.

More Information available at: http://bioconductor.org/packages/release/bioc/html/FlowSOM.html

Additionally, you can check out the following link for more information on R Packages related to Flow Cytometry: http://bioconductor.org/packages/devel/BiocViews.html#___FlowCytometry
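• Note that these are Bioconductor packages, so they are installed with BiocManager rather than install.packages() alone. A minimal sketch (versions on your system may differ from those cited above):

# Install Bioconductor flow cytometry packages
install.packages("BiocManager") #if not already installed
BiocManager::install(c("flowAI", "flowClean", "FlowSOM"))
library(flowAI) #load a package before use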

Good luck – and remember learning R is like learning a language - the more you use it the easier it becomes!