
Chapter 14

Correlation and simple linear regression

Contents

14.1 Introduction
14.2 Graphical displays
    14.2.1 Scatterplots
    14.2.2 Smoothers
14.3 Correlation
    14.3.1 Scatter-plot matrix
    14.3.2 Correlation coefficient
    14.3.3 Cautions
    14.3.4 Principles of Causation
14.4 Single-variable regression
    14.4.1 Introduction
    14.4.2 Equation for a line - getting notation straight (no pun intended)
    14.4.3 Populations and samples
    14.4.4 Assumptions
    14.4.5 Obtaining Estimates
    14.4.6 Obtaining Predictions
    14.4.7 Residual Plots
    14.4.8 Example - Yield and fertilizer
    14.4.9 Example - Mercury pollution
    14.4.10 Example - The Anscombe Data Set
    14.4.11 Transformations
    14.4.12 Example: Monitoring Dioxins - transformation
    14.4.13 Example: Weight-length relationships - transformation
    14.4.14 Power/Sample Size
    14.4.15 The perils of R²
14.5 A no-intercept model: Fulton’s Condition Factor K
14.6 Frequently Asked Questions - FAQ
    14.6.1 Do I need a random sample; power analysis
14.7 Summary of simple linear regression


The suggested citation for this chapter of notes is:

Schwarz, C. J. (2019). Correlation and simple linear regression. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2019-11-04.

14.1 Introduction

A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and Creighton, L. (2003) Regression using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. In particular, the simplest case deals with the relationship between two variables.

Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.

                               X is Interval or Ratio            X is Nominal or Ordinal
                               (what JMP calls Continuous)

Y is Interval or Ratio         • Scatterplots                    • Side-by-side dot plot
(what JMP calls Continuous)    • Running median/spline fit       • Side-by-side boxplot
                               • Regression                      • ANOVA or t-tests
                               • Correlation

Y is Nominal or Ordinal        • Logistic regression             • Mosaic chart
                                                                 • Contingency tables
                                                                 • Chi-square tests

In JMP these combinations of two variables are handled by the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.

When analyzing two variables, one question becomes important as it determines the type of analysis that will be done. Is the purpose to explore the nature of the relationship, or is the purpose to use one variable to explain variation in another variable? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight.


Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship will be described, and a regression analysis, in which one variable will be used to predict the values of a second variable.

The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the observed outcomes.

14.2 Graphical displays

14.2.1 Scatterplots

The scatter-plot is the primary graphical tool used when exploring the relationship between two interval or ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X platform – be sure that both variables have a continuous scale.

In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn’t matter which variable is plotted on which axis – this usually only happens when finding the correlation between the variables is the primary purpose.

For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]
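For readers following along in R rather than JMP, a rough sketch of the same plot is given below. The file name cereal.csv and the column names calories and fat are assumptions (the cereal data ships with JMP, not R), so adjust them to match your copy of the data.

library(ggplot2)
# Sketch only: 'cereal.csv' with columns 'calories' and 'fat' is an assumed
# stand-in for the cereal dataset that ships with JMP.
cereal <- read.csv("cereal.csv", header=TRUE, strip.white=TRUE)

ggplot(data=cereal, aes(x=fat, y=calories)) +   # response on the vertical (Y) axis
  geom_point(size=3) +
  xlab("Fat (g/serving)") +
  ylab("Calories/serving") +
  ggtitle("Calories vs. fat")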

What to look for in a scatter-plot

Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of the other. The plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of the other variable. The plot will have a downward slope. What happens when there is “no association” between the two variables?

Form of the relationship. Does a straight line seem to fit through the ‘middle’ of the points? Is the line linear (the points seem to cluster around a straight line) or is it curvi-linear (the points seem to form a curve)?

Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter above and below the trend line is very small, then there is a strong association.

Outliers. Are there any points that seem to be unusual? Outliers are values that are unusually far from the trend curve - i.e., they are further away from the trend curve than you would expect from the usual level of scatter. There is no formal rule for detecting outliers - use common sense. [If you set the role of a variable to be a label, and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]

One’s usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point in even though it doesn’t follow the trend line, or do you drop the data point because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier - if there is very little difference in the final outcome, don’t worry about it.

In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole in the Antarctic was missed because the computers were programmed to ignore readings that were so low that ‘they must be in error’!

Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.

For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and each variable is independently driven by population growth.

Sometimes the lurking variable is a ’grouping’ variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses.

The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.

It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows. Then assign markers to those rows using the Rows->Markers menu.

14.2.2 Smoothers

Once the scatter-plot is plotted, it is natural to try and summarize the underlying trend line. For example, consider the following data:


There are several common methods available to fit a line through this data.

By eye The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals. And the trend line should try and go through the middle of most of the data.

Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use - bear in mind that the best ways of estimating trend curves will try and mimic what the eye does so well.

Median or mean trace The idea is very simple. We choose a “window” width of size w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the “window” centered on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen the smoother the result. However, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).

The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.
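Although JMP does not produce a trace, the idea is simple enough to sketch in R. The function below is a minimal illustration only; the variable names x and y, the window width w, and the made-up data are all placeholders.

# Minimal mean/median trace: for each grid point along the X axis, smooth by
# taking the mean (or median) of the Y values whose X values lie within a
# window of total width w centred on that grid point.
trace.smooth <- function(x, y, w, ngrid=50, fun=mean) {
  grid <- seq(min(x), max(x), length.out=ngrid)
  sm   <- sapply(grid, function(g) {
    in.window <- abs(x - g) <= w/2
    if (any(in.window)) fun(y[in.window]) else NA
  })
  list(x=grid, y=sm)
}

# Made-up data: wider windows give smoother traces that react more slowly to changes in trend
set.seed(1)
x <- runif(200, 0, 10)
y <- sin(x) + rnorm(200, sd=0.4)
plot(x, y)
lines(trace.smooth(x, y, w=1), col="red",  lwd=2)   # narrow window
lines(trace.smooth(x, y, w=3), col="blue", lwd=2)   # wide window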

Box plots for strips The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals. Then separate box plots of the values of Y are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend. There is even more detailed information available in the box plots about the shape of the Y-distribution etc. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below.
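A rough equivalent in R (again with made-up x and y values as placeholders) cuts the X axis into equal-width strips and draws one box plot of Y per strip:

# Sketch: equal-width strips of X, one box plot of Y per strip,
# with the strip medians joined to show the trend.
set.seed(2)
x <- runif(300, 0, 10)
y <- sin(x) + rnorm(300, sd=0.4)

strip <- cut(x, breaks=6)                      # 6 equal-width intervals of X
boxplot(y ~ strip, xlab="Strip of X", ylab="Y")
lines(seq_along(levels(strip)), tapply(y, strip, median), col="red", lwd=2)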

Spline methods A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline indicates how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:
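The spline figures above come from JMP; a comparable sketch in R uses smooth.spline(), where the spar argument plays the role of the stiffness setting (the data and spar values below are arbitrary illustrations, not the JMP settings):

# Sketch: two spline fits to the same (made-up) data with different stiffness.
set.seed(3)
x <- runif(150, 0, 10)
y <- sin(x) + rnorm(150, sd=0.4)

plot(x, y)
lines(smooth.spline(x, y, spar=0.5), col="red",  lwd=2)  # flexible (low stiffness)
lines(smooth.spline(x, y, spar=1.0), col="blue", lwd=2)  # stiff (nearly straight)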


14.3 Correlation

WARNING!: Correlation is probably the most abused concept in statistics. Many people use the word ‘correlation’ to mean any type of association between two variables, but it has a very strict technical meaning, i.e. the strength of an apparent linear relationship between the two interval or ratio scaled variables.

The correlation measure does not distinguish between explanatory and response variables and it treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.

Correlations are computed in JMP using the Analyze->Correlation of Y’s platform. If there are several variables, then the data will be organized into a table. Each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable1 and variable2 is the same as between variable2 and variable1), only part of the complete matrix will be shown. As well, the correlation between any variable and itself is always 1.

14.3.1 Scatter-plot matrix

To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre and the following variables were measured on each subject:

• name

• gender


• age

• weight

• oxygen consumption (high values are typically more fit people)

• time to run one mile (1.6 km)

• average pulse rate during the run

• the resting pulse rate

• maximum pulse rate during the run.

We are interested in examining the relationship among the variables. At the moment, ignore the fact that the data contains both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]

One of the first things to do is to create a scatter-plot matrix of all the variables. Use the Analyze->Correlation of Ys platform to get the following scatter-plot:


Interpreting the scatter plot matrix

The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1 column 3 represents the scatter-plot between age and oxygen consumption with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3 column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.

There is clearly a difference in the ’strength’ of relationships. Compare the scatter plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6) to that of running pulse rate and weight (row 5, column 2).

Similarly, there is a difference in the direction of association. Compare the scatter plot for the average running pulse rate and maximum pulse rate (row 5, column 7) and that for oxygen consumption and running time (row 3, column 4).


14.3.2 Correlation coefficient

It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.

The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

ρ = (1/N) × Σ [ (Xi − µX) / σX ] × [ (Yi − µY) / σY ]

where the sum is over all N units (i = 1, …, N) in the population.

The corresponding sample correlation coefficient, denoted r, has a similar form:1

r = (1/(n − 1)) × Σ [ (Xi − X̄) / sx ] × [ (Yi − Ȳ) / sy ]

where the sum is over the n sampled pairs (i = 1, …, n).

If the sampling scheme is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption. If the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.
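As a quick check on the definition, the sketch below (made-up data only) computes r directly from the formula above and with R's built-in cor() function; cor.test() gives the (rarely used) test that ρ is zero and a confidence interval for ρ.

# Made-up (X, Y) pairs for illustration only
set.seed(4)
X <- rnorm(30, mean=50, sd=10)
Y <- 2 + 0.5*X + rnorm(30, sd=5)

# r from the definition (fine for illustration, though not the numerically
# preferred computing formula -- see the footnote)
r.def <- sum( (X - mean(X))/sd(X) * (Y - mean(Y))/sd(Y) ) / (length(X) - 1)
r.def
cor(X, Y)       # built-in computation; matches r.def
cor.test(X, Y)  # test of rho = 0 plus a confidence interval for rho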

The form of the formula does provide some insight into interpreting its value.

• ρ and r (unlike other population parameters) are unitless measures.

• the sign of ρ and r is largely determined by how each (X, Y) pair falls relative to the respective means, i.e. if both X and Y are above their means, or both are below their means, the pair contributes a positive value towards ρ or r, while if X is above its mean and Y is below its mean (or vice versa), the pair contributes a negative value towards ρ or r.

• ρ and r range from -1 to 1. A value of ρ or r equal to -1 implies a perfect negative correlation; a value of ρ or r equal to 1 implies a perfect positive correlation; a value of ρ or r equal to 0 implies no correlation. A perfect correlation (i.e. ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - give some examples (see also the short R demonstration after this list).

• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

• ρ and r only measure the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
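The points above about the slope and about linear changes of units are easy to demonstrate with a few lines of R (made-up data, for illustration only):

# Correlation ignores the slope of the line and linear changes of units.
set.seed(5)
X <- rnorm(25)

cor(X, 10*X + 3)      # steep perfect line: r = 1
cor(X, 0.01*X + 3)    # nearly flat perfect line: still r = 1

Y <- 5 + 2*X + rnorm(25, sd=1)
cor(X, Y)
cor(2.54*X, Y)        # change of units on X (e.g. inches to cm): r unchanged
cor(X, 9/5*Y + 32)    # change of units on Y (e.g. Celsius to Fahrenheit): r unchanged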

Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation

• between gender and oxygen consumption (gender is nominal scale data);

• between variables whose relationship is non-linear (not shown on graph);

1 Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.


• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sample scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.

The data collection scheme for the fitness data set is unknown - we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.

Before looking at the details of its computation, look at the sample correlation coefficients for each scatter plot above. These can be arranged into a matrix:

Variable     Age  Weight    Oxy  Runtime  RunPulse  RstPulse  MaxPulse
Age         1.00   -0.24  -0.31     0.19     -0.31     -0.15     -0.41
Weight     -0.24    1.00  -0.16     0.14      0.18      0.04      0.24
Oxy        -0.31   -0.16   1.00    -0.86     -0.39     -0.39     -0.23
Runtime     0.19    0.14  -0.86     1.00      0.31      0.45      0.22
RunPulse   -0.31    0.18  -0.39     0.31      1.00      0.35      0.92
RstPulse   -0.15    0.04  -0.39     0.45      0.35      1.00      0.30
MaxPulse   -0.41    0.24  -0.23     0.22      0.92      0.30      1.00

Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables – this explains the symmetry in the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself – this explains the values of 1 along the main diagonal.

Compare the sample correlations between the average running pulse rate and the other variables to the corresponding scatter-plots above.
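If the fitness data were exported from JMP and read into R, the scatter-plot matrix and the correlation matrix could be reproduced as sketched below; the file name fitness.csv and the column names are assumptions and may differ from the actual JMP table.

# Sketch: assumes a fitness.csv export with these (assumed) column names.
fitness <- read.csv("fitness.csv", header=TRUE, strip.white=TRUE)
vars <- c("age", "weight", "oxy", "runtime", "runpulse", "rstpulse", "maxpulse")

pairs(fitness[, vars])                                        # scatter-plot matrix
round(cor(fitness[, vars], use="pairwise.complete.obs"), 2)   # correlation matrix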

14.3.3 Cautions

• Random Sampling Required Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion or if certain data points were oversampled, then the correlation coefficient may be severely biased.

• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class. This illustrates why I almost never talk about correlation.

• correlation measures the ‘strength’ of a linear relationship; a curvilinear relationship may have a correlation of 0 even though there is still a strong (non-linear) association.


• the effect of outliers and high leverage points will be presented in class

• effects of lurking variables. For example, suppose there is a positive association between wages of male nurses and years of experience, and between wages of female nurses and years of experience, but males are generally paid more than females. There is a positive correlation within each group, but an overall negative correlation when the data are pooled together.


• ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between two variables on their averages, it does not imply that there is a correlation between individual data values.

For example, if you look at the average consumption of alcohol and the consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent as shown below:

The plot of cigarette consumption against alcohol consumption shows no relationship within each province, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.

• correlation does not imply causation. This is the most frequent mistake made by people. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.

14.3.4 Principles of Causation

Types of association

An association may be found between two variables for several reasons (show causal modeling figures):

• There may be direct causation, e.g. smoking causes lung cancer.

• There may be a common cause, e.g. ice cream sales and number of drownings both increase with temperature.

• There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people drove fewer miles.

• There may be a coincidence, e.g., the population of Canada has increased at the same time as the moon has gotten closer by a few miles.

Establishing cause-and-effect

How do we establish a cause and effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.

Strength of the association. The stronger an observed association appears over a series of different studies, the less likely this association is spurious because of bias.

Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.

Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to explain certain findings.

Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.

Specificity of the association. The observed effect is associated with only the suspected cause (or few other causes that can be ruled out).


IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!

Examples:

Discuss the above in relation to:

• amount of studying vs. grades in a course.

• amount of clear cutting and sediments in water.

• fossil fuel burning and the greenhouse effect.

14.4 Single-variable regression

14.4.1 Introduction

Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.

There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:

Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.

Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of regression analysis. Please consult the above references for all the gory details.

It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models, which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....

The key difference between a Regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile was unspecified (the null hypothesis was that all means were equal while the alternate was that at least one mean differs), while in regression the response profile must be a straight line.

Because both ANOVA and regression are from the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.


14.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values.

This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx. Now a is the intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X. (The distinction between β0 and b0 will be made clearer in a few minutes.) The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

Review the definition of the intercept as the value of Y when X = 0, and of the slope as the change in Y per unit change in X.

14.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and in estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is important as part of any study. Conceptually, we can think of the large set of all units of interest. On each unit there is, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]

If this were physics, we may conceive of a physical law between X and Y, e.g. F = ma or PV = nRt. However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]

We denote this relationship as

Y = β0 + β1X + ε

where now β0, β1 are the POPULATION intercept and slope respectively. We say that

E[Y] = β0 + β1X

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fit on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]

Of course, we can never measure all units of the population. So a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population and more elaborate schemes can be used. The bare minimum that must be achieved is that for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.

This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can’t randomly assign a treatment to a unit (e.g. we can’t assign sex to an animal) we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!

14.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity

Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot between Y and X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required with transformations in dealing with the error structure, as you will see in later examples.

There are several checks. First, plot the residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Second, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher order relationship. Third, if there are multiple readings at some X-values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line. A sketch of the second check is given below.
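The sketch below illustrates the residual plot and the quadratic-term check in R; the data are made up with a mild curvature, purely for illustration.

# Made-up data with mild curvature, for illustration only
set.seed(6)
mydata <- data.frame(x = runif(40, 0, 10))
mydata$y <- 3 + 0.8*mydata$x + 0.05*mydata$x^2 + rnorm(40, sd=1)

# (a) residuals vs. X from the straight-line fit
fit.lin <- lm(y ~ x, data=mydata)
plot(mydata$x, resid(fit.lin), xlab="X", ylab="Residual")
abline(h=0, lty=2)

# (b) add a quadratic term and test its coefficient
fit.quad <- lm(y ~ x + I(x^2), data=mydata)
summary(fit.quad)           # look at the p-value for the I(x^2) term
anova(fit.lin, fit.quad)    # equivalent comparison of the two models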

Correct scale of predictor and response

The response and predictor variables must both have an interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.

Correct sampling scheme

The Y must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.


No outliers or influential points

All the points must belong to the relationship – there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is an outlier and an influential point:

Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at plots of the residuals against X to see if the scatter is roughly uniform around zero, with no increase and no decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. The most common cases where this fails are time series data where X is a time measurement. In these cases, time series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.

Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y’s ignoring the Xs. The assumption only states that the residuals, the difference between the value of Y and the point on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power of detecting non-normality, and for large sample sizes it is not that important.
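In R the check is a normal quantile plot of the residuals from the fitted line, not of the raw Y values; a minimal sketch with made-up data:

# Sketch: normal quantile plot of the RESIDUALS (not of the raw Y values).
set.seed(7)
x <- runif(30, 0, 10)
y <- 2 + 1.5*x + rnorm(30, sd=2)   # made-up data
fit <- lm(y ~ x)

qqnorm(resid(fit))
qqline(resid(fit))
# shapiro.test(resid(fit))   # a formal test if desired (little power for small n)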


X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This general problem is called the “error in variables” problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case after Berkson who first examined this situation. The most common cases are where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.

However, if the value used for X is an actual measurement of the underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the population values! For example, suppose that yield of a crop is related to the amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown, but may be read at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of “error in variables” is very difficult to analyze properly and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details.

The problem is set up as follows. Let

Yi = ηi + εi
Xi = ξi + δi

with the straight-line relationship between the population (but unobserved) values:

ηi = β0 + β1ξi

Note the (population, but unknown) regression equation uses ξi rather than the observed (with error) values Xi.

Now if the regression is done on the observed X (i.e. the error prone measurement), the regression equation reduces to:

Yi = β0 + β1Xi + (εi − β1δi)

Now this violates the independence assumption of ordinary least squares because the new “error” term is not independent of the Xi variable.

If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

E[b1] = β1 − β1 r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to that of the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate, and in general, pulls the estimate towards zero.


The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, then r = 0 and the bias vanishes as expected.

• if the X are fixed (the Berkson case) and actually used2, then ρ + r = 0 and the bias also vanishes.

The proper analysis of the error-in-variables case is quite complex – see Draper and Smith (1998, p. 91) for more details.
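The attenuation effect is easy to see in a small simulation. The sketch below is illustrative only: the true slope is 1, and because the simulated measurement error in X has the same variance as the true X values, the ordinary least squares slope on the error-prone X is pulled towards roughly half its true value.

# Simulation sketch: measurement error in X attenuates the estimated slope.
set.seed(8)
n    <- 1000
xi   <- rnorm(n, mean=0, sd=1)        # true (unobserved) X values
Y    <- 0 + 1*xi + rnorm(n, sd=1)     # true slope beta1 = 1
Xobs <- xi + rnorm(n, sd=1)           # observed X = true X + measurement error

coef(lm(Y ~ xi))    # slope near 1: X measured without error
coef(lm(Y ~ Xobs))  # slope pulled towards 0 (about 0.5 with these variances)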

14.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation for a particular sample of points is expressed as Ŷi = b0 + b1Xi, where b0 is the estimated intercept, and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to a line in the entire population.

How is the best fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.

Mathematically, the least squares line is the line that minimizes

(1/n) Σ (Yi − Ŷi)²

where Ŷi is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important - the concept as expressed in the previous paragraph is more important – in particular it is the SQUARED deviation in the VERTICAL direction that is used.

It is possible to write out a formula for the estimated intercept and slope, but who cares - let the computer do the dirty work.

The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease.

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae, but in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.

Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally,

2 For example, a thermostat measures (with error) the actual temperature of a room. But if the experiment is based on the thermostat readings rather than the (true) unknown temperature, this corresponds to the Berkson case.


the null hypothesis is:

H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

A: β1 ≠ 0

although one-sided tests looking for either a positive or negative slope are possible.

The test statistic is found as

T = (b1 − 0) / se(b1)

and is compared to a t-distribution with appropriate degrees of freedom to obtain the p-value. This is usually automatically done by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.

14.4.6 Obtaining Predictions

Once the best fitting line is found it can be used to make predictions for new values of X.

There are two types of predictions that are commonly made. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X.3 The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.

3 There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.


In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.

Many textbooks have the formulae for the se for the two types of predictions, but again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
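In R the two intervals come from the same predict() call with different interval= arguments. The sketch below assumes the plants.fit object created in the fertilizer example of Section 14.4.8 and a new fertilizer amount of 16 kg/ha:

# Sketch: requires the plants.fit object from the fertilizer example below.
newX <- data.frame(fertilizer=16)

# Confidence interval for the MEAN yield of all plots receiving 16 kg/ha
predict(plants.fit, newdata=newX, interval="confidence")

# Prediction interval for the yield of a SINGLE future plot receiving 16 kg/ha
predict(plants.fit, newdata=newX, interval="prediction")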

14.4.7 Residual Plots

After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as: residuali = Yi − (b0 + b1Xi) = (Yi − Ŷi).

There are several standard residual plots:

• plot of residuals vs. predicted (Ŷ);

• plot of residuals vs. X;

• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don’t plot residuals vs. Y - this will lead to odd looking plots which are an artifact of the plot and don’t mean anything.
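A sketch of the three standard plots in R, using a made-up data set as a placeholder for a real fitted line:

# Sketch: the three standard residual plots for a fitted straight line.
set.seed(9)
x <- runif(50, 0, 10)
y <- 1 + 2*x + rnorm(50, sd=2)   # made-up data
fit <- lm(y ~ x)

op <- par(mfrow=c(1, 3))
plot(fitted(fit), resid(fit), xlab="Predicted", ylab="Residual"); abline(h=0, lty=2)
plot(x,           resid(fit), xlab="X",         ylab="Residual"); abline(h=0, lty=2)
plot(seq_along(resid(fit)), resid(fit),
     xlab="Time/row order",  ylab="Residual");  abline(h=0, lty=2)
par(op)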

14.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.


Fertilizer (kg/ha)   Yield (Liters)
        12                24
         5                18
        15                31
        17                33
        14                30
         6                20
        11                25
        13                27
        15                31
         8                21
        18                29

The data is available in the fertilizer.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The data are imported into R using the read.csv() function:

plants <- read.csv("fertilizer.csv", header=TRUE,
                   as.is=TRUE, strip.white=TRUE,
                   na.strings=".")

head(plants)
str(plants)

The raw data are shown below:

   fertilizer yield
1           5    18
2           6    20
3           8    21
4          11    25
5          12    24
6          13    27
7          14    30
8          15    31
9          15    31
10         17    33
11         18    29
12         16    NA

In this study, it is quite clear that the fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can’t) you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form: Y = β0 + β1 × (amount of fertilizer) + ε, where β0 and β1 represent the population intercept and population slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0 - the population average yield when the amount of fertilizer is 0 - and β1 - the population average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain as the entire population could never be measured.

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Notice how missing values are represented.

Start by plotting the data

library(ggplot2)   # needed for aes(), ggtitle(), geom_point(), etc.
plotprelim <- ggplot2::ggplot(data=plants, aes(x=fertilizer, y=yield)) +
  ggtitle("Yield vs. Fertilizer") +
  xlab("Fertilizer") +
  ylab("Yield") +
  geom_point(size=4)

plotprelim

The relationship looks approximately linear; there don’t appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

The lm() function is used to fit regression lines. The first argument to the lm() function, yield ~ fertilizer, is “short-hand” for a statistical model. The response (Y) variable is on the left of the ~ and the predictor variable(s) are on the right of the ~. The data=plants argument indicates that the variables in the formula are found in the plants data frame.

plants.fit <- lm(yield ~ fertilizer, data=plants)
anova(plants.fit)
summary(plants.fit)

giving:

Analysis of Variance Table

Response: yield
           Df  Sum Sq Mean Sq F value    Pr(>F)
fertilizer  1 225.180 225.180   69.88 1.555e-05 ***
Residuals   9  29.001   3.222
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Call:
lm(formula = yield ~ fertilizer, data = plants)

Residuals:
    Min      1Q  Median      3Q     Max
-3.6807 -0.5149  0.0289  1.5220  1.7248

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.8560     1.6938   7.590 3.36e-05 ***
fertilizer    1.1014     0.1318   8.359 1.56e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.795 on 9 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.8859, Adjusted R-squared: 0.8732
F-statistic: 69.88 on 1 and 9 DF, p-value: 1.555e-05

The estimated regression line is

Y = b0 + b1(fertilizer) = 12.856 + 1.10137(amount of fertilizer)

In terms of estimates, b0=12.856 is the estimated intercept, and b1=1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?

Once again, these are the results from a single experiment. If another experiment was repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments would describe the variation in b0 and b1 over all possible experiments. The standard deviation of b0 and b1 over all possible experiments is again referred to as the standard error of b0 and b1.

The formulae for the standard errors of b0 and b1 are messy, and hopeless to compute by hand. And just like inference for a mean or a proportion, the program automatically computes the se of the regression estimates.

The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as

   1.101 ± 2 × (0.132) = 1.101 ± 0.264 = (0.837 → 1.365) L/kg

of fertilizer applied.

confint(plants.fit)

giving:

                2.5 %    97.5 %
(Intercept) 9.0244198 16.687627
fertilizer  0.8033275  1.399415

The "exact" confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as 'being 95% confident that the population increase in yield when the amount of fertilizer is increased by one unit is somewhere between (0.837 to 1.365) L/kg.'
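As a check on the arithmetic, the same t-based interval can be reproduced by hand from the fitted object. This is a minimal sketch using standard extractor functions:

# Sketch: reproduce the exact t-based confidence interval for the slope by hand
b1    <- coef(plants.fit)["fertilizer"]               # estimated slope
se.b1 <- sqrt(diag(vcov(plants.fit)))["fertilizer"]   # its standard error
df    <- df.residual(plants.fit)                      # n - 2 = 9 here
b1 + c(-1, 1) * qt(0.975, df) * se.b1                 # matches confint(plants.fit)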

Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.

In linear regression problems, one hypothesis of interest is whether the population slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.

R also produces a test of the hypothesis that each of the parameters (the slope and the intercept in the population) is zero. The output is reproduced again below:

Call:
lm(formula = yield ~ fertilizer, data = plants)

Residuals:
     Min       1Q   Median       3Q      Max
-3.68071 -0.51494  0.02889  1.52204  1.72478

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  12.8560     1.6938   7.590 3.36e-05 ***
fertilizer    1.1014     0.1318   8.359 1.56e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.795 on 9 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.8859, Adjusted R-squared: 0.8732
F-statistic: 69.88 on 1 and 9 DF, p-value: 1.555e-05

The test of hypothesis about the intercept is not of interest (why?).

Let

• β1 be the population (unknown) slope.

• b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.

1. Specify the null and alternate hypothesis:

   H: β1 = 0
   A: β1 ≠ 0.

   Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

   T = (estimate − hypothesized value) / (estimated se) = (1.1014 − 0) / 0.132 = 8.36

   In other words, the estimate is over 8 standard errors away from the hypothesized value!

   This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).

3. Conclusion. There is strong evidence that the population slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the population slope are from about 0.8 to about 1.4.

It is possible to construct tests of the slope equal to some value other than 0. Most packages can't do this. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.

It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the packages divided by 2.
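Neither of these adjustments is built into the summary() output, but both are easy to compute by hand from the fitted object. A minimal sketch (the hypothesized slope of 1 is chosen only for illustration):

# Sketch: test H: beta1 = 1 (any value other than 0) and obtain a one-sided p-value
b1    <- coef(plants.fit)["fertilizer"]
se.b1 <- sqrt(diag(vcov(plants.fit)))["fertilizer"]
df    <- df.residual(plants.fit)
tstat <- (b1 - 1) / se.b1            # replace 0 by the hypothesized value
p.two <- 2 * pt(-abs(tstat), df)     # two-sided p-value
p.one <- p.two / 2                   # one-sided p-value (if the estimate lies in the
                                     # direction of the alternate hypothesis)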

If sufficient evidence is found against the hypothesis, a natural question to ask is 'well, what values of the parameter are plausible given this data'. This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.

What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer are applied?

The predicted value is found by substituting the new X into the estimated regression line.

Y = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L
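The same value can be computed directly from the estimated coefficients; a one-line sketch:

# Sketch: prediction at fertilizer = 16 computed from the estimated coefficients
sum(coef(plants.fit) * c(1, 16))     # b0 + b1*16, approximately 30.48 L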

Predictions can be obtained from the fitted model using the predict() function. You should put the new data into a new data frame.

# make predictions
# First set up the points where you want predictions
newfertilizers <- data.frame(fertilizer=seq(min(plants$fertilizer, na.rm=TRUE),
                                            max(plants$fertilizer, na.rm=TRUE), 1))
newfertilizers[1:5,]
str(newfertilizers)

The new data frame looks like:

[1] 5 6 7 8 9
'data.frame': 14 obs. of 1 variable:
 $ fertilizer: num 5 6 7 8 9 10 11 12 13 14 ...

There are two types of predictions that can be made as noted below. First, predictions for the mean response at a new X:

# Predict the AVERAGE yield at each fertilizer
# You need to specify help(predict.lm) to see the documentation
predict.avg <- predict(plants.fit, newdata=newfertilizers,
                       se.fit=TRUE, interval="confidence")
# This creates a list that you need to restructure to make it look nice
predict.avg.df <- cbind(newfertilizers, predict.avg$fit, se=predict.avg$se.fit)
tail(predict.avg.df)

# Add the confidence intervals to the plot
plotfit.avgci <- plotfit +
   geom_ribbon(data=predict.avg.df, aes(x=fertilizer, y=NULL, ymin=lwr, ymax=upr),
               alpha=0.2)
plotfit.avgci

giving:

   fertilizer      fit      lwr      upr
9          13 27.17385 22.92548 31.42222
10         14 28.27522 23.99938 32.55106
11         15 29.37659 25.05286 33.70033
12         16 30.47796 26.08658 34.86934
13         17 31.57933 27.10146 36.05721
14         18 32.68071 28.09854 37.26287

Second, predictions for individual responses:

# Predict the INDIVIDUAL yield at each fertilizer
# R does not produce the se for individual predictions
predict.indiv <- predict(plants.fit, newdata=newfertilizers,
                         interval="prediction")
# This creates a list that you need to restructure to make it look nice
predict.indiv.df <- cbind(newfertilizers, predict.indiv)
tail(predict.indiv.df)

# Add the prediction intervals to the plot
plotfit.indivci <- plotfit.avgci +
   geom_ribbon(data=predict.indiv.df, aes(x=fertilizer, y=NULL, ymin=lwr, ymax=upr),
               alpha=0.1)
plotfit.indivci

giving:

   fertilizer      fit      lwr      upr
9          13 27.17385 22.92548 31.42222
10         14 28.27522 23.99938 32.55106
11         15 29.37659 25.05286 33.70033
12         16 30.47796 26.08658 34.86934
13         17 31.57933 27.10146 36.05721
14         18 32.68071 28.09854 37.26287

As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield for all future plots when 16 kg/ha of fertilizer is added. The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

The two types of intervals can be plotted:

The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.

Residual plots (and other plots to assess the fit of the model) can also be produced.

# look at diagnostic plot
plotdiag <- autoplot(plants.fit)
plotdiag

The residuals are simply the difference between the actual data point and the corresponding spot on the line measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.


14.4.9 Example - Mercury pollution

Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.

Here are the raw data:

Methyl Mercury Intake   Mercury in whole blood
(ug Hg/day)             (ng/g)

180 90

200 120

230 125

410 290

600 310

550 290

275 170

580 375

600 150

105 70

250 105

60 205

650 480

The data is available in the mercury.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The data are imported into R using the read.csv() function:

merc <- read.csv("mercury.csv", header=TRUE, as.is=TRUE, strip.white=TRUE)
merc

The raw data are shown below:

   intake blood
1     180    90
2     200   120
3     230   125
4     410   290
5     600   310
6     550   290
7     275   170
8     580   375
9     600   150
10    105    70
11    250   105
12     60   205
13    650   480

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.

The population of interest is the people around recently flooded lakes.

This experiment is an analytical survey as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).

The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.

We start by producing the scatter-plot.

prelimplot <- ggplot(data=merc, aes(x=intake, y=blood))+
   ggtitle('Blood Hg vs. intake Hg')+
   ylab('Blood Hg')+xlab('Intake Hg')+
   geom_point()

prelimplot

with output:


There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was fit using all of the data.

Note the use of the residuals() function to extract the residuals from the fitted object and the use of the plot() function to plot the extracted residuals.

blood.fit.outliers <- lm(blood ~ intake, data=merc)
summary(blood.fit.outliers)

diagplot <- autoplot(blood.fit.outliers)
diagplot

with output:

Call:
lm(formula = blood ~ intake, data = merc)

Residuals:
    Min      1Q  Median      3Q     Max
-172.20  -29.62  -12.20   53.86  135.15

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  50.4441    47.7674   1.056  0.31359
intake        0.4529     0.1154   3.925  0.00237 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 84.45 on 11 degrees of freedom
Multiple R-squared: 0.5834, Adjusted R-squared: 0.5456
F-statistic: 15.41 on 1 and 11 DF, p-value: 0.002372

and residual plot

The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.

Unfortunately, in R there is no simple way to click on the plot to identify the outliers as in packages such as JMP. Some old-fashioned detective work is required to find and remove the points.
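One simple way to do this detective work is to list the cases with large residuals before deciding what to drop. A small sketch (the cut-off of 100 is a judgement call for this data set):

# Sketch: identify the rows whose residuals from the initial fit are large
merc[abs(residuals(blood.fit.outliers)) > 100, ]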

keep <- abs(residuals(blood.fit.outliers)) < 100
merc2 <- merc[keep,]
blood.fit <- lm(blood ~ intake, data=merc2)
summary(blood.fit)

diagplot <- autoplot(blood.fit)   # diagnostic plots for the refit without the outliers
diagplot

fitplot <- prelimplot + geom_smooth(method="lm", se=FALSE)
fitplot

with output:

Call:
lm(formula = blood ~ intake, data = merc2)

Residuals:
   Min     1Q Median     3Q    Max
-38.35 -23.96  -0.51  11.82  53.65

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) -1.95169   22.71513  -0.086    0.934
intake       0.58122    0.05983   9.715 1.05e-05 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 32.72 on 8 degrees of freedom
Multiple R-squared: 0.9219, Adjusted R-squared: 0.9121
F-statistic: 94.37 on 1 and 8 DF, p-value: 1.053e-05

and residual plot


The estimated regression line (after removing outliers) is

   Blood = −1.951691 + 0.581218 × Intake.

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment. The negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this and what implications does this have for worrying that it is not zero?)4

What would have been the impact upon the estimated slope and intercept if the outliers had been retained?

The estimated slope has been determined relatively well (relative standard error of about 10% – how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
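A small sketch of how the relative standard error could be computed from the fitted object:

# Sketch: relative standard error of the estimated slope (se / estimate)
b1    <- coef(blood.fit)["intake"]
se.b1 <- sqrt(diag(vcov(blood.fit)))["intake"]
se.b1 / b1          # about 0.06/0.58, i.e. roughly 10%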

4 It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.


The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Secondly, the average level would be of interest to public health authorities.

Note we sort the X values before plotting the confidence limits to get a nice plot.

my.pred.mean  <- predict(blood.fit, newdata=merc, interval='confidence')
my.pred.indiv <- predict(blood.fit, newdata=merc, interval='prediction')

plot(merc$intake, merc$blood, main='Blood Hg vs. intake Hg')
abline(blood.fit)   # notice use of the fitted object
# We need to sort by the intake values to get nice plots
lines(merc$intake[order(merc$intake)], my.pred.mean [order(merc$intake),"lwr"], lty=2)
lines(merc$intake[order(merc$intake)], my.pred.mean [order(merc$intake),"upr"], lty=2)
lines(merc$intake[order(merc$intake)], my.pred.indiv[order(merc$intake),"lwr"], lty=3)
lines(merc$intake[order(merc$intake)], my.pred.indiv[order(merc$intake),"upr"], lty=3)

to get:


14.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets gave exactly the same results when a regression line was fit, yet are quite different in their interpretation.

The Anscombe data is available at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. Fitting of regression lines to this data will be demonstrated in class.

14.4.11 Transformations

In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example.

Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption of the error structure. The model for a fit on transformed data is of the form

   trans(Y) = β0 + β1 × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular that the population standard deviation around the regression line is constant on the transformed scale.

The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm transformation (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other transform. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use this to refer to the ln transformation, while others use this to refer to the log10 transformation.
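A minimal sketch illustrating this, using a hypothetical data frame dat with variables y and x (the object names are assumptions for illustration only, not part of the examples in this chapter):

# Sketch: the ln-scale slope is log(10) = 2.302... times the log10-scale slope
fit.ln    <- lm(log(y)   ~ x, data=dat)
fit.log10 <- lm(log10(y) ~ x, data=dat)
coef(fit.ln)["x"] / coef(fit.log10)["x"]    # equals log(10), about 2.302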

After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used and Yt denotes the response at time t. Then we have

   ln(Y(t+1)) = b0 + b1 × (t + 1)
   ln(Yt)     = b0 + b1 × t

and subtracting gives

   ln(Y(t+1)) − ln(Yt) = ln(Y(t+1)/Yt) = b1 × (t + 1 − t) = b1

so that

   exp(ln(Y(t+1)/Yt)) = Y(t+1)/Yt = exp(b1) = e^b1

Hence a one unit increase in X causes Y to be MULTIPLIED by e^b1. As an example, suppose that on the log-scale the estimated slope was −0.07. Then every unit change in X causes Y to change by a multiplicative factor of e^−0.07 = 0.93, i.e. roughly a 7% decline per year.5
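The back-transformation itself is a single line in R; a sketch using the example value above:

# Sketch: multiplicative change in Y per unit change in X for a log-scale slope of -0.07
exp(-0.07)     # about 0.93, i.e. roughly a 7% decline per unit change in X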

Similarly, predictions on the transformed scale must be back-transformed to the untransformed scale.

In some problems, scientists search for the 'best' transform. This is not an easy task and using simple statistics such as R2 to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.

JMP makes it particularly easy to fit regressions to transformed data as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.

5 It can be shown that on the log scale, for smallish values of the slope, the change is almost the same on the untransformed scale, i.e. if the slope is −0.07 on the log scale, this implies roughly a 7% decline per year; a slope of +0.07 implies roughly a 7% increase per year.


14.4.12 Example: Monitoring Dioxins - transformation

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample.6 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

Here is the raw data.

Site Year TEQ

a 1990 179.05

a 1991 82.39

a 1992 130.18

a 1993 97.06

a 1994 49.34

a 1995 57.05

a 1996 57.41

a 1997 29.94

a 1998 48.48

a 1999 49.67

a 2000 34.25

a 2001 59.28

a 2002 34.92

a 2003 28.16

The data is available in the dioxinTEQ.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The data are imported into R using the read.csv() function:

crabs <- read.csv("dioxinTEQ.csv", header=TRUE,
                  as.is=TRUE, strip.white=TRUE, na.strings=".")

head(crabs)
str(crabs)

Note that both variables are numeric (R doesn't have the concept of scale of variables). The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value. It is common practice in many statistical packages to add extra rows at the end of the data set for future predictions; however, as you will see later, this is not necessary (and leads to some complications later) in R. Consequently, I usually "delete" observations with missing Y or missing X values prior to a fit.

6 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.

Part of the raw data and the structure of the data frame are shown below:

  site year WHO.TEQ
1    a 1990  179.05
2    a 1991   82.39
3    a 1992  130.18
4    a 1993   97.06
5    a 1994   49.34
6    a 1995   57.05
'data.frame': 15 obs. of 3 variables:
 $ site   : chr "a" "a" "a" "a" ...
 $ year   : int 1990 1991 1992 1993 1994 1995 1996 1997 1998 1999 ...
 $ WHO.TEQ: num 179.1 82.4 130.2 97.1 49.3 ...

As with all analyses, start with a preliminary plot of the data. This is done in the usual way using the ggplot package.

plotprelim <- ggplot(data=crabs, aes(x=year, y=WHO.TEQ))+
   ggtitle("Dioxin levels over time")+
   xlab("Year")+ylab("Dioxin levels (WHO.TEQ)")+
   geom_point()

plotprelim


The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of dioxins degrades per year, e.g. a 10% decline per year. This can be expressed in a non-linear relationship:

   TEQ = C × r^t

where C is the initial concentration, r is the rate reduction per year, and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.

If logarithms are taken, this leads to the relationship:

   log(TEQ) = log(C) + t × log(r)

which can be expressed as:

   log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).

We add a new variable to the data frame. Note that the log() function is the natural logarithm (base e) function.

crabs$logTEQ <- log(crabs$WHO.TEQ)
head(crabs)

giving:

  site year WHO.TEQ   logTEQ
1    a 1990  179.05 5.187665
2    a 1991   82.39 4.411464
3    a 1992  130.18 4.868918
4    a 1993   97.06 4.575329
5    a 1994   49.34 3.898735
6    a 1995   57.05 4.043928

A plot of log(TEQ) vs. year gives the following:

The relationship looks approximately linear; there don't appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

We use the lm() function to fit the regression model:

crabs.fit <- lm(logTEQ ~ year, data=crabs)
summary(crabs.fit)

The formula in the lm() function is what tells R that the response variable is logTEQ because it appears to the left of the tilde sign, and that the predictor variable is year because it appears to the right of the tilde sign.

The summary() function produces the table that contains the estimates of the regression coefficients and their standard errors and various other statistics.

Call:
lm(formula = logTEQ ~ year, data = crabs)

Residuals:
     Min       1Q   Median       3Q      Max
-0.59906 -0.16260 -0.01206  0.14054  0.51449

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 218.91364   42.79187   5.116 0.000255 ***
year         -0.10762    0.02143  -5.021 0.000299 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.3233 on 12 degrees of freedom
  (1 observation deleted due to missingness)
Multiple R-squared: 0.6775, Adjusted R-squared: 0.6506
F-statistic: 25.21 on 1 and 12 DF, p-value: 0.0002986

It is possible to extract all of the individual pieces using the standard methods (specialized functions to be applied to the results of a model fitting):

# Extract the individual parts of the fit using the
# standard methods
anova(crabs.fit)
coef(crabs.fit)
sqrt(diag(vcov(crabs.fit)))  # gives the SE
confint(crabs.fit)
names(summary(crabs.fit))
summary(crabs.fit)$r.squared
summary(crabs.fit)$sigma

As expected these match the previous outputs:

Analysis of Variance Table

Response: logTEQ
          Df Sum Sq Mean Sq F value    Pr(>F)
year       1 2.6349 2.63488  25.211 0.0002986 ***
Residuals 12 1.2541 0.10451
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Intercept)        year
218.9136363  -0.1076191

(Intercept)        year
 42.7918714   0.0214334

                  2.5 %       97.5 %
(Intercept) 125.6781579 312.14911470
year         -0.1543185  -0.06091975

 [1] "call"          "terms"         "residuals"     "coefficients"
 [5] "aliased"       "sigma"         "df"            "r.squared"
 [9] "adj.r.squared" "fstatistic"    "cov.unscaled"  "na.action"

[1] 0.677518
[1] 0.3232822

The fitted line is:

   log(TEQ) = 218.9 − 0.11(year)

The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−0.11) is the estimated log(ratio) from one year to the next. For example, exp(−0.11) = 0.898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year.7

The standard error of the estimated slope is 0.02. If you want to find the standard error of the anti-log of the estimated slope, you DO NOT take exp(0.02). Rather, the standard error of the anti-logged value is found as se_antilog = se_log × exp(slope) = 0.02 × 0.898 = 0.01796.8
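A small sketch of this delta-method computation from the fitted object:

# Sketch: delta-method standard error for exp(slope)
b1    <- coef(crabs.fit)["year"]
se.b1 <- sqrt(diag(vcov(crabs.fit)))["year"]
exp(b1)              # estimated year-to-year ratio, about 0.898
se.b1 * exp(b1)      # delta-method se of the ratio, about 0.019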

We find the confidence intervals using the confint() function applied to the fitted object as seen above.

The 95% confidence interval for the slope (on the log-scale) is (−0.154 to −0.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between 0.86 and 0.94 of the TEQ in one year remains to the next year.

As always, the model diagnostics should be inspected early on in the process. These are the standard residual and normal probability plots. Leverage plots are less important in the simple regression case as influential points can usually be spotted directly from the preliminary plot.

# look at diagnostic plot
plotdiag <- autoplot(crabs.fit)
plotdiag

This gives:

7 It can be shown that in regressions using log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fraction decline per time interval. For example, in the above, the estimated slope of −0.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.

8 This is computed using a method called the delta-method.

The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern was apparent in other sites as well. This type of pattern may be evidence of autocorrelation.

The Durbin-Watson test is available in the lmtest and car packages.

# check for autocorrelation using Durbin-Watson test
# You can use the durbinWatsonTest in the car package or the
# dwtest in the lmtest package
# For small sample sizes both are fine; for larger sample sizes use the lmtest package
# Note the difference in the default direction of the alternative hypothesis

durbinWatsonTest(crabs.fit)  # from the car package
dwtest(crabs.fit)            # from the lmtest package

Note that the default action of the two functions uses a different alternate hypothesis for computing the p-values (one function returns the one-sided p-value while the other function returns the two-sided p-value) and they use different approximations to compute the p-values. Hence the results may look slightly different:

 lag Autocorrelation D-W Statistic p-value
   1     -0.09311213      2.034426   0.794
 Alternative hypothesis: rho != 0

        Durbin-Watson test

data:  crabs.fit
DW = 2.0344, p-value = 0.3974
alternative hypothesis: true autocorrelation is greater than 0

Here there is no evidence of auto-correlation so we can proceed without worries.

We can add the fitted line to the plot:

# plot the fitted line on the graph
plotfit <- plotprelimlog +
   geom_abline(intercept=coef(crabs.fit)[1], slope=coef(crabs.fit)[2])
plotfit

giving:

It is possible to plot the data on the original scale by doing a back-transform of the fitted values – see the R code for details.

Several types of predictions can be made. For example, what would be the estimated mean logTEQ in 2010? What is the range of logTEQs in 2010? Again, refer back to previous chapters about the differences in predicting a mean response and predicting an individual response.

To make predictions, we first create a data frame showing the new values of X for which we want predictions:

# make predictions
# First set up the points where you want predictions
newyears <- data.frame(year=seq(min(crabs$year, na.rm=TRUE), 2030, 1))
newyears[1:5,]
str(newyears)

giving:

[1] 1990 1991 1992 1993 1994
'data.frame': 41 obs. of 1 variable:
 $ year: num 1990 1991 1992 1993 1994 ...

The predict() function is used to estimate the response and a confidence interval for the mean response at the values created above. Notice the value of the interval= argument in the predict() function to specify that the confidence interval for the mean response is wanted.

# Predict the AVERAGE dioxin level at each year
# You need to specify help(predict.lm) to see the documentation
predict.avg <- predict(crabs.fit, newdata=newyears,
                       se.fit=TRUE, interval="confidence")
# This creates a list that you need to restructure to make it look nice
predict.avg.df <- cbind(newyears, predict.avg$fit, se=predict.avg$se.fit)
head(predict.avg.df)
predict.avg.df[predict.avg.df$year==2010,]
exp(predict.avg.df[predict.avg.df$year==2010,])

# Add the confidence intervals to the plot
plotfit.avgci <- plotfit +
   geom_ribbon(data=predict.avg.df, aes(x=year, y=NULL, ymin=lwr, ymax=upr),
               alpha=0.2)+
   xlim(c(1990,2010))

plotfit.avgci

giving:

  year      fit      lwr      upr         se
1 1990 4.751590 4.394409 5.108772 0.16393399
2 1991 4.643971 4.325524 4.962419 0.14615631
3 1992 4.536352 4.254217 4.818488 0.12949038
4 1993 4.428733 4.179427 4.678040 0.11442305
5 1994 4.321114 4.099599 4.542629 0.10166755
6 1995 4.213495 4.012633 4.414356 0.09218854

   year      fit      lwr      upr        se
21 2010 2.599208 1.941261 3.257156 0.3019752

   year      fit      lwr      upr       se
21  Inf 13.45308 6.967529 25.97555 1.352528

Similarly, the predict() function is used to estimate the response and a prediction interval for the individual response at the values created above. Notice the value of the interval= argument in the predict() function to specify that the prediction interval is wanted. Also notice that the form of the returned object differs slightly from that previously, requiring a slight change in programming to extract the values and make a nice table.

# Predict the INDIVIDUAL dioxin levels in each year
# This is a bit strange because the data points are the dioxin level in a composite
# sample and not individual crabs. So these prediction intervals
# refer to the range of composite values and not the
# levels in individual crabs.
# R does not produce the se for individual predictions
predict.indiv <- predict(crabs.fit, newdata=newyears,
                         interval="prediction")
# This creates a list that you need to restructure to make it look nice
predict.indiv.df <- cbind(newyears, predict.indiv)
head(predict.indiv.df)
predict.indiv.df[predict.indiv.df$year==2010,]
exp(predict.indiv.df[predict.indiv.df$year==2010,])

# Add the prediction intervals to the plot
plotfit.indivci <- plotfit.avgci +
   geom_ribbon(data=predict.indiv.df, aes(x=year, y=NULL, ymin=lwr, ymax=upr),
               alpha=0.1)
plotfit.indivci

giving:

  year      fit      lwr      upr
1 1990 4.751590 3.961833 5.541348
2 1991 4.643971 3.870959 5.416983
3 1992 4.536352 3.777577 5.295127
4 1993 4.428733 3.681543 5.175923
5 1994 4.321114 3.582732 5.059496
6 1995 4.213495 3.481044 4.945946

   year      fit      lwr      upr
21 2010 2.599208 1.635344 3.563072

   year      fit      lwr      upr
21  Inf 13.45308 5.131223 35.27139

The estimated mean log(TEQ) in 2010 is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between 6.96 and 26.05.9 Note that the confidence interval after taking anti-logs is no longer symmetrical.

Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as the mean and standard errors cannot be simply anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
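A small simulated sketch may make this concrete (the parameter values are arbitrary):

# Sketch: for log-normal data, exp(mean(log(y))) recovers the median of y, not the mean
set.seed(123)
y <- exp(rnorm(100000, mean=2, sd=1))   # log-normal data
exp(mean(log(y)))    # close to the median, exp(2) = 7.39
median(y)            # also close to 7.39
mean(y)              # larger, close to exp(2 + 1/2) = 12.18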

Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

9 A minor correction can be applied to estimate the mean if required.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.

Rather surprisingly, R does NOT have a function in the base release, nor in any of the packages, for inverse regression. A few people have written "one-off" functions, but the code needs to be checked carefully. For this class, I would plot the confidence and prediction intervals, and then work 'backward' from the target Y value to see where it hits the confidence limits and then drop down to the X axis.

We compute the predicted values for a wide range of X values and get the plot of the two intervals and then follow the example above:

plotinvpred <- plotfit.indivci +
   geom_hline(yintercept=log(10))+
   xlim(c(1990, 2030))

plotinvpred


The predicted year is found by solving

   log(10) = 2.302 = 218.9 − 0.11(year)

which gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
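The point estimate of that year can be sketched directly from the estimated coefficients (the confidence limits still have to be read off the plot):

# Sketch: year at which the estimated mean log(TEQ) equals log(10)
b <- coef(crabs.fit)
(log(10) - b["(Intercept)"]) / b["year"]     # about 2012.7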

The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.

14.4.13 Example: Weight-length relationships - transformation

A common technique in fisheries management is to investigate the relationship between weight and lengths of fish.

This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew "equally" in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.

The traditional model between weight and length is often postulated to be of the form:

   weight = a × length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.


How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

   log(weight) = log(a) + b × log(length)

or

   log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.

The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.

Length (mm) Weight (g)

34 585

46 1941

33 462

36 511

32 428

33 396

34 527

34 485

33 453

44 1426

35 488

34 511

32 403

31 379

30 319

33 483

36 600

35 532

29 326

34 507

32 414

33 432

33 462

35 566

34 454

35 600

29 336

31 451

33 474

32 480

35 474

30 330

30 376

34 523

31 353

32 412

32 407

The data is available in the wtlen.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. The data are imported into R using the read.csv() function:

wtlen <- read.csv("wtlen.csv", header=TRUE)
wtlen$log.weight <- log(wtlen$weight)
wtlen$log.length <- log(wtlen$length)
wtlen[1:5,]

The log(weight) and log(length) are added to the data frame. Note that the log() function is the natural logarithm (base e) function.

Part of the raw data are shown below:

  length weight log.weight log.length
1   33.8    585   6.371612   3.520461
2   45.5   1941   7.570959   3.817712
3   32.9    462   6.135565   3.493473
4   35.5    511   6.236370   3.569533
5   32.4    428   6.059123   3.478158

We plot the raw data and use the lowess() function to fit a smooth curve:

plot(wtlen$log.length, wtlen$log.weight,
     main='log(weight) vs. log(length)',
     ylab='log(weight)',
     xlab='log(length)')
lines(lowess(wtlen$log.length, wtlen$log.weight))

The fit appears to be non-linear but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the area of 30-35 mm in length. If you look at the plot carefully, the variance appears to be increasing with the length, with the spread noticeably wider at 35 mm than at 30 mm.

We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a "log" transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors try to use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back transformation is required.

We do the usual fit and examination of the residuals:

weight.fit.outliers <- lm(log.weight ~ log.length, data=wtlen)
summary(weight.fit.outliers)
layout(matrix(1:2), 2, 1)
plot(residuals(weight.fit.outliers))
abline(h=0)
plot(wtlen$log.length, wtlen$log.weight,
     main='log(weight) vs. log(length)',
     ylab='log(weight)',
     xlab='log(length)')
abline(weight.fit.outliers)

Call:
lm(formula = log.weight ~ log.length, data = wtlen)

Residuals:
      Min        1Q    Median        3Q       Max
-0.180994 -0.060423  0.003079  0.036923  0.264263

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -6.6064     0.6185  -10.68 1.47e-12 ***
log.length    3.6444     0.1763   20.68  < 2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.09656 on 35 degrees of freedom
Multiple R-squared: 0.9243, Adjusted R-squared: 0.9222
F-statistic: 427.5 on 1 and 35 DF, p-value: < 2.2e-16

The fit is not very satisfactory. The curve doesn't seem to fit the two "outlier" points very well. At smaller lengths, the curve seems to be underfitting the weight. The residual plot appears to show the two definite outliers and also shows some evidence of a poor fit with positive residuals at lengths of 30 mm and negative residuals at 35 mm.

The fit was repeated dropping the two largest fish with the following output:

keep <- wtlen$log.length < 3.7
wtlen2 <- wtlen[keep,]
weight.fit <- lm(log.weight ~ log.length, data=wtlen2)
summary(weight.fit)
layout(matrix(1:2), 2, 1)
plot(residuals(weight.fit))
abline(h=0)
plot(wtlen2$log.length, wtlen2$log.weight,
     main='log(weight) vs. log(length) 2 large fish removed',
     ylab='log(weight)',
     xlab='log(length)')
abline(weight.fit)

Call:
lm(formula = log.weight ~ log.length, data = wtlen2)

Residuals:
      Min        1Q    Median        3Q       Max
-0.132731 -0.050609  0.001747  0.035298  0.182784

Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  -3.5531     0.7443  -4.774 3.59e-05 ***
log.length    2.7672     0.2132  12.980 1.63e-14 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.07112 on 33 degrees of freedom
Multiple R-squared: 0.8362, Adjusted R-squared: 0.8313
F-statistic: 168.5 on 1 and 33 DF, p-value: 1.627e-14

Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.

The estimated power coefficient is 2.76 (SE 0.21). We find the 95% confidence interval for the slope (the power coefficient):

confint(weight.fit)

                 2.5 %    97.5 %
(Intercept) -5.067352 -2.038749
log.length   2.333482  3.200951

The 95% confidence interval for the power coefficient is from (2.33 to 3.2), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say too much.
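A formal test of the hypothesis that the power coefficient equals 3 can be sketched using the same T-statistic approach described earlier in this chapter:

# Sketch: test H: beta1 = 3 for the weight-length power coefficient
b1    <- coef(weight.fit)["log.length"]
se.b1 <- sqrt(diag(vcov(weight.fit)))["log.length"]
tstat <- (b1 - 3) / se.b1                      # about (2.767 - 3)/0.213 = -1.09
2 * pt(-abs(tstat), df.residual(weight.fit))   # two-sided p-value, about 0.28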


The actual model in the population is:

log(weight) = β0 + β1 log(length) + ε

This implies that the “errors” in growth act on the LOG-scale. This seems reasonable.

For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the "error" is 20/200 = 10% of the predicted value in the first case, while only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the "errors" operate on the log-scale, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.

Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.

A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize

   Σ (weight − β0 × length^β1)²

directly.

The nls() function can be used to fit the non-linear least-squares line directly:

wtlen.nls <- nls(weight ~ b0*length**b1, data=wtlen2,
                 start=list(b0=1, b1=3))

summary(wtlen.nls)

plot(wtlen2$length, wtlen2$weight,
     main='weight vs. length fit using nls',
     ylab='weight',
     xlab='length')
points(wtlen2$length[order(wtlen2$length)],
       predict(wtlen.nls)[order(wtlen2$length)], type="l")

Formula: weight ~ b0 * length^b1

Parameters:
   Estimate Std. Error t value Pr(>|t|)
b0  0.03233    0.02752   1.175    0.248
b1  2.73323    0.24269  11.262 7.61e-13 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 34.19 on 33 degrees of freedom

Number of iterations to convergence: 6


Achieved convergence tolerance: 3.457e-08


The estimated power coefficient from the non-linear fit is 2.73 with a standard error of 0.24. The estimated intercept is 0.0323 with an estimated standard error of 0.027. Both estimates are similar to the previous fit.

Which is a better method to fit these data? The non-linear fit assumes that the errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.

For this problem, both the non-linear fit and the fit on the log-scale gave the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish - this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the


observed weight and predicted weight.

14.4.14 Power/Sample Size

A power analysis and sample size determination can also be done for regression problems, but is more complicated than power analyses for simple experimental designs. This is for a number of reasons:

• The power depends not only on the total number of points collected, but also on the actual distribution of the X values.

For example, the power to detect a trend is different if the X values are evenly distributed over the range of predictors than if the X values are clustered at the ends of the range of the predictors. A regression analysis has the most power to detect a trend if half the observations are collected at a small X value and half of the observations are collected at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected, but this is often of more interest as lack-of-fit and non-linearity can then be detected.

• Data collected for regression analysis is often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.

• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression. However, modern software should be able to deal with this issue.

For a power analysis, the information required is similar to that requested for ANOVA designs:

• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

• effect size. In ANOVA, power deals with detection of differences among means. In regression analysis, power deals with detection of slopes that are different from zero. Hence, the effect size is measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X.

• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at each of the two extremes of the X space - but at a cost of not being able to detect non-linearity. It turns out that a simple summary of the distribution of the X values (the standard deviation of the X values) is all that is needed.

• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.

JMP (V.10) does not currently contain a module to do power analysis for regression. R also does not include a power computation module for regression analysis, but I have written a small function that is available in the SampleProgramLibrary. SAS (Version 9+) includes a power analysis module (GLMPOWER) for the power analysis. Russ Lenth also has a Java applet that can be used for determining power in a regression context: http://homepage.stat.uiowa.edu/~rlenth/Power/.

The problem simplifies considerably when the X variable is time, and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time


is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. The analysis of trend data and power/sample size computations is treated in a following chapter.

Let us return to the example of the yield of tomatoes vs. the amount of fertilizer. We wish to design an experiment to detect a slope of 1 (the effect size). From past data (on a different field), the standard deviation of values about the regression line is about 4 units (the standard deviation of the residuals).

We have enough money to plant 12 plots with levels of fertilizer ranging from 10 to 20. How does the power compare under different configurations of fertilizer levels? More specifically, how does the power compare between using fertilizer levels (10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20), i.e. an even distribution of levels of fertilizer, and (10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20), i.e. doing two replicates at each level of fertilizer but using fewer distinct levels?

I have written two R functions to compute power in cases of simple linear regression. One function computes power using analytical methods based on a paper by Stroup (1999).10 The second function estimates the power using a simulation method, i.e. it generates 'fake' datasets with properties similar to the design under question, fits the regression line, and then sees what fraction of the simulated datasets detect a slope different from zero. Normally about 1000 simulated datasets are sufficient for power investigations.
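
The following is a minimal sketch of the simulation idea for readers who want to see how such a calculation works. It is not the author's slr.power.sim() function; the function name sim.power and its arguments are illustrative only:

# Estimate power by simulation: generate fake data with a known slope,
# refit the regression, and count how often the slope is declared non-zero.
sim.power <- function(Trend, Xvalues, Sampling.SD, alpha=0.05, nsim=1000, beta0=0) {
  detect <- replicate(nsim, {
    y   <- beta0 + Trend * Xvalues + rnorm(length(Xvalues), sd=Sampling.SD)
    fit <- lm(y ~ Xvalues)
    summary(fit)$coefficients["Xvalues", "Pr(>|t|)"] < alpha
  })
  mean(detect)   # proportion of simulated datasets that detect the trend
}
# e.g. sim.power(Trend=1, Xvalues=c(10,11,12,13,14,15,15,16,17,18,19,20), Sampling.SD=4)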

The R functions that I have written are more general than standard regression cases in that they allow both process and sampling error. Please consult the chapter on trend analysis for an explanation of process vs. sampling error. In this example, we assume that each plot's growth is independent of any other plot and that all the plots are grown in the same growing season. For this reason, we assume that process error is zero.

We call the two versions of the power functions. The argument lists are self-explanatory.

Xvalues <- c(10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20)

cat("Power for X values: ", Xvalues, "\n")slr.power.stroup(Trend=1, Xvalues=Xvalues, # evenly spaced data

Process.SD=0, Sampling.SD=4, alpha=0.05)slr.power.sim (Trend=1, Xvalues=Xvalues, # evenly spaced data

Process.SD=0, Sampling.SD=4, alpha=0.05, nsim=1000)

This computes the power as:

Power for X values: 10 11 12 13 14 15 15 16 17 18 19 20
  alpha Trend Process.SD Sampling.SD Beta0 Beta1 dfdenom
1  0.05     1          0           4   -10     1      10
    ncp    Fcrit power.2s    Tcrit power.1s.a   power.1s.b
1 6.875 4.964603 0.657489 1.812461  0.7863745 1.997475e-05
Power.slope
      0.672

The power is computed to be 0.657 from the analytical method and estimated to be 0.672 via simulation (of course, the simulated values will differ every time the function is run and will be accurate to within 3 percentage points of the analytical value, 19 times out of 20).

10 Stroup, W. W. (1999). Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15-24 in Proc. Biopharmaceutical Section. Am. Stat. Assoc., Baltimore, MD.

The functions can be re-run with the revised levels of fertilizer and the estimated power is:

Power for X values: 10 10 12 12 14 14 16 16 18 18 20 20
  alpha Trend Process.SD Sampling.SD Beta0 Beta1 dfdenom
1  0.05     1          0           4   -10     1      10
   ncp    Fcrit  power.2s    Tcrit power.1s.a   power.1s.b
1 8.75 4.964603 0.7599065 1.812461  0.8653542 4.819913e-06
Power.slope
      0.756

The power has increased to 0.760 (computed using the analytical method) and 0.756 (computed by the simulation method).

This increase in power is intuitively correct – power increases in regression as the number of data points at each end of the X range increases, all else being equal.

The power to detect a range of slopes using the last set of X values was also computed (see the R and SAS code) and a plot of the power vs. the size of the slope can be made, as sketched below.
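
One way such a power curve might be produced is sketched here. It assumes, as the printed output above suggests, that slr.power.stroup() returns a data frame containing a power.2s column; the vector of slopes is illustrative only:

slopes <- seq(0.2, 2, by=0.2)
power.by.slope <- sapply(slopes, function(s) {
  res <- slr.power.stroup(Trend=s, Xvalues=Xvalues,
                          Process.SD=0, Sampling.SD=4, alpha=0.05)
  res$power.2s     # two-sided power for this slope (assumed column name)
})
plot(slopes, power.by.slope, type="b",
     xlab="Slope (effect size)", ylab="Power to detect the slope")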

Both the simulated and analytical power estimates are comparable (as they must be). The power to detect smaller slopes is limited.

Russ Lenth's power modules11 can be used to compute the power for these two cases. Here the modules require the standard deviation of the X values, but this needs to be computed using the n divisor rather than the n − 1 divisor, i.e.

SDLenth(X) = √( ∑(X − X̄)² / n )

11 http://homepage.stat.uiowa.edu/~rlenth/Power/

For the two sets of fertilizer values, the SDs are 3.02765 and 3.41565, respectively.
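
As a quick check, the n-divisor standard deviations can be computed directly in R. The small helper below is written for illustration and is not part of the original notes:

# Standard deviation using the n (rather than n-1) divisor, as required by Lenth's applet
sd.n <- function(x) sqrt(sum((x - mean(x))^2) / length(x))
sd.n(c(10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20))   # 3.02765
sd.n(c(10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20))   # 3.41565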

The output from Lenth's power analysis (not reproduced here) matches the earlier results (as it must).

14.4.15 The perils of R2

R2 is a "popular" measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit etc. However, there are several fundamental problems of R2 which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, p. 245-246).


Before exploring this, how is R2 computed and how is it interpreted?

While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:

Source            df            SS
Regression        p − 1         A
Lack-of-fit       n − p − ne    B
Pure error        ne            C
Corrected Total   n − 1         D

where there are n observations, the fitted regression model has p parameters (including the intercept), and ne is the degrees of freedom for pure error from the replicated X values.

R2 is computed as

R2 = SS(regression) / SS(total) = A / D = 1 − (B + C) / D

where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R2.

For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source     DF   Sum of Squares   Mean Square   F Ratio   p-value
Model       1        225.18035       225.180   69.8800    <.0001
Error       9         29.00147         3.222
C. Total   10        254.18182

Here R2 = 225.18035/254.18182 = .885 = 88.5%.
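
In R, the same quantity can be recovered either from summary() or by hand from the ANOVA decomposition. A minimal sketch, assuming a fitted object such as fert.fit from the earlier fertilizer example (the object and variable names are illustrative):

summary(fert.fit)$r.squared          # R^2 reported directly
a  <- anova(fert.fit)                # ANOVA table: model and residual rows
ss <- a$"Sum Sq"
ss[1] / sum(ss)                      # SS(regression)/SS(total), the same R^2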

R2 is interpreted as the proportion of variance in Y accounted for by the regression. In this case, almost 90% of the variation in Y is accounted for by the regression. The value of R2 must range between 0 and 1.

It is tempting to think that R2 must be a measure of the "goodness of fit". In a technical sense it is, but R2 is not a very good measure of fit; other characteristics of the regression equation, in particular the estimate of the slope and its standard error, are much more informative.

Here are some reasons why I decline to use R2 very much:

• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R2 = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R2 must increase even if the additional variables are useless (see the small simulation sketch following this list).

• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among the set of replicate X values), or B (if the outlier occurs at a singleton X value). In either case, they reduce R2, so R2 is not resistant to outliers.

• People misinterpret high R2 as implying the regression line is useful. It is tempting to believe that a higher value of R2 implies that a regression line is more useful. But consider the pair of plots below:


The graph on the left has a very high R2, but the change in Y as X varies is negligible. The graph on the right has a lower R2, but the average change in Y per unit change in X is considerable. R2 measures the "tightness" of the points about the line – the higher value of R2 on the left indicates that the points fit the line very well. The value of R2 does NOT measure how much actual change occurs.

• Upper bound is not always 1. People often assume that a low R2 implies a poor-fitting line. If you have replicate X values, then C > 0. The maximum value of R2 for this problem can be much less than 100% - it is mathematically impossible for R2 to reach 100% with replicated X values. In the extreme case where the model "fits perfectly" (i.e. the lack-of-fit term is zero), R2 can never exceed 1 − C/D.

• No-intercept models. If there is no intercept, then the usual total sum of squares D = ∑(Yi − Ȳ)² no longer applies, and R2 is not really defined.

• R2 gives no additional information. In actual fact, R2 is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R2.

• R2 is not useful for non-linear fits. R2 is really only useful for linear fits with the estimated regression line free to have a non-zero intercept. The reason is that R2 is really a comparison between two types of models. For example, refer back to the length-weight relationship examined earlier.

In the linear fit case, the two models being compared are

log(weight) = log(b0) + error

vs.

log(weight) = log(b0) + b1 × log(length) + error

and so R2 is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β1 = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error

vs.

weight = b0 × length^b1 + error

The model weight = 0 is silly, and so R2 is silly.

Hence, the R2 values reported are really all for linear fits - it is just that sometimes the actual linear fit is hidden.

• Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R2 is again not defined.


• Cannot be used with different transformations of Y. R2 cannot be used to compare models that are fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the highest R2. This is not appropriate as the D terms are no longer comparable between the two models.

• Cannot be used for non-nested models. R2 cannot be used to compare models with different sets of X variables unless one model is nested within another model (i.e. all of the X variables in the smaller model also appear in the larger model). So using R2 to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate as these two models are not nested. In these cases, AIC should be used to select among models.
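
The overfitting point above can be seen with a small simulation. This is an illustrative sketch with made-up data (not from the notes): adding predictors that are pure noise still increases R2.

# Adding useless predictors can only increase R^2
set.seed(123)
n  <- 20
x1 <- runif(n)
y  <- 2 + 3*x1 + rnorm(n)
junk <- matrix(rnorm(n*5), ncol=5)             # five predictors unrelated to y
summary(lm(y ~ x1))$r.squared                  # R^2 for the true model
summary(lm(y ~ x1 + junk))$r.squared           # larger, even though junk is useless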

14.5 A no-intercept model: Fulton’s Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.

The biggest 'problem' lies in interpreting some of the output – some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.

The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?

In general, the relationship between fish weight and length follows a power law:

W = aL^b

where W is the observed weight, L is the observed length, and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition, which can be found by a simple literature search. Cone (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.

One common measure is Fulton's12 K:

K = Weight / (Length/100)^3

This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.

12 There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.


The data are available in the rainbow-condition.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into R using the read.csv() function:

fish <- read.csv("rainbow-condition.csv", header=TRUE, as.is=TRUE, strip.white=TRUE)
fish$K <- fish$Weight..g./(fish$Len..mm./100)**3
fish$SpeciesF  <- factor(fish$Species)
fish$MaturityF <- factor(fish$Maturity)
fish$SexF      <- factor(fish$Sex)
fish[1:5,]

Part of the raw data are shown below:

  Net.Type Fish Len..mm. Weight..g. Species Sex Maturity        K SpeciesF MaturityF SexF
1  Sinking    1      360        686      RB   F MATURING 14.70336       RB  MATURING    F
2  Sinking    2      385        758      RB   F MATURING 13.28272       RB  MATURING    F
3  Sinking    3      295        284      RB   M MATURING 11.06247       RB  MATURING    M
4  Sinking    4      285        292      RB   F MATURING 12.61387       RB  MATURING    F
5  Sinking    5      380        756      RB   F MATURING 13.77752       RB  MATURING    F

K was computed for each individual fish, and the resulting histogram is displayed below:

histplot <- ggplot(data=fish, aes(x=K)) +
  ggtitle("Histogram of condition factor K") +
  geom_histogram()

histplot


There is a range of condition numbers among the individual fish, with an average (among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, then this has a selectivity curve and the net is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try and ensure that fish of all sizes have an equal chance of being selected.

As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.

Fulton’s index is often re-expressed for regression purposes as:

W = K × (L/100)^3

This looks like a simple regression between W and (L/100)^3 but with no intercept.

A plot of these two variables:

fish$W <- fish$Weight..g.


fish$L <- (fish$Len..mm./100)**3
wplot <- ggplot(data=fish, aes(x=L, y=W)) +
  ggtitle("W vs (L/100)**3") +
  xlab("(L/100)**3") +
  geom_point()

wplot

shows a tight relationship among fish but with possible increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly assume that all of the "error" in the regression is in the vertical direction, i.e. they condition on the observed lengths. However, the structural relationship between weight and length likely has "errors" in both variables. This would lead to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.

This model is fit in R using the lm() function. Notice the way that the no-intercept model is specified in the formula:

k.fit <- lm(W ~ -1 + L, data=fish)
summary(k.fit)

Call:
lm(formula = W ~ -1 + L, data = fish)

Residuals:


    Min      1Q  Median      3Q     Max
-104.65  -13.71   -0.93   10.54   98.56

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
L 13.72947    0.09878     139   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 31.3 on 122 degrees of freedom
Multiple R-squared: 0.9937, Adjusted R-squared: 0.9937
F-statistic: 1.932e+04 on 1 and 122 DF, p-value: < 2.2e-16

Note that R2 really doesn't make sense in cases where the regression is forced through the origin because the null model to which it is being compared is the line Y = 0, which is silly.13

The estimated value of K is 13.72 (SE 0.099).

The residual plot:

diagplot <- autoplot(k.fit)
diagplot

13 Consult any of the standard references on regression such as Draper and Smith for more details.


shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length^2. In this case, such a regression gives essentially the same estimate of the condition factor (K = 13.67, SE = 0.11).
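
A minimal sketch of how such a weighted fit could be run is shown below. The exact weights used for the quoted estimate are not shown in the notes, so this assumes weights proportional to 1/length^2 as described above:

# Weighted no-intercept regression; the weights downweight the more variable large fish
k.fit.wt <- lm(W ~ -1 + L, data=fish, weights=1/fish$Len..mm.^2)
summary(k.fit.wt)$coefficients   # estimate and SE of K under weighting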

Comparing condition factors

This dataset has a number of sub-groups – do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).
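
As a preview, one way to obtain a separate K for each maturity class is to fit a no-intercept model with one slope per group. This sketch is illustrative only; the formal comparison of groups is deferred to the ANCOVA chapter:

# Separate condition factor for each maturity class (no intercept, one slope per group)
k.fit.grp <- lm(W ~ -1 + L:MaturityF, data=fish)
summary(k.fit.grp)$coefficients    # one K estimate (with SE) per maturity class
anova(k.fit, k.fit.grp)            # crude test of whether a common K is adequate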

14.6 Frequently Asked Questions - FAQ

14.6.1 Do I need a random sample; power analysis

A student wrote:


I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologic/geologic homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a-priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites but have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly but will deal with slope by breaking my stream segments into slope classes - I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.

You are both correct!

If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected - you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose values of the X variables, but must randomly select from streams with similar X values.

As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure - indeed it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft tall, 5 ft tall, 6 ft tall, 7 ft tall, and measure their height and arm length and then fit the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 feet tall, 1/4 who are 6 feet tall, and 1/4 who are 7 feet tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.

Now sample size is a bit more difficult as the required sample size depends both on the number of streams selected and how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest X and the very largest X value. However, without intermediate points, you can't assess linearity very well. So you will want points scattered around the range of X values.

If you have some preliminary data, a power/sample size analysis can be done using JMP, SAS, and other packages. If you do a Google search for power analysis regression, there are several direct links to examples. Refer to the earlier section of the notes.


14.7 Summary of simple linear regression

Objective: Estimate the slope (the relationship) between two continuous variables.

Experimental Design

• A Y is observed for a given X. The X's can be randomly chosen from a population or set as part of an experiment.

• No pairing, blocking, or stratification.

• Ensure that the Experimental Unit (EU) is the same as the Observational Unit (OU) to avoid pseudoreplication. This is particularly true in trend analysis, where you should generally have 1 number per year.

Data structure
Two columns in the dataset.

• X column for the predictor. This should be numeric.

• Y column for the response. This should be numeric.

The columns can be in any order. The rows can be in any order. The dataset is read in the usual ways.

Missing values / Unbalance
If there are missing values, ensure that they are Missing Completely at Random (MCAR). Any row with a missing value in Y or X will not be used in the regression.

Preliminary Plot
A scatterplot of Y vs. X. Check for outliers. Check for leverage points. Check for non-linear relationships.

Table 1
Compute a table of sample size, mean, and SD for each variable.

Analysis
Analysis code:

• JMP: Analyze->Fit Y-by-X. Choose Fit Line from the drop down menu. Or Analyze->Fit Model.

• SAS: Proc Reg data=blah; Model Y= X

• R: lm( Y ~ X, data=blah)

Followup
Make predictions at new values of X. Be careful to distinguish between confidence intervals for the MEAN response and prediction intervals for the INDIVIDUAL response.

• R:


– newX <- data.frame(xvar=c(x1, x2, x3, ...))

– pred.avg <- predict(fit, newdata=newX, interval="confidence") # ci for mean response

– pred.indiv <- predict(fit, newdata=newX, interval="prediction") # pi for individual values

Don't forget model assessment. Look at residual and other plots. If the X variable is time, look at autocorrelation using the Durbin-Watson statistic or similar diagnostic plots of lagged residuals.
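
For example, a quick autocorrelation check might look like the following sketch; it assumes the lmtest package is installed, and fit is a fitted lm object as above:

library(lmtest)
dwtest(fit)                                    # Durbin-Watson test on the residuals
# or plot successive residuals against each other:
r <- resid(fit)
plot(head(r, -1), tail(r, -1), xlab="residual at time t", ylab="residual at time t+1")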

What to write

• Materials and Methods: A simple linear regression was used to examine the relationship betweenY and X .

• Results: We found evidence of a relationship between Y and X (p = xxxx).

• Results: We found no evidence of a relationship between Y and X (p = xxxx).

• Results: The estimated slope is xxx (SE xxxx) units of Y per unit of X .

Report the actual p-value and not just whether it was < 0.05 or > 0.05. Even if you fail to detect an effect, still report the effect sizes and their standard errors so that the reader can determine if the effect is close to 0 (with a small se) or if the se is so large that nothing was learned from the study.

Power Analysis
Hard. Contact me.

Comments
The terms independent and dependent variable are old-fashioned and should not be used. The preferred terms are predictor and response variable.

A very common error is to confuse the confidence interval for the mean and the prediction interval for an individual response and use the wrong interval.

Be VERY AFRAID of pseudoreplication, especially if you are doing a regression against time (a trend analysis). Consult the relevant chapter for more details.

R2 is not as useful as you might expect. R2 measures the closeness of points to the line but not the size of the slope. So you can have all of the points close to the line (high R2) but a slope that is close to 0. Be cautious in using R2.

It is VERY rare to do a regression with no intercept. Contact me.

A log(Y ) transform is very common. It is usually NOT necessary to transform any X variable.
