
Chapter 14

Correlation and simple linear regression

Contents
14.1 Introduction
14.2 Graphical displays
     14.2.1 Scatterplots
     14.2.2 Smoothers
14.3 Correlation
     14.3.1 Scatter-plot matrix
     14.3.2 Correlation coefficient
     14.3.3 Cautions
     14.3.4 Principles of Causation
14.4 Single-variable regression
     14.4.1 Introduction
     14.4.2 Equation for a line - getting notation straight (no pun intended)
     14.4.3 Populations and samples
     14.4.4 Assumptions
     14.4.5 Obtaining Estimates
     14.4.6 Obtaining Predictions
     14.4.7 Residual Plots
     14.4.8 Example - Yield and fertilizer
     14.4.9 Example - Mercury pollution
     14.4.10 Example - The Anscombe Data Set
     14.4.11 Transformations
     14.4.12 Example: Monitoring Dioxins - transformation
     14.4.13 Example: Weight-length relationships - transformation
     14.4.14 Power/Sample Size
     14.4.15 The perils of R²
14.5 A no-intercept model: Fulton’s Condition Factor K
14.6 Frequently Asked Questions - FAQ
     14.6.1 Do I need a random sample; power analysis
14.7 Summary of simple linear regression


The suggested citation for this chapter of notes is:

Schwarz, C. J. (2019). Correlation and simple linear regression. In Course Notes for Beginning and Intermediate Statistics. Available at http://www.stat.sfu.ca/~cschwarz/CourseNotes. Retrieved 2019-11-03.

14.1 Introduction

A nice book explaining how to use JMP to perform regression analysis is: Freund, R., Littell, R., and Creighton, L. (2003). Regression Using JMP. Wiley Interscience.

Much of statistics is concerned with relationships among variables and whether observed relationships are real or simply due to chance. In particular, the simplest case deals with the relationship between two variables.

Quantifying the relationship between two variables depends upon the scale of measurement of each of the two variables. The following table summarizes some of the important analyses that are often performed to investigate the relationship between two variables.

                                X is Interval or Ratio          X is Nominal or Ordinal
                                (what JMP calls Continuous)

   Y is Interval or Ratio       - Scatterplots                  - Side-by-side dot plot
   (what JMP calls              - Running median/spline fit     - Side-by-side boxplot
   Continuous)                  - Regression                    - ANOVA or t-tests
                                - Correlation

   Y is Nominal or Ordinal      - Logistic regression           - Mosaic chart
                                                                - Contingency tables
                                                                - Chi-square tests

In JMP these combinations of two variables are analyzed using the Analyze->Fit Y-by-X platform, the Analyze->Correlation-of-Ys platform, or the Analyze->Fit Model platform.

When analyzing two variables, one question becomes important as it determines the type of analysis that will be done. Is the purpose to explore the nature of the relationship, or is the purpose to use one variable to explain variation in another variable? For example, there is a difference between examining height and weight to see if there is a strong relationship, as opposed to using height to predict weight.


Consequently, you need to distinguish between a correlational analysis, in which only the strength of the relationship will be described, and regression, where one variable will be used to predict the values of a second variable.

The two variables are often called either a response variable or an explanatory variable. A response variable (also known as a dependent or Y variable) measures the outcome of a study. An explanatory variable (also known as an independent or X variable) is the variable that attempts to explain the observed outcomes.

14.2 Graphical displays

14.2.1 Scatterplots

The scatter-plot is the primary graphical tool used when exploring the relationship between two interval or ratio scale variables. This is obtained in JMP using the Analyze->Fit Y-by-X platform – be sure that both variables have a continuous scale.

In graphing the relationship, the response variable is usually plotted along the vertical axis (the Y axis) and the explanatory variable is plotted along the horizontal axis (the X axis). It is not always perfectly clear which is the response and which is the explanatory variable. If there is no distinction between the two variables, then it doesn’t matter which variable is plotted on which axis – this usually only happens when finding the correlation between variables is the primary purpose.

For example, look at the relationship between calories/serving and fat from the cereal dataset using JMP. [We will create the graph in class at this point.]

What to look for in a scatter-plot

Overall pattern. What is the direction of association? A positive association occurs when above-average values of one variable tend to be associated with above-average values of another; the plot will have an upward slope. A negative association occurs when above-average values of one variable are associated with below-average values of another variable; the plot will have a downward slope. What happens when there is “no association” between the two variables?

Form of the relationship. Does a straight line seem to fit through the ‘middle’ of the points? Is the line linear (the points seem to cluster around a straight line) or is it curvi-linear (the points seem to form a curve)?

Strength of association. Are the points clustered tightly around the curve? If the points have a lot of scatter above and below the trend line, then the association is not very strong. On the other hand, if the amount of scatter above and below the trend line is very small, then there is a strong association.

Outliers. Are there any points that seem to be unusual? Outliers are values that are unusually far from the trend curve - i.e., they are further away from the trend curve than you would expect from the usual level of scatter. There is no formal rule for detecting outliers - use common sense. [If you set the role of a variable to be a label, and click on points in a linked graph, the label for the point will be displayed, making it easy to identify such points.]

One’s usual initial suspicion about any outlier is that it is a mistake, e.g., a transcription error. Every effort should be made to trace the data back to its original source and correct the value if possible. If the data value appears to be correct, then you have a bit of a quandary. Do you keep the data point in even though it doesn’t follow the trend line, or do you drop the data point because it appears to be anomalous? Fortunately, with computers it is relatively easy to repeat an analysis with and without an outlier - if there is very little difference in the final outcome - don’t worry about it.

In some cases, the outliers are the most interesting part of the data. For example, for many years the ozone hole in the Antarctic was missed because the computers were programmed to ignore readings that were so low that ‘they must be in error’!

Lurking variables. A lurking variable is a third variable that is related to both variables and may confound the association.

For example, the amount of chocolate consumed in Canada and the number of automobile accidents are positively related, but most people would agree that this is coincidental and each variable is independently driven by population growth.

Sometimes the lurking variable is a ’grouping’ variable of sorts. This is often examined by using a different plotting symbol to distinguish between the values of the third variable. For example, consider the following plot of the relationship between salary and years of experience for nurses.

The individual lines show a positive relationship, but the overall pattern, when the data are pooled, shows a negative relationship.

It is easy in JMP to assign different plotting symbols (what JMP calls markers) to different points. From the Row menu, use Where to select rows. Then assign markers to those rows using the Rows->Markers menu.

14.2.2 Smoothers

Once the scatter-plot is plotted, it is natural to try and summarize the underlying trend line. For example, consider the following data:


There are several common methods available to fit a line through this data.

By eye The eye has remarkable power for providing a reasonable approximation to an underlying trend, but it needs a little education. A trend curve is a good summary of a scatter-plot if the differences between the individual data points and the underlying trend line (technically called residuals) are small. As well, a good trend curve tries to minimize the total of the residuals. And the trend line should try and go through the middle of most of the data.

Although the eye often gives a good fit, different people will draw slightly different trend curves. Several automated ways to derive trend curves are in common use - bear in mind that the best ways of estimating trend curves will try and mimic what the eye does so well.

Median or mean trace The idea is very simple. We choose a “window” width of size w, say. For each point along the bottom (X) axis, the smoothed value is the median or average of the Y-values for all data points with X-values lying within the “window” centered on this point. The trend curve is then the trace of these medians or means over the entire plot. The result is not exactly smooth. Generally, the wider the window chosen, the smoother the result. However, wider windows make the smoother react more slowly to changes in trend. Smoothing techniques are too computationally intensive to be performed by hand. Unfortunately, JMP is unable to compute the trace of data, but splines are a very good alternative (see below).

The mean or median trace is too unsophisticated to be a generally useful smoother. For example, the simple averaging causes it to under-estimate the heights of peaks and over-estimate the heights of troughs. (Can you see why this is so? Draw a picture with a peak.) However, it is a useful way of trying to summarize a pattern in a weak relationship for a moderately large data set. In a very weak relationship it can even help you to see the trend.
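Although JMP cannot compute the trace, it can be computed in SAS with a self-join. The following is a minimal sketch, not taken from the original notes: the dataset name xy, the variable names x and y, and the window width are all illustrative assumptions. The resulting trace can be overlaid on the scatter-plot with a series statement in Proc SGplot.

%let w = 2;   /* window width - wider gives a smoother trace */

proc sql;
   create table trace as
   select a.x,
          mean(b.y) as smooth_y   /* average of y over the window at this x */
   from xy as a, xy as b
   where b.x between a.x - &w/2 and a.x + &w/2
   group by a.x
   order by a.x;
quit;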

Box plots for strips The following gives a conceptually simple method which is useful for exploring a weak relationship in a large data set. The X-axis is divided into equal-sized intervals. Then separate box plots of the values of Y are found for each strip. The box-plots are plotted side-by-side and the means or medians are joined. Again, we are able to see what is happening to the variability as well as the trend. There is even more detailed information available in the box plots about the shape of the Y-distribution etc. Again, this is too tedious to do by hand. It is possible to make this plot in JMP by creating a new variable that groups the values of the X variable into classes and then using the Analyze->Fit Y-by-X platform with these groupings. This is illustrated below:
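The same display can also be sketched in SAS by grouping X into classes and using Proc SGplot’s vbox statement; the dataset name xy, the variables x and y, and the class width of 5 are illustrative assumptions:

data strips;
   set xy;
   x_class = round(x, 5);   /* assign x to the nearest multiple of 5 */
run;

proc sgplot data=strips;
   vbox y / category=x_class connect=median;   /* join the medians */
run;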

Spline methods A spline is a series of short smooth curves that are joined together to create a larger smooth curve. The computational details are complex, but can be done in JMP. The stiffness of the spline indicates how straight the resulting curve will be. The following shows two spline fits to the same data with different stiffness measures:
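In SAS, a similar display can be sketched with the penalized B-spline statement of Proc SGplot; the dataset/variable names and the smooth= values (which control the stiffness) are illustrative assumptions:

proc sgplot data=xy;
   pbspline x=x y=y / smooth=100 legendlabel='stiffer spline';
   pbspline x=x y=y / smooth=0.1 legendlabel='more flexible spline' nomarkers;
run;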


14.3 Correlation

WARNING!: Correlation is probably the most abused concept in statistics. Many people use the word ‘correlation’ to mean any type of association between two variables, but it has a very strict technical meaning, i.e. the strength of an apparent linear relationship between two interval or ratio scaled variables.

The correlation measure does not distinguish between explanatory and response variables and it treats the two variables symmetrically. This means that the correlation between Y and X is the same as the correlation between X and Y.

Correlations are computed in JMP using the Analyze->Correlation of Y’s platform. If there are several variables, then the data will be organized into a table. Each cell in the table shows the correlation of the two corresponding variables. Because of symmetry (the correlation between variable1 and variable2 is the same as between variable2 and variable1), only part of the complete matrix will be shown. As well, the correlation between any variable and itself is always 1.

14.3.1 Scatter-plot matrix

To illustrate the ideas of correlation, look at the FITNESS dataset in the DATAMORE directory of JMP. This is a dataset on 31 people at a fitness centre, and the following variables were measured on each subject:

• name

• gender


• age

• weight

• oxygen consumption (higher values typically indicate more fit people)

• time to run one mile (1.6 km)

• average pulse rate during the run

• the resting pulse rate

• maximum pulse rate during the run.

We are interested in examining the relationship among the variables. At the moment, ignore the fact that the data contains both genders. [It would be interesting to assign different plotting symbols to the two genders to see if gender is a lurking variable.]

One of the first things to do is to create a scatter-plot matrix of all the variables, using the Analyze->Correlation of Ys platform.
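For SAS users, a comparable scatter-plot matrix (along with the correlation matrix discussed in the next section) can be requested from Proc CORR. This is a sketch only; it assumes the fitness data have been read into a dataset named fitness with the variable names shown:

ods graphics on;
proc corr data=fitness plots=matrix;
   var age weight oxy runtime runpulse rstpulse maxpulse;   /* assumed names */
run;
ods graphics off;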


Interpreting the scatter plot matrix

The entries in the matrix are scatter-plots for all the pairs of variables. For example, the entry in row 1, column 3 represents the scatter-plot between age and oxygen consumption, with age along the vertical axis and oxygen consumption along the horizontal axis, while the entry in row 3, column 1 has age along the horizontal axis and oxygen consumption along the vertical axis.

There is clearly a difference in the ’strength’ of relationships. Compare the scatter-plot for average running pulse rate and maximum pulse rate (row 5, column 7) to that of running pulse rate and resting pulse rate (row 5, column 6) and to that of running pulse rate and weight (row 5, column 2).

Similarly, there is a difference in the direction of association. Compare the scatter-plot for the average running pulse rate and maximum pulse rate (row 5, column 7) with that for oxygen consumption and running time (row 3, column 4).


14.3.2 Correlation coefficient

It is possible to quantify the strength of association between two variables. As with all statistics, the way the data are collected influences the meaning of the statistics.

The population correlation coefficient between two variables is denoted by the Greek letter rho (ρ) and is computed as:

   ρ = (1/N) Σ [ (Xi − μX) / σX ] [ (Yi − μY) / σY ]    (sum over i = 1, ..., N)

The corresponding sample correlation coefficient, denoted r, has a similar form:

   r = (1/(n−1)) Σ [ (Xi − X̄) / sx ] [ (Yi − Ȳ) / sy ]    (sum over i = 1, ..., n)

(Note that this formula SHOULD NOT be used for the actual computation of r; it is numerically unstable and there are better computing formulae available.)

If the sample is a simple random sample from the corresponding population, then r is an estimate of ρ. This is a crucial assumption: if the sampling is not a simple random sample, the above definition of the sample correlation coefficient should not be used! It is possible to find a confidence interval for ρ and to perform statistical tests that ρ is zero. However, for the most part, these are rarely done in ecological research and so will not be pursued further in this course.

The form of the formula does provide some insight into interpreting its value.

• ρ and r (unlike other population parameters) are unitless measures.

• the sign of ρ and r is largely determined by the pairing of each of the (X, Y) values with their respective means, i.e. if both X and Y are above their means, or both are below their means, the pair contributes a positive value towards ρ or r, while if X is above and Y is below (or X is below and Y is above) their respective means, the pair contributes a negative value towards ρ or r.

• ρ and r range from -1 to 1. A value of ρ or r equal to -1 implies a perfect negative correlation; a value of ρ or r equal to 1 implies a perfect positive correlation; a value of ρ or r equal to 0 implies no correlation. A perfect correlation (i.e. ρ or r equal to 1 or -1) implies that all points lie exactly on a straight line, but the slope of the line has NO effect on the correlation coefficient. This latter point is IMPORTANT and is often wrongly interpreted - see the demonstration following this list.

• ρ and r are unaffected by linear transformations of the individual variables, e.g. unit changes such as converting from imperial to metric units.

• ρ and r measure only the linear association; they are not affected by the slope of the line, but only by the scatter about the line.
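The last three points are easy to demonstrate. In this sketch (all names and constants are invented for the demonstration), y is an exact linear function of x with either a steep or a shallow slope, and x_cm is x after a change of units; every correlation computed is exactly 1:

data demo;
   do x = 1 to 10;
      y_steep   = 100  * x;   /* exact linear relation, steep slope   */
      y_shallow = 0.01 * x;   /* exact linear relation, shallow slope */
      x_cm      = 2.54 * x;   /* same x after a unit change           */
      output;
   end;
run;

proc corr data=demo;
   var  x x_cm;
   with y_steep y_shallow;
run;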

Because correlation assumes both variables have an interval or ratio scale, it makes no sense to compute the correlation

• between gender and oxygen consumption (gender is nominal scale data);

• between variables with a non-linear relationship (not shown on graph);


• for data collected without a known probability scheme. If a sampling scheme other than simple random sampling is used, it is possible to modify the estimation formula; if a non-probability sampling scheme was used, the patient is dead on arrival, and no amount of statistical wizardry will revive the corpse.

The data collection scheme for the fitness data set is unknown - we will have to assume that some sort of random sample from the relevant population was taken before we can make much sense of the numbers computed.

Before looking at the details of its computation, look at the sample correlation coefficients for each scatter-plot above. These can be arranged into a matrix:

Variable    Age    Weight   Oxy    Runtime  RunPulse  RstPulse  MaxPulse

Age         1.00   -0.24   -0.31    0.19    -0.31     -0.15     -0.41
Weight     -0.24    1.00   -0.16    0.14     0.18      0.04      0.24
Oxy        -0.31   -0.16    1.00   -0.86    -0.39     -0.39     -0.23
Runtime     0.19    0.14   -0.86    1.00     0.31      0.45      0.22
RunPulse   -0.31    0.18   -0.39    0.31     1.00      0.35      0.92
RstPulse   -0.15    0.04   -0.39    0.45     0.35      1.00      0.30
MaxPulse   -0.41    0.24   -0.23    0.22     0.92      0.30      1.00

Notice that the sample correlation between any two variables is the same regardless of the ordering of the variables – this explains the symmetry in the matrix between the above- and below-diagonal elements. As well, each variable has a perfect sample correlation with itself – this explains the value of 1 along the main diagonal.

Compare the sample correlations between the average running pulse rate and the other variables to the corresponding scatter-plots above.

14.3.3 Cautions

• Random Sampling Required. Sample correlation coefficients are only valid under simple random samples. If the data were collected in a haphazard fashion or if certain data points were oversampled, then the correlation coefficient may be severely biased.

• There are examples of high correlation but no practical use, and of low correlation but great practical use. These will be presented in class. This illustrates why I almost never talk about correlation.

• correlation measures the ‘strength’ of a linear relationship; a curvilinear relationship may have a correlation of 0 even though there is still a strong (but nonlinear) relationship.


• the effect of outliers and high leverage points will be presented in class

• effects of lurking variables. For example, suppose there is a positive association between wages of male nurses and years of experience, and between wages of female nurses and years of experience, but males are generally paid more than females. There is a positive correlation within each group, but an overall negative correlation when the data are pooled together.


• ecological fallacy - the problem of correlation applied to averages. Even if there is a high correlation between two variables on their averages, it does not imply that there is a correlation between individual data values.

For example, if you look at the average consumption of alcohol and the consumption of cigarettes, there is a high correlation among the averages when the 12 values from the provinces and territories are plotted on a graph. However, the individual relationships within provinces can be reversed or non-existent, as shown below:

Within each province, cigarette consumption and alcohol consumption show no relationship, yet there is a strong correlation among the per-capita averages. This is an example of the ecological fallacy.

• correlation does not imply causation. This is the most frequent mistake made by people. There is a set of principles of causal inference that need to be satisfied in order to imply cause and effect.

14.3.4 Principles of Causation

Types of association

An association may be found between two variables for several reasons (show causal modeling figures):

• There may be direct causation, e.g. smoking causes lung cancer.

• There may be a common cause, e.g. ice cream sales and the number of drownings both increase with temperature.

• There may be a confounding factor, e.g. highway fatalities decreased when the speed limits were reduced to 55 mph at the same time that the oil crisis caused supplies to be reduced and people drove fewer miles.

• There may be a coincidence, e.g., the population of Canada has increased at the same time as the moon has gotten closer by a few miles.

Establishing cause-and-effect

How do we establish a cause-and-effect relationship? Bradford Hill (Hill, A. B. 1971. Principles of Medical Statistics, 9th ed. New York: Oxford University Press) outlined 7 criteria that have been adopted by many epidemiological researchers. It is generally agreed that most or all of the following must be considered before causation can be declared.

Strength of the association. The stronger an observed association appears over a series of different studies, the less likely this association is spurious because of bias.

Dose-response effect. The value of the response variable changes in a meaningful way with the dose (or level) of the suspected causal agent.

Lack of temporal ambiguity. The hypothesized cause precedes the occurrence of the effect. The ability to establish this time pattern will depend upon the study design used.

Consistency of the findings. Most, or all, studies concerned with a given causal hypothesis produce similar findings. Of course, studies dealing with a given question may all have serious bias problems that can diminish the importance of observed associations.

Biological or theoretical plausibility. The hypothesized causal relationship is consistent with current biological or theoretical knowledge. Note that the current state of knowledge may be insufficient to explain certain findings.

Coherence of the evidence. The findings do not seriously conflict with accepted facts about the outcome variable being studied.

Specificity of the association. The observed effect is associated with only the suspected cause (or with few other causes that can be ruled out).


IMPORTANT: NO CAUSATION WITHOUT MANIPULATION!

Examples:

Discuss the above in relation to:

• amount of studying vs. grades in a course.

• amount of clear cutting and sediments in water.

• fossil fuel burning and the greenhouse effect.

14.4 Single-variable regression

14.4.1 Introduction

Along with the Analysis of Variance, this is likely the most commonly used statistical methodology in ecological research. In virtually every issue of an ecological journal, you will find papers that use a regression analysis.

There are HUNDREDS of books written on regression analysis. Some of the better ones (IMHO) are:

Draper and Smith. Applied Regression Analysis. Wiley.
Neter, Wasserman, and Kutner. Applied Linear Statistical Models. Irwin.
Kleinbaum, Kupper, and Miller. Applied Regression Analysis. Duxbury.
Zar. Biostatistics. Prentice Hall.

Consequently, this set of notes is VERY brief and makes no pretense to be a thorough review of regression analysis. Please consult the above references for all the gory details.

It turns out that both Analysis of Variance and Regression are special cases of a more general statistical methodology called General Linear Models, which in turn are special cases of Generalized Linear Models, which in turn are special cases of Generalized Additive Models, which in turn are special cases of .....

The key difference between a Regression analysis and an ANOVA is that the X variable is nominal scaled in ANOVA, while in regression analysis the X variable is continuous scaled. This implies that in ANOVA the shape of the response profile is unspecified (the null hypothesis is that all means are equal while the alternate is that at least one mean differs), while in regression the response profile must be a straight line.

Because both ANOVA and regression are from the same class of statistical models, many of the assumptions are similar, the fitting methods are similar, and hypothesis testing and inference are similar as well.


14.4.2 Equation for a line - getting notation straight (no pun intended)

In order to use regression analysis effectively, it is important that you understand the concepts of slopes and intercepts and how to determine these from data values.

This will be QUICKLY reviewed here in class.

In previous courses at high school or in linear algebra, the equation of a straight line was often written y = mx + b, where m is the slope and b is the intercept. In some popular spreadsheet programs, the authors decided to write the equation of a line as y = a + bx; now a is the intercept, and b is the slope. Statisticians, for good reasons, have rationalized this notation and usually write the equation of a line as y = β0 + β1x or as Y = b0 + b1X (the distinction between β0 and b0 will be made clearer in a few minutes). The use of the subscript 0 to represent the intercept and the subscript 1 to represent the coefficient for the X variable then readily extends to more complex cases.

Review the definition of the intercept as the value of Y when X = 0, and the slope as the change in Y per unit change in X.

14.4.3 Populations and samples

All of statistics is about detecting signals in the face of noise and in estimating population parameters from samples. Regression is no different.

First consider the population. As in previous chapters, the correct definition of the population is important as part of any study. Conceptually, we can think of the large set of all units of interest. On each unit there are, conceptually, both an X and a Y variable present. We wish to summarize the relationship between Y and X, and furthermore wish to make predictions of the Y value for future X values that may be observed from this population. [This is analogous to having different treatment groups corresponding to different values of X in ANOVA.]

If this were physics, we might conceive of a physical law between X and Y, e.g. F = ma or PV = nRT. However, in ecology, the relationship between Y and X is much more tenuous. If you could draw a scatter-plot of Y against X for ALL elements of the population, the points would NOT fall exactly on a straight line. Rather, the value of Y would fluctuate above or below a straight line at any given X value. [This is analogous to saying that Y varies randomly around the treatment group mean in ANOVA.]

We denote this relationship as

   Y = β0 + β1X + ε

where β0 and β1 are the POPULATION intercept and slope respectively. We say that

   E[Y] = β0 + β1X

is the expected or average value of Y at X. [In ANOVA, we let each treatment group have its own mean; here in regression we assume that the means must fit on a straight line.]

The term ε represents random variation of individual units in the population above and below the expected value. It is assumed to have constant standard deviation over the entire regression line (i.e. the spread of data points in the population is constant over the entire regression line). [This is analogous to the assumption of equal treatment population standard deviations in ANOVA.]

Of course, we can never measure all units of the population, so a sample must be taken in order to estimate the population slope, population intercept, and population standard deviation. Unlike a correlation analysis, it is NOT necessary to select a simple random sample from the entire population, and more elaborate schemes can be used. The bare minimum that must be achieved is that for any individual X value found in the sample, the units in the population that share this X value must have been selected at random.

This is quite a relaxed assumption! For example, it allows us to deliberately choose values of X from the extremes and then, only at those X values, randomly select from the relevant subset of the population, rather than having to select at random from the population as a whole. [This is analogous to the assumptions made in an analytical survey, where we assumed that even though we can’t randomly assign a treatment to a unit (e.g. we can’t assign sex to an animal), we must ensure that animals are randomly selected from each group.]

Once the data points are selected, the estimation process can proceed, but not before assessing the assumptions!

14.4.4 Assumptions

The assumptions for a regression analysis are very similar to those found in ANOVA.

Linearity

Regression analysis assumes that the relationship between Y and X is linear. Make a scatter-plot between Y and X to assess this assumption. Perhaps a transformation is required (e.g. log(Y) vs. log(X)). Some caution is required with transformations in dealing with the error structure, as you will see in later examples.

Plot the residuals vs. the X values. If the scatter is not random around 0 but shows some pattern (e.g. a quadratic curve), this usually indicates that the relationship between Y and X is not linear. Second, fit a model that includes X and X² and test if the coefficient associated with X² is zero; unfortunately, this test could fail to detect a higher-order relationship. Third, if there are multiple readings at some X values, then a test of goodness-of-fit can be performed where the variation of the responses at the same X value is compared to the variation around the regression line.
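A minimal sketch of the quadratic-term check (the dataset name xy and the variable names x and y are illustrative assumptions):

data xy2;
   set xy;
   x2 = x**2;
run;

proc reg data=xy2;
   model y = x x2;   /* examine the t-test that the x2 coefficient is zero */
run;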

Correct scale of predictor and response

The response and predictor variables must both have an interval or ratio scale. In particular, using a numerical value to represent a category and then using this numerical value in a regression is not valid. For example, suppose that you code hair color as (1 = red, 2 = brown, and 3 = black). Then using these values in a regression, either as a predictor variable or as a response variable, is not sensible.

Correct sampling scheme

The Y must be a random sample from the population of Y values for every X value in the sample. Fortunately, it is not necessary to have a completely random sample from the population, as the regression line is valid even if the X values are deliberately chosen. However, for a given X, the values from the population must be a simple random sample.


No outliers or influential points

All the points must belong to the relationship – there should be no unusual points. The scatter-plot of Y vs. X should be examined. If in doubt, fit the model with the points in and out of the fit and see if this makes a difference in the fit.

Outliers can have a dramatic effect on the fitted line. For example, in the following graph, the single point is an outlier and an influential point:

Equal variation along the line

The variability about the regression line is similar for all values of X, i.e. the scatter of the points above and below the fitted line should be roughly constant over the entire line. This is assessed by looking at the plots of the residuals against X to see if the scatter is roughly uniformly scattered around zero with no increase and no decrease in spread over the entire line.

Independence

Each value of Y is independent of any other value of Y. The most common case where this fails is time series data where X is a time measurement. In these cases, time series analysis should be used.

This assumption can be assessed by again looking at residual plots against time or other variables.

Normality of errors

The difference between the value of Y and the expected value of Y is assumed to be normally distributed. This is one of the most misunderstood assumptions. Many people erroneously assume that the distribution of Y over all X values must be normally distributed, i.e. they look simply at the distribution of the Y’s ignoring the X’s. The assumption only states that the residuals, the differences between the values of Y and the points on the line, must be normally distributed.

This can be assessed by looking at normal probability plots of the residuals. As in ANOVA, for small sample sizes you have little power to detect non-normality, and for large sample sizes it is not that important.
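In SAS, this can be sketched with Proc UNIVARIATE once the residuals have been saved (e.g. with Proc Reg’s OUTPUT statement, as illustrated in Section 14.4.7). The dataset name resids and variable name resid are illustrative assumptions:

proc univariate data=resids normal;
   var resid;
   qqplot resid / normal(mu=est sigma=est);   /* normal probability plot */
run;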


X measured without error

This is a new assumption for regression as compared to ANOVA. In ANOVA, the group membership was always “exact”, i.e. the treatment applied to an experimental unit was known without ambiguity. However, in regression, it can turn out that the X value may not be known exactly.

This general problem is called the “error in variables” problem and has a long history in statistics.

It turns out that there are two important cases. If the value reported for X is a nominal value and the actual value of X varies randomly around this nominal value, then there is no bias in the estimates. This is called the Berkson case after Berkson, who first examined this situation. The most common cases are where the recorded X is a target value (e.g. temperature as set by a thermostat) while the actual X that occurs varies randomly around this target value.

However, if the value used for X is an actual measurement of the underlying X, then there is uncertainty in both the X and Y directions. In this case, estimates of the slope are attenuated towards zero (i.e. positive slopes are biased downwards, negative slopes biased upwards). More alarmingly, the estimates are no longer consistent, i.e. as the sample size increases, the estimates no longer tend to the population values! For example, suppose that yield of a crop is related to amount of rainfall. A rain gauge may not be located exactly at the plot where the crop is grown, but rainfall may be recorded at a nearby weather station a fair distance away. The reading at the weather station is NOT a true reflection of the rainfall at the test plot.

This latter case of “error in variables” is very difficult to analyze properly and there are no universally accepted solutions. Refer to the reference books listed at the start of this chapter for more details.

The problem is set up as follows. Let

   Yi = ηi + εi
   Xi = ξi + δi

with the straight-line relationship between the population (but unobserved) values:

   ηi = β0 + β1ξi

Note that the (population, but unknown) regression equation uses ξi rather than the observed (measured with error) values Xi.

Now if the regression is done on the observed X (i.e. the error-prone measurement), the regression equation reduces to:

   Yi = β0 + β1Xi + (εi − β1δi)

Now this violates the independence assumption of ordinary least squares because the new “error” term is not independent of the Xi variable.

If an ordinary least squares model is fit, the estimated slope is biased (Draper and Smith, 1998, p. 90) with

   E[b1] = β1 − β1 r(ρ + r) / (1 + 2ρr + r²)

where ρ is the correlation between ξ and δ, and r is the ratio of the variance of the error in X to the variance of the error in Y.

The bias is negative, i.e. the estimated slope is too small, in most practical cases (ρ + r > 0). This is known as attenuation of the estimate and, in general, pulls the estimate towards zero.


The bias will be small in the following cases:

• the error variance of X is small relative to the error variance in Y. This means that r is small (i.e. close to zero), and so the bias is also small. In the case where X is measured without error, r = 0 and the bias vanishes as expected.

• if the X values are fixed (the Berkson case) and actually used, then ρ + r = 0 and the bias also vanishes. (For example, a thermostat measures, with error, the actual temperature of a room; if the experiment is based on the thermostat readings rather than the true, unknown temperature, this corresponds to the Berkson case.)

The proper analysis of the error-in-variables case is quite complex – see Draper and Smith (1998, p. 91) for more details.
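The attenuation is easy to see by simulation. In this sketch (all names and constants are illustrative assumptions), the true slope is 2 but the fitted slope based on the error-prone x will be biased well below 2:

data sim;
   call streaminit(20191103);
   do i = 1 to 1000;
      xi = rand('normal', 0, 1);            /* true, unobserved X        */
      x  = xi + rand('normal', 0, 0.8);     /* X measured with error     */
      y  = 2*xi + rand('normal', 0, 0.5);   /* response driven by true X */
      output;
   end;
run;

proc reg data=sim;
   model y = x;   /* slope estimate near 2/(1 + 0.8**2) = 1.22, not 2 */
run;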

14.4.5 Obtaining Estimates

To distinguish between population parameters and sample estimates, we denote the sample intercept by b0 and the sample slope by b1. The equation of the line fitted to a particular sample of points is expressed as

   Ŷi = b0 + b1Xi

where b0 is the estimated intercept, and b1 is the estimated slope. The symbol Ŷ indicates that we are referring to the estimated line and not to a line in the entire population.

How is the best-fitting line found when the points are scattered? We typically use the principle of least squares. The least-squares line is the line that makes the sum of the squares of the deviations of the data points from the line in the vertical direction as small as possible.

Mathematically, the least-squares line is the line that minimizes

   (1/n) Σ (Yi − Ŷi)²    (sum over i = 1, ..., n)

where Ŷi is the point on the line corresponding to each X value. This is also known as the predicted value of Y for a given value of X. This formal definition of least squares is not that important - the concept as expressed in the previous paragraph is more important – in particular, it is the SQUARED deviation in the VERTICAL direction that is used.

It is possible to write out a formula for the estimated intercept and slope, but who cares - let the computer do the dirty work.
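That said, the closed-form estimates are easy to verify by brute force. A minimal sketch using SAS/IML (the dataset name xy and the variable names x and y are illustrative assumptions, not from the original notes):

proc iml;
   use xy;
   read all var {x} into x;
   read all var {y} into y;
   close xy;
   /* b1 = Sxy / Sxx and b0 = ybar - b1*xbar  (# is elementwise multiply) */
   b1 = sum( (x - mean(x)) # (y - mean(y)) ) / sum( (x - mean(x))##2 );
   b0 = mean(y) - b1 * mean(x);
   print b0 b1;
quit;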

The estimated intercept (b0) is the estimated value of Y when X = 0. In some cases, it is meaningless to talk about values of Y when X = 0 because X = 0 is nonsensical. For example, in a plot of income vs. year, it seems kind of silly to investigate income in year 0. In these cases, there is no clear interpretation of the intercept, and it merely serves as a placeholder for the line.

The estimated slope (b1) is the estimated change in Y per unit change in X. For every unit change in the horizontal direction, the fitted line increases by b1 units. If b1 is negative, the fitted line points downwards, and the increase in the line is negative, i.e., actually a decrease.

As with all estimates, a measure of precision can be obtained. As before, this is the standard error of each of the estimates. Again, there are computational formulae but, in this age of computers, these are not important. As before, approximate 95% confidence intervals for the corresponding population parameters are found as estimate ± 2 × se.

Formal tests of hypotheses can also be done. Usually, these are only done on the slope parameter as this is typically of most interest. The null hypothesis is that the population slope is 0, i.e. there is no relationship between Y and X (can you draw a scatter-plot showing such a relationship?). More formally, the null hypothesis is:

   H: β1 = 0

Again notice that the null hypothesis is ALWAYS in terms of a population parameter and not in terms of a sample statistic.

The alternate hypothesis is typically chosen as:

   A: β1 ≠ 0

although one-sided tests looking for either a positive or a negative slope are possible.

The test statistic is found as

   T = (b1 − 0) / se(b1)

and is compared to a t-distribution with the appropriate degrees of freedom to obtain the p-value. This is usually done automatically by most computer packages. The p-value is interpreted in exactly the same way as in ANOVA, i.e. it measures the probability of observing this data if the hypothesis of no relationship were true.

As before, the p-value does not tell the whole story, i.e. statistical vs. biological (non)significance must be determined and assessed.

14.4.6 Obtaining Predictions

Once the best fitting line is found it can be used to make predictions for new values of X .

There are two types of predictions that are commonly made. It is important to distinguish between them as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a SINGLE future individual value for a particular X. Second, the experimenter may be interested in predicting the AVERAGE of ALL future responses at a particular X. (There is actually a third interval, the mean of the next “m” individual values, but this is rarely encountered in practice.) The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.

Both of the above intervals should be distinguished from the confidence interval for the slope.

In both cases, the estimate is found in the same manner – substitute the new value of X into the equation and compute the predicted value Ŷ. In most computer packages this is accomplished by inserting a new “dummy” observation in the dataset with the value of Y missing, but the value of X present. The missing Y value prevents this new observation from being used in the fitting process, but the X value allows the package to compute an estimate for this observation.

What differs between the two predictions are the estimates of uncertainty.

In the first case, there are two sources of uncertainty involved in the prediction. First, there is the uncertainty caused by the fact that this estimated line is based upon a sample. Then there is the additional uncertainty that the value could be above or below the predicted line. This interval is often called a prediction interval at a new X.


In the second case, only the uncertainty caused by estimating the line based on a sample is relevant. This interval is often called a confidence interval for the mean at a new X.

The prediction interval for an individual response is typically MUCH wider than the confidence interval for the mean of all future responses because it must account for the uncertainty from the fitted line plus individual variation around the fitted line.

Many textbooks have the formulae for the se for the two types of predictions but, again, there is little to be gained by examining them. What is important is that you read the documentation carefully to ensure that you understand exactly what interval is being given to you.
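In SAS’s Proc Reg, both intervals can be requested on the model statement: cli produces prediction intervals for an individual response, and clm produces confidence intervals for the mean response. A minimal sketch (dataset and variable names are illustrative assumptions); rows whose Y is missing still receive predictions and intervals but do not influence the fit:

proc reg data=xy;
   model y = x / cli clm;   /* cli: individual response; clm: mean response */
run;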

14.4.7 Residual Plots

After the curve is fit, it is important to examine if the fitted curve is reasonable. This is done using residuals. The residual for a point is the difference between the observed value and the predicted value, i.e., the residual from fitting a straight line is found as: residuali = Yi − (b0 + b1Xi) = Yi − Ŷi.

There are several standard residual plots:

• plot of residuals vs. predicted values (Ŷ);

• plot of residuals vs. X;

• plot of residuals vs. time ordering.

In all cases, the residual plots should show random scatter around zero with no obvious pattern. Don’t plot residuals vs. Y - this will lead to odd-looking plots which are an artifact of the plot and don’t mean anything.
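A sketch of how these plots can be produced in SAS: save the predicted values and residuals with Proc Reg’s OUTPUT statement, then plot them with Proc SGplot (dataset and variable names are illustrative assumptions):

proc reg data=xy noprint;
   model y = x;
   output out=resids p=pred r=resid;   /* predicted values and residuals */
run;

proc sgplot data=resids;
   scatter x=pred y=resid / markerattrs=(symbol=circlefilled);
   refline 0 / axis=y;   /* the scatter should be patternless around 0 */
run;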

14.4.8 Example - Yield and fertilizer

We wish to investigate the relationship between yield (Liters) and fertilizer (kg/ha) for tomato plants. An experiment was conducted in the Schwarz household one summer on 11 plots of land where the amount of fertilizer was varied and the yield measured at the end of the season.

The amount of fertilizer (randomly) applied to each plot was chosen between 5 and 18 kg/ha. While the levels were not systematically chosen (e.g. they were not evenly spaced between the highest and lowest values), they represent commonly used amounts based on a preliminary survey of producers. At the end of the experiment, the yields were measured and the following data were obtained.

Interest also lies in predicting the yield when 16 kg/ha are assigned.


Fertilizer   Yield
(kg/ha)      (Liters)

    12          24
     5          18
    15          31
    17          33
    14          30
     6          20
    11          25
    13          27
    15          31
     8          21
    18          29

The data is available in the fertilizer.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into SAS in the usual fashion:

data tomato;
   infile 'fertilizer.csv' dlm=',' dsd missover firstobs=2;
   input fertilizer yield;
run;

Note that both variables are numeric (SAS doesn’t have the concept of scale of variables) and that an extra row was added to the data table with the value of 16 for fertilizer and the yield left missing. The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data are sorted by the X value and the rows for future predictions are placed at the end of the dataset.

The raw data are shown below:

Obs   fertilizer   yield

  1        5         18
  2        6         20
  3        8         21
  4       11         25
  5       12         24
  6       13         27
  7       14         30
  8       15         31
  9       15         31
 10       17         33
 11       18         29
 12       16          .


In this study, it is quite clear that fertilizer is the predictor (X) variable, while the response variable (Y) is the yield.

The population consists of all possible field plots with all possible tomato plants of this type grown under all possible fertilizer levels between about 5 and 18 kg/ha.

If all of the population could be measured (which it can’t), you could find a relationship between the yield and the amount of fertilizer applied. This relationship would have the form:

   Y = β0 + β1 × (amount of fertilizer) + ε

where β0 and β1 represent the population intercept and population slope respectively. The term ε represents random variation that is always present, i.e. even if the same plot was grown twice in a row with the same amount of fertilizer, the yield would not be identical (why?).

The population parameters to be estimated are β0, the population average yield when the amount of fertilizer is 0, and β1, the population average change in yield per unit change in the amount of fertilizer. These are taken over all plants in all possible field plots of this type. The values of β0 and β1 are impossible to obtain, as the entire population could never be measured.

Notice how missing values are represented in the raw data listing above (a single period).

Start by plotting the data using Proc SGplot:

proc sgplot data=tomato;
   title2 'Preliminary data plot';
   scatter y=yield x=fertilizer / markerattrs=(symbol=circlefilled);
   yaxis label='Yield' offsetmin=.05 offsetmax=.05;
   xaxis label='Fertilizer' offsetmin=.05 offsetmax=.05;
run;


The relationship looks approximately linear; there do not appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

We use Proc Reg to fit the regression model:

ods graphics on;
proc reg data=tomato plots=all;
   title2 'fit the model';
   model yield = fertilizer / all;
   ods output OutputStatistics=Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;

The model statement is what tells SAS that the response variable is yield because it appears to the left of the equal sign, and that the predictor variable is fertilizer because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statements request that some output statistics be placed into a dataset called Predictions and the parameter estimates into a dataset called Estimates.

Part of the output includes the coefficients and their standard errors:


Variable     DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                  Estimate    Error                           CL Parameter   CL Parameter
Intercept     1   12.85602    1.69378     7.59     <.0001      9.02442       16.68763
fertilizer    1    1.10137    0.13175     8.36     <.0001      0.80333        1.39941

The estimated regression line is

Y = b0 + b1(fertilizer) = 12.856 + 1.10137(amount of fertilizer)

In terms of estimates, b0=12.856 is the estimated intercept, and b1=1.101 is the estimated slope.

The estimated slope is the estimated change in yield when the amount of fertilizer is increased by 1 unit. In this case, the yield is expected to increase (why?) by 1.10137 L when the fertilizer amount is increased by 1 kg/ha. NOTE that the slope is the CHANGE in Y when X increases by 1 unit - not the value of Y when X = 1.

The estimated intercept is the estimated yield when the amount of fertilizer is 0. In this case, the estimated yield when no fertilizer is added is 12.856 L. In this particular case the intercept has a meaningful interpretation, but I'd be worried about extrapolating outside the range of the observed X values. If the intercept is 12.85, why does the line intersect the left part of the graph at about 15 rather than closer to 13?

Once again, these are the results from a single experiment. If the experiment were repeated, you would obtain different estimates (b0 and b1 would change). The sampling distribution over all possible experiments describes the variation in b0 and b1; the standard deviation of b0 and b1 over these hypothetical repetitions is again referred to as the standard error of b0 and b1.

The formulae for the standard errors of b0 and b1 are messy and hopeless to compute by hand. Just as with inference for a mean or a proportion, the program automatically computes the se of the regression estimates.

The estimated standard error for b1 (the estimated slope) is 0.132 L/kg. This is an estimate of the standard deviation of b1 over all possible experiments. Normally, the intercept is of limited interest, but a standard error can also be found for it as shown in the above table.

Using exactly the same logic as when we found a confidence interval for the population mean, or for the population proportion, a confidence interval for the population slope (β1) is found (approximately) as b1 ± 2(estimated se). In the above example, an approximate confidence interval for β1 is found as

1.101 ± 2 × (.132) = 1.101 ± .264 = (.837 → 1.365) L/kg

of fertilizer applied.

The “exact” confidence interval is based on the t-distribution and is slightly wider than our approximate confidence interval because the total sample size (11 pairs of points) is rather small. We interpret this interval as ‘being 95% confident that the population increase in yield when the amount of fertilizer is increased by one unit is somewhere between (.837 to 1.365) L/kg.’

Be sure to carefully distinguish between β1 and b1. Note that the confidence interval is computed using b1, but is a confidence interval for β1 - the population parameter that is unknown.
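To make the arithmetic concrete, the approximate interval can be reproduced with a short data step. This is only a sketch using the estimate and standard error reported above, not part of the original output; the exact interval printed by SAS uses the t-multiplier t(.975, 9) ≈ 2.26 rather than 2, which is why the printed limits (0.80333 to 1.39941) are slightly wider.

data slope_ci;
   b1    = 1.10137;         /* estimated slope from the output above */
   se_b1 = 0.13175;         /* its estimated standard error          */
   lower = b1 - 2*se_b1;    /* approximate lower 95% limit           */
   upper = b1 + 2*se_b1;    /* approximate upper 95% limit           */
   put 'Approximate 95% confidence interval for the slope: ' lower= upper=;
run;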


In linear regression problems, one hypothesis of interest is whether the population slope is zero. This would correspond to no linear relationship between the response and predictor variable (why?). Again, this is a good time to read the papers by Cherry and Johnson about the dangers of uncritical use of hypothesis testing. In many cases, a confidence interval tells the entire story.

The test of hypothesis about the intercept is not of interest (why?).

Let

• β1 be the population (unknown) slope.

• b1 be the estimated slope. In this case b1 = 1.1014.

The hypothesis testing proceeds as follows. Again note that we are interested in the population parameters and not the sample statistics.

1. Specify the null and alternate hypothesis:

   H: β1 = 0
   A: β1 ≠ 0.

   Notice that the null hypothesis is in terms of the population parameter β1. This is a two-sided test as we are interested in detecting differences from zero in either direction.

2. Find the test statistic and the p-value. The test statistic is computed as:

   T = (estimate − hypothesized value) / (estimated se) = (1.1014 − 0) / .132 = 8.36

   In other words, the estimate is over 8 standard errors away from the hypothesized value!

   This will be compared to a t-distribution with n − 2 = 9 degrees of freedom. The p-value is found to be very small (less than 0.0001).

3. Conclusion. There is strong evidence that the population slope is not zero. This is not too surprising given that the 95% confidence intervals show that plausible values for the population slope are from about .8 to about 1.4.

It is possible to construct tests of the slope equal to some value other than 0; most packages can't do this directly. You would compute the T value as shown above, replacing the value 0 with the hypothesized value.

It is also possible to construct one-sided tests. Most computer packages only do two-sided tests. Proceed as above, but the one-sided p-value is the two-sided p-value reported by the package divided by 2.
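As a sketch of how these computations could be done by hand in SAS (using the estimates shown above and the built-in probt() cumulative t-distribution function; the dataset and variable names here are invented for illustration):

data slope_test;
   b1    = 1.10137;                     /* estimated slope                        */
   se_b1 = 0.13175;                     /* its standard error                     */
   df    = 9;                           /* n - 2 = 11 - 2 degrees of freedom      */
   h0    = 0;                           /* hypothesized slope; replace 0 with any */
                                        /* other value of interest                */
   t     = (b1 - h0)/se_b1;             /* test statistic                         */
   p_two = 2*(1 - probt(abs(t), df));   /* two-sided p-value                      */
   p_one = p_two/2;                     /* one-sided p-value (when the estimate   */
                                        /* lies in the hypothesized direction)    */
   put t= p_two= p_one=;
run;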

If sufficient evidence is found against the hypothesis, a natural question to ask is ‘well, what values of the parameter are plausible given this data?’ This is exactly what a confidence interval tells you. Consequently, I usually prefer to find confidence intervals, rather than doing formal hypothesis testing.

What about making predictions for future yields when certain amounts of fertilizer are applied? For example, what would be the future yield when 16 kg/ha of fertilizer are applied?

The predicted value is found by substituting the new X into the estimated regression line.

Y = b0 + b1(fertilizer) = 12.856 + 1.10137(16) = 30.48 L


Predictions for existing and new X values are obtained automatically with the all option on the model statement and can be output to a dataset using the ODS statement and appropriate table name, and then merged back with the original data:

data predictions;   /* merge the original data set with the predictions */
   merge tomato predictions;
run;

Here is a listing of the prediction dataset.

Obs  fertilizer  yield  Predicted  StdErr Mean  Lower 95%  Upper 95%  Lower 95%   Upper 95%
                        Value      Predict      CL Mean    CL Mean    CL Predict  CL Predict
  1       5       18    18.3629    1.0901       15.8969    20.8288    13.6120     23.1138
  2       6       20    19.4643    0.9779       17.2521    21.6764    14.8400     24.0885
  3       8       21    21.6670    0.7723       19.9198    23.4141    17.2463     26.0877
  4      11       25    24.9711    0.5632       23.6971    26.2451    20.7151     29.2271
  5      12       24    26.0725    0.5418       24.8469    27.2981    21.8308     30.3142
  6      13       27    27.1738    0.5519       25.9254    28.4223    22.9255     31.4222
  7      14       30    28.2752    0.5919       26.9363    29.6142    23.9994     32.5511
  8      15       31    29.3766    0.6564       27.8918    30.8614    25.0529     33.7003
  9      15       31    29.3766    0.6564       27.8918    30.8614    25.0529     33.7003
 10      17       33    31.5793    0.8342       29.6922    33.4665    27.1015     36.0572
 11      18       29    32.6807    0.9384       30.5579    34.8035    28.0985     37.2629
 12      16        .    30.4780    0.7389       28.8064    32.1495    26.0866     34.8693

As noted earlier, there are two types of estimates of precision associated with predictions using the regression line. It is important to distinguish between them, as these two intervals are the source of much confusion in regression problems.

First, the experimenter may be interested in predicting a single FUTURE individual value for a particular X. This would correspond to the predicted yield for a single future plot with 16 kg/ha of fertilizer added.

Second, the experimenter may be interested in predicting the average of ALL FUTURE responses at a particular X. This would correspond to the average yield of all future plots when 16 kg/ha of fertilizer is added. The prediction interval for an individual response is sometimes called a confidence interval for an individual response, but this is an unfortunate (and incorrect) use of the term confidence interval. Strictly speaking, confidence intervals are computed for fixed unknown parameter values; prediction intervals are computed for future random variables.
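The two intervals can also be requested directly with the p, clm, and cli options on the model statement (these options are used again in the dioxin example later in this chapter); a minimal sketch:

proc reg data=tomato;
   /* p   = print predicted values                       */
   /* clm = confidence limits for the MEAN response      */
   /* cli = prediction limits for an INDIVIDUAL response */
   model yield = fertilizer / p clm cli;
run;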

Both of these intervals and the fitted line can be plotted using Proc SGplot:

proc sgplot data=Predictions;
   title2 'Fitted line and confidence curves for mean and individual values';
   band x=fertilizer lower=lowerCL upper=upperCL;
   band x=fertilizer lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue x=Fertilizer;
   scatter y=yield x=fertilizer / markerattrs=(symbol=circlefilled);
   yaxis label='Yield' offsetmin=.05 offsetmax=.05;
   xaxis label='Fertilizer' offsetmin=.05 offsetmax=.05;
run;

Later versions of SAS also produce these plots using the ODS Graphics options:


The innermost set of lines represents the confidence bands for the mean response. The outermost band of lines represents the prediction intervals for a single future response. As noted earlier, the latter must be wider than the former to account for an additional source of variation.

Here the predicted yield for a single future trial at 16 kg/ha is 30.5 L, but the 95% prediction interval is between 26.1 and 34.9 L. The predicted AVERAGE yield for ALL future plots when 16 kg/ha of fertilizer is applied is also 30.5 L, but the 95% confidence interval for the MEAN yield is between 28.8 and 32.1 L.

Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced by the ODS Graphics option turned on prior to invoking Proc Reg:


There is no evidence of any problems, except perhaps some excess influence for the last observation as measured by Cook's D statistic.

The residuals are simply the difference between the actual data point and the corresponding spot on the line measured in the vertical direction. The residual plot shows no trend in the scatter around the value of zero.

14.4.9 Example - Mercury pollution

Mercury pollution is a serious problem in some waterways. Mercury levels often increase after a lake is flooded due to leaching of naturally occurring mercury by the higher levels of the water. Excessive consumption of mercury is well known to be deleterious to human health. It is difficult and time consuming to measure every person's mercury level. It would be nice to have a quick procedure that could be used to estimate the mercury level of a person based upon the average mercury level found in fish and estimates of the person's consumption of fish in their diet. The following data were collected on the methyl mercury intake of subjects and the actual mercury levels recorded in the blood stream from a random sample of people around recently flooded lakes.


Here are the raw data:

Methyl Mercury   Mercury in
Intake           whole blood
(ug Hg/day)      (ng/g)
   180                90
   200               120
   230               125
   410               290
   600               310
   550               290
   275               170
   580               375
   600               150
   105                70
   250               105
    60               205
   650               480

The data is available in the mercury.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into SAS in the usual fashion:

data mercury;
   infile 'mercury.csv' dlm=',' dsd missover firstobs=2;
   input intake blood;
   plot_symbol=1;
   if intake = 60 then plot_symbol=2;                   /* identify two potential outliers */
   if intake = 600 and blood < 200 then plot_symbol=2;
run;

Note that both variables are numeric (SAS doesn't have the concept of scale of variables). I create a new variable plot_symbol, based on the analysis that follows, to illustrate the presence of potential outliers. The raw data are shown below:

Obs  intake  blood  plot_symbol
  1    180     90        1
  2    200    120        1
  3    230    125        1
  4    410    290        1
  5    600    310        1
  6    550    290        1
  7    275    170        1
  8    580    375        1
  9    600    150        2
 10    105     70        1
 11    250    105        1
 12     60    205        2
 13    650    480        1

The ordering of the rows in the data table is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value.

The population of interest is the people living around recently flooded lakes.

This experiment is an analytical survey, as it is quite impossible to randomly assign people different amounts of mercury in their food intake. Consequently, the key assumption is that the subjects chosen to be measured are random samples from those with similar mercury intakes. Note it is NOT necessary for this to be a random sample from the ENTIRE population (why?).

The explanatory variable is the amount of mercury ingested by a person. The response variable is the amount of mercury in the blood stream.

Start by plotting the data using Proc SGplot. Notice the use of the Proc Template procedure to assign different plotting symbols to the points depending on the value of the plot_symbol variable defined when the data was read.

proc template;
   define style styles.mystyle;
      parent=styles.default;
      style graphdata1 from graphdata1 / markersymbol="CircleFilled" color=black contrastcolor=black;
      style graphdata2 from graphdata2 / markersymbol="X"            color=black contrastcolor=black;
   end;
run;

ods html style=mystyle;

proc sgplot data=mercury;
   title2 'Preliminary data plot';
   scatter y=blood x=intake / group=plot_symbol markerattrs=(size=10px);
   yaxis label='Blood Mercury' offsetmin=.05 offsetmax=.05;
   xaxis label='Intake Mercury' offsetmin=.05 offsetmax=.05;
run;


There appear to be two outliers (identified by an X). To illustrate the effects of these outliers upon the estimates and the residual plots, the line was first fit using all of the data.

We use Proc Reg to fit the regression model:

ods graphics on;
proc reg data=mercury plots=all;
   title2 'Fit the model to all of the data';
   model blood = intake / all;
   ods output OutputStatistics=Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;

The model statement is what tells SAS that the response variable is the blood mercury level because it appears to the left of the equal sign, and that the predictor variable is the food intake mercury level because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statement requests that some output statistics be placed into a dataset called Predictions.


Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced by the ODS Graphics option prior to invoking Proc Reg.

The residual plot shows the clear presence of the two outliers, but also identifies a third potential outlier not evident from the original scatter-plot (can you find it?).

The data were rechecked, and it appears that there was an error in the blood work used in determining the readings. Consequently, these points were removed for the subsequent fit.

We remove the outliers:


data mercury2;   /* delete the outliers */
   set predictions;
   if plot_symbol ^= 1 then delete;
   if residual > 100 then delete;
   keep blood intake;
run;

The revised raw data are shown below:

Obs  intake  blood
  1    180     90
  2    200    120
  3    230    125
  4    410    290
  5    600    310
  6    550    290
  7    275    170
  8    580    375
  9    105     70
 10    250    105

We use Proc Reg to fit the model again, as shown previously. Part of the output includes the coefficients and their standard errors:

Variable    DF   Parameter   Standard    t Value   Pr > |t|   Lower 95%      Upper 95%
                 Estimate    Error                            CL Parameter   CL Parameter
Intercept    1   -1.95169    22.71513    -0.09     0.9336     -54.33288      50.42950
intake       1    0.58122     0.05983     9.71     <.0001       0.44325       0.71919

The estimated regression line (after removing outliers) is

Blood = −1.95169 + 0.58122 × Intake

The estimated slope of 0.58 indicates that the mercury level in the blood increases by 0.58 ng/g when the intake level in the food is increased by 1 ug/day. The intercept has no real meaning in the context of this experiment; the negative value is merely a placeholder for the line. Also notice that the estimated intercept is not very precise in any case (how do I know this, and what implications does this have for worrying that it is not zero?)4

What would the impact of the outliers upon the estimated slope and intercept have been if they had been retained?

4 It is possible to fit a regression line that is constrained to go through Y = 0 when X = 0. These must be fit carefully and are not covered in this course.


The estimated slope has been determined relatively well (relative standard error of about 10% – how is the relative standard error computed?). There is clear evidence that the hypothesis of no relationship between blood mercury levels and food mercury levels is not tenable.
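For the record, the relative standard error is just the standard error expressed as a fraction of the estimate: 0.05983/0.58122 ≈ 0.10, or about 10%.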

The two types of predictions would also be of interest in this study. First, an individual would like to know the impact upon personal health. Secondly, the average level would be of interest to public health authorities.

Predictions for existing and new X values are obtained automatically with the all option on the model statement and can be output to a dataset using the ODS statement and appropriate table name, and then merged back with the original data:

data predictions2;   /* merge the original data set with the predictions */
   merge mercury2 predictions2;
run;

Here is a listing of the prediction dataset.

Obs  intake  blood  Predicted  StdErr Mean  Lower 95%  Upper 95%  Lower 95%   Upper 95%
                    Value      Predict      CL Mean    CL Mean    CL Predict  CL Predict
  1    180     90   102.6676   14.0140       70.3511   134.9840    20.5945    184.7406
  2    200    120   114.2919   13.2364       83.7687   144.8151    32.9083    195.6756
  3    230    125   131.7285   12.1977      103.6004   159.8565    51.2125    212.2444
  4    410    290   236.3477   11.2067      210.5051   262.1903   156.6014    316.0940
  5    600    310   346.7791   18.7816      303.4687   390.0896   259.7881    433.7701
  6    550    290   317.7182   16.3680      279.9734   355.4630   233.3600    402.0764
  7    275    170   157.8833   11.0109      132.4921   183.2745    78.2821    237.4844
  8    580    375   335.1548   17.7951      294.1191   376.1904   249.2737    421.0358
  9    105     70    59.0762   17.3598       19.0443    99.1081   -26.3298    144.4822
 10    250    105   143.3528   11.6083      116.5840   170.1216    63.3016    223.4041

and the two intervals can be plotted on the same graph in a similar fashion as in the Fertilizer example, giving:

Later versions of SAS also produce these plots using the ODS Graphics options:


Residual plots (and other diagnostic plots to assess the fit of the model) are again produced automatically by the ODS Graphics option prior to invoking Proc Reg, and now show no problems.

14.4.10 Example - The Anscombe Data Set

Anscombe (1973, American Statistician 27, 17-21) created a set of 4 data sets that were quite remarkable. All four datasets gave exactly the same results when a regression line was fit, yet are quite different in their interpretation.

The Anscombe data is available in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets. Fitting of regression lines to this data will be demonstrated in class.


14.4.11 Transformations

In some cases, the plot of Y vs. X is obviously non-linear and a transformation of X or Y may be used to establish linearity. For example, many dose-response curves are linear in log(X). Or the equation may be intrinsically non-linear, e.g. a weight-length relationship is of the form weight = β0 × length^β1. Or, some variables may be recorded in an arbitrary scale, e.g. should the fuel efficiency of a car be measured in L/100 km or km/L? You are already familiar with some variables measured on the log-scale - pH is a common example.

Often a visual inspection of a plot may identify the appropriate transformation.

There is no theoretical difficulty in fitting a linear regression using transformed variables other than an understanding of the implicit assumption about the error structure. The model for a fit on transformed data is of the form

trans(Y) = β0 + β1 × trans(X) + error

Note that the error is assumed to act additively on the transformed scale. All of the assumptions of linear regression are assumed to act on the transformed scale – in particular that the population standard deviation around the regression line is constant on the transformed scale.

The most common transformation is the logarithmic transform. It doesn't matter if the natural logarithm (often called the ln function) or the common logarithm (often called the log10 transformation) is used. There is a 1-1 relationship between the two transformations, and linearity on one transform is preserved on the other. The only change is that values on the ln scale are 2.302 = ln(10) times those on the log10 scale, which implies that the estimated slope and intercept both differ by a factor of 2.302. There is some confusion in scientific papers about the meaning of log - some papers use it to refer to the ln transformation, while others use it to refer to the log10 transformation.
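For example, if a fit on the log10 scale gave an estimated slope of 0.05 (a made-up value for illustration), refitting after a ln transformation would give an estimated slope of approximately 2.302 × 0.05 = 0.115; after back-transformation, the two fits give identical predictions.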

After the regression model is fit, remember to interpret the estimates of slope and intercept on the transformed scale. For example, suppose that a ln(Y) transformation is used with time t as the predictor. Then we have

ln(Y(t+1)) = b0 + b1 × (t + 1)
ln(Y(t)) = b0 + b1 × t

and, subtracting,

ln(Y(t+1)) − ln(Y(t)) = ln( Y(t+1) / Y(t) ) = b1 × (t + 1 − t) = b1

Taking anti-logs of both sides gives

exp( ln( Y(t+1) / Y(t) ) ) = Y(t+1) / Y(t) = exp(b1) = e^b1

Hence a one unit increase in X causes Y to be MULTIPLIED by e^b1. As an example, suppose that on the log-scale the estimated slope was −.07. Then every unit change in X causes Y to change by a multiplicative factor of e^−.07 = .93, i.e. roughly a 7% decline per year.5

Similarly, predictions made on the transformed scale must be back-transformed to the untransformed scale.

In some problems, scientists search for the ‘best’ transform. This is not an easy task, and using simple statistics such as R2 to search for the best transformation should be avoided. Seek help if you need to find the best transformation for a particular dataset.

JMP makes it particularly easy to fit regressions to transformed data as shown below. SAS and R have an extensive array of functions so that you can create new variables based on the transformation of an existing variable.

5 It can be shown that, for smallish values of the slope on the log scale, the change is almost the same on the untransformed scale, i.e. if the slope is −.07 on the log scale, this implies roughly a 7% decline per year; a slope of +.07 implies roughly a 7% increase per year.


14.4.12 Example: Monitoring Dioxins - transformation

An unfortunate byproduct of pulp-and-paper production used to be dioxins - a very hazardous material. This material was discharged into waterways with the pulp-and-paper effluent, where it bioaccumulated in living organisms such as crabs. Newer processes have eliminated this byproduct, but the dioxins in the organisms take a long time to degrade.

Government environmental protection agencies take samples of crabs from affected areas each year and measure the amount of dioxins in the tissue. The following example is based on a real study.

Each year, four crabs are captured from a monitoring station. The liver is excised and the livers from all four crabs are composited together into a single sample.6 The dioxin levels in this composite sample are measured. As there are many different forms of dioxins with different toxicities, a summary measure, called the Total Equivalent Dose (TEQ), is computed from the sample.

Here is the raw data.

Site  Year    TEQ
 a    1990   179.05
 a    1991    82.39
 a    1992   130.18
 a    1993    97.06
 a    1994    49.34
 a    1995    57.05
 a    1996    57.41
 a    1997    29.94
 a    1998    48.48
 a    1999    49.67
 a    2000    34.25
 a    2001    59.28
 a    2002    34.92
 a    2003    28.16

The data is available in the dioxinTEQ.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into SAS in the usual fashion:

data teq;
   infile "dioxinTEQ.csv" dlm=',' dsd missover firstobs=2;
   input site $ year TEQ;
   logTEQ = log(TEQ);   /* compute the log TEQ values */
   attrib logTEQ label='log(TEQ)' format=7.2;
run;

6 Compositing is a common analytical tool. There is little loss of useful information induced by the compositing process - the only loss of information is the among individual-sample variability, which can be used to determine the optimal allocation between samples within years and the number of years to monitor.


Note that the year and TEQ variables are numeric (SAS doesn't have the concept of scale of variables) and that an extra row was added to the data table with the value of 2010 for the year and the TEQ left missing. The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. The value of ln(TEQ) is computed in the data step.

The raw data are shown below:

Obs  site  year    TEQ    log(TEQ)
  1   a    1990  179.05     5.19
  2   a    1991   82.39     4.41
  3   a    1992  130.18     4.87
  4   a    1993   97.06     4.58
  5   a    1994   49.34     3.90
  6   a    1995   57.05     4.04
  7   a    1996   57.41     4.05
  8   a    1997   29.94     3.40
  9   a    1998   48.48     3.88
 10   a    1999   49.67     3.91
 11   a    2000   34.25     3.53
 12   a    2001   59.28     4.08
 13   a    2002   34.92     3.55
 14   a    2003   28.16     3.34
 15   a    2010     .        .

As with all analyses, start with a preliminary plot of the data. We use Proc SGplot to get a scatterplot:

proc sgplot data=teq;
   title2 'Preliminary data plot';
   scatter y=TEQ x=year / markerattrs=(symbol=circlefilled);
   yaxis label='TEQ' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;


The preliminary plot of the data shows a decline in levels over time, but it is clearly non-linear. Why is this so? In many cases, a fixed fraction of the dioxins degrades per year, e.g. a 10% decline per year. This can be expressed as a non-linear relationship:

TEQ = C × r^t

where C is the initial concentration, r is the fraction remaining after each year (e.g. r = 0.90 corresponds to a 10% decline per year), and t is the elapsed time. If this is plotted over time, this leads to the non-linear pattern seen above.

If logarithms are taken, this leads to the relationship:

log(TEQ) = log(C) + t × log(r)

which can be expressed as:

log(TEQ) = β0 + β1 × t

which is the equation of a straight line with β0 = log(C) and β1 = log(r).

The log(TEQ) was computed in the dataset seen earlier.

A plot of log(TEQ) vs. year gives the following:


The relationship looks approximately linear; there do not appear to be any outliers or influential points; the scatter appears to be roughly equal across the entire regression line. Residual plots will be used later to check these assumptions in more detail.

We use Proc Reg to fit the regression model:

ods graphics on;
proc reg data=teq plots=all;
   title2 'fit the model';
   model logTEQ = year / all;
   ods output OutputStatistics=Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;

The model statement is what tells SAS that the response variable is logTEQ because it appears to the left of the equal sign, and that the predictor variable is year because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statement requests that some output statistics be placed into a dataset called Predictions.

Part of the output includes the coefficients and their standard errors:


Variable     DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                  Estimate    Error                           CL Parameter   CL Parameter
Intercept     1   218.91364   42.79187    5.12     0.0003     125.67816      312.14911
year          1    -0.10762    0.02143   -5.02     0.0003      -0.15432       -0.06092

The fitted line is:

log(TEQ) = 218.9 − .11(year)

The intercept (218.9) would be the log(TEQ) in the year 0, which is clearly nonsensical. The slope (−.11) is the estimated log(ratio) from one year to the next. For example, exp(−.11) = .898 would mean that the TEQ in one year is only 89.8% of the TEQ in the previous year, or roughly an 11% decline per year.7

The standard error of the estimated slope is .02. If you want to find the standard error of the anti-log of the estimated slope, you DO NOT take exp(0.02). Rather, the standard error of the anti-logged value is found as se_antilog = se_log × exp(slope) = 0.02 × .898 = .01796.8
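A sketch of this delta-method computation in a data step, using the unrounded estimates from the output above (the text uses the rounded values 0.02 and .898; the dataset name is invented for illustration):

data se_antilog;
   b1       = -0.10762;        /* estimated slope on the log scale          */
   se_b1    =  0.02143;        /* its standard error                        */
   ratio    = exp(b1);         /* estimated year-to-year ratio, about .898  */
   se_ratio = se_b1 * ratio;   /* delta-method se of the anti-logged slope  */
   put ratio= se_ratio=;
run;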

The 95% confidence interval for the slope is given automatically by SAS when the clb option is specified on the model statement.

The 95% confidence interval for the slope (on the log-scale) is (−.154 to −.061). If you take the anti-logs of the endpoints, this gives a 95% confidence interval for the fraction of TEQ that remains from year to year, i.e. between 0.86 and 0.94 of the TEQ in one year remains to the next year.

As always, the model diagnostics should be inspected early on in the process. Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced by the ODS Graphics option prior to invoking Proc Reg:

7 It can be shown that, in regressions of log(Y) vs. time, the estimated slope on the logarithmic scale is the approximate fractional decline per time interval. For example, in the above, the estimated slope of −.11 corresponds to an approximate 11% decline per year. This approximation only works well when the slopes are small, i.e. close to zero.

8 This is computed using a method called the delta-method.


The residual plot looks fine with no apparent problems, but the dip in the middle years could require further exploration if this pattern was apparent in other sites as well. This type of pattern may be evidence of autocorrelation.

Here there is no evidence of auto-correlation so we can proceed without worries.

Several types of predictions can be made. For example, what would be the estimated mean logTEQ in 2010? What is the range of logTEQ's in 2010? Again, refer back to previous chapters about the differences in predicting a mean response and predicting an individual response.

This is accomplished by adding a dummy observation to the existing data set.9 Each of the dummy observations should have the time for which predictions are to be made present, but no Y value. Such observations will be ignored during the fitting process, but predictions are made by requesting the p, clm, and cli options on the model statement.

9 These can be placed anywhere, but are traditionally placed at the end of the dataset.

Predictions for existing and new X values are obtained automatically with the all option on the model statement and can be output to a dataset using the ODS statement and appropriate table name, and then merged back with the original data:

data predictions;   /* merge the original data set with the predictions */
   merge teq predictions;
run;

Here is a listing of the prediction dataset.

Obs  year    TEQ   log(TEQ)  Predicted  StdErr Mean  Lower 95%  Upper 95%  Lower 95%   Upper 95%
                             Value      Predict      CL Mean    CL Mean    CL Predict  CL Predict
  1  1990  179.05    5.19    4.7516     0.1639       4.3944     5.1088     3.9618      5.5413
  2  1991   82.39    4.41    4.6440     0.1462       4.3255     4.9624     3.8710      5.4170
  3  1992  130.18    4.87    4.5364     0.1295       4.2542     4.8185     3.7776      5.2951
  4  1993   97.06    4.58    4.4287     0.1144       4.1794     4.6780     3.6815      5.1759
  5  1994   49.34    3.90    4.3211     0.1017       4.0996     4.5426     3.5827      5.0595
  6  1995   57.05    4.04    4.2135     0.0922       4.0126     4.4144     3.4810      4.9459
  7  1996   57.41    4.05    4.1059     0.0871       3.9162     4.2956     3.3764      4.8353
  8  1997   29.94    3.40    3.9983     0.0871       3.8086     4.1880     3.2688      4.7277
  9  1998   48.48    3.88    3.8906     0.0922       3.6898     4.0915     3.1582      4.6231
 10  1999   49.67    3.91    3.7830     0.1017       3.5615     4.0045     3.0446      4.5214
 11  2000   34.25    3.53    3.6754     0.1144       3.4261     3.9247     2.9282      4.4226
 12  2001   59.28    4.08    3.5678     0.1295       3.2856     3.8499     2.8090      4.3266
 13  2002   34.92    3.55    3.4602     0.1462       3.1417     3.7786     2.6871      4.2332
 14  2003   28.16    3.34    3.3525     0.1639       2.9954     3.7097     2.5628      4.1423
 15  2010     .       .      2.5992     0.3020       1.9413     3.2572     1.6353      3.5631

Both of these intervals and the fitted line can be plotted using Proc SGplot:

proc sgplot data=Predictions;
   title2 'Fitted line and confidence curves for mean and individual values';
   band x=Year lower=lowerCL upper=upperCL;
   band x=Year lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue x=Year;
   scatter y=logTEQ x=Year / markerattrs=(symbol=circlefilled);
   yaxis label='logTEQ' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;


Later versions of SAS also produce these plots using the ODS Graphics options:


The estimated mean log(TEQ) in 2010 is 2.60 (corresponding to an estimated MEDIAN TEQ of exp(2.60) = 13.46). A 95% confidence interval for the mean log(TEQ) is (1.94 to 3.26), corresponding to a 95% confidence interval for the actual MEDIAN TEQ of between (6.96 and 26.05).10 Note that the confidence interval after taking anti-logs is no longer symmetrical.

10 A minor correction can be applied to estimate the mean if required.

Why does a mean of a logarithm transform back to the median on the untransformed scale? Basically, because the transformation is non-linear, properties such as means and standard errors cannot simply be anti-transformed without introducing some bias. However, measures of location (such as a median) are unaffected. On the transformed scale, it is assumed that the sampling distribution about the estimate is symmetrical, which makes the mean and median take the same value. So what really is happening is that the median on the transformed scale is back-transformed to the median on the untransformed scale.
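For reference, the ‘minor correction’ mentioned in the footnote is the standard lognormal adjustment: if the errors on the log scale are normal with variance σ², then the MEAN on the untransformed scale is exp(mean of the logs + σ²/2), so the back-transformed estimate would be multiplied by exp(s²/2), with s² the residual variance from the fit. This is a general property of the lognormal distribution rather than anything specific to this example.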

Similarly, a 95% prediction interval for the log(TEQ) for an INDIVIDUAL composite sample can be found. Be sure to understand the difference between the two intervals.

Finally, an inverse prediction is sometimes of interest, i.e. in what year will the TEQ be equal to some particular value? For example, health regulations may require that the TEQ of the composite sample be below 10 units.

Rather surprisingly, SAS does NOT have a function for inverse regression. A few people have written “one-off” functions, but the code needs to be checked carefully. For this class, I would plot the confidence and prediction intervals, and then work ‘backward’ from the target Y value to see where it hits the confidence limits and then drop down to the X axis.

We compute the predicted values for a wide range of X values, get the plot of the two intervals, and then follow the example above. Note the use of the refline statement in Proc SGplot to get the horizontal reference line at Y = 2.302 = log(10):

proc sgplot data=Predictions;
   title2 'Demonstrating how to do inverse predictions at logTEQ=2.302';
   band x=year lower=lowerCL upper=upperCL;
   band x=year lower=lowerCLmean upper=upperCLmean / fillattrs=(color=red);
   series y=PredictedValue x=Year;
   scatter y=logTEQ x=Year / markerattrs=(symbol=circlefilled);
   refline 2.302 / axis=y;
   yaxis label='logTEQ' offsetmin=.05 offsetmax=.05;
   xaxis label='Year' offsetmin=.05 offsetmax=.05;
run;

The predicted year is found by solving

2.302 = 218.9 − .11(year)

and gives an estimated year of 2012.7. A confidence interval for the time when the mean log(TEQ) is equal to log(10) is somewhere between 2007 and 2026!
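(For the record, the arithmetic using the unrounded estimates is year = (218.91364 − 2.302)/0.10762 ≈ 2012.7.)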

The application of regression to non-linear problems is fairly straightforward after the transformation is made. The most error-prone step of the process is the interpretation of the estimates on the TRANSFORMED scale and how these relate to the untransformed scale.

14.4.13 Example: Weight-length relationships - transformation

A common technique in fisheries management is to investigate the relationship between the weight and length of fish.


This is expected to be a non-linear relationship because as fish get longer, they also get wider and thicker. If a fish grew “equally” in all directions, then the weight of a fish should be proportional to length^3 (why?). However, fish do not grow equally in all directions, i.e. a doubling of length is not necessarily associated with a doubling of width or thickness. The pattern of association of weight with length may reveal information on how fish grow.

The traditional model between weight and length is often postulated to be of the form:

weight = a × length^b

where a and b are unknown constants to be estimated from data.

If the estimated value of b is much less than 3, this indicates that as fish get longer, they do not get wider and thicker at the same rates.

How are such models fit? If logarithms are taken on each side, the above equation is transformed to:

log(weight) = log(a) + b × log(length)

or

log(weight) = β0 + β1 × log(length)

where the usual linear relationship on the log-scale is now apparent.

The following example was provided by Randy Zemlak of the British Columbia Ministry of Water, Land, and Air Protection.


Length (mm)   Weight (g)
    34            585
    46           1941
    33            462
    36            511
    32            428
    33            396
    34            527
    34            485
    33            453
    44           1426
    35            488
    34            511
    32            403
    31            379
    30            319
    33            483
    36            600
    35            532
    29            326
    34            507
    32            414
    33            432
    33            462
    35            566
    34            454
    35            600
    29            336
    31            451
    33            474
    32            480
    35            474
    30            330
    30            376
    34            523
    31            353
    32            412
    32            407

The data is available in the wtlen.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into SAS in the usual fashion:


data wtlen;
   infile 'wtlen.csv' dlm=',' dsd missover firstobs=2;
   input length weight;
run;

Note that both variables are numeric (SAS doesn't have the concept of scale of variables). Extra rows for future predictions (lengths with the weight left missing) are appended to the data table below. The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset.

Part of the raw data are shown below:

Obs  length  weight  log_length  log_weight
  5   28.9     336      3.4         5.8
  7   29.3     326      3.4         5.8
  8   29.5     319      3.4         5.8
 10   30.2     330      3.4         5.8
 11   30.5     376      3.4         5.9
 12   31.0     353      3.4         5.9
 14   31.3     379      3.4         5.9
 15   31.4     451      3.4         6.1
 16   31.5     403      3.4         6.0
 17   31.7     414      3.5         6.0

We create some additional plotting positions, and the log(weight) and log(length) values are added to the dataset. Note that the log() function in SAS is the natural logarithm (base e) function.

data wtlen2;   /* create some plotting points */
   do length = 25 to 50 by 1;
      weight = .;
      output;
   end;
run;

data wtlen;   /* append the plotting points and compute log() transformation */
   set wtlen wtlen2;
   log_length = log(length);
   log_weight = log(weight);
   format log_length log_weight 7.1;
run;

Start by plotting the data using Proc SGplot and add a lowess fit to the points:

proc sgplot data=wtlen;
   title2 'Preliminary data plot';
   scatter y=weight x=length / markerattrs=(symbol=circlefilled);
   loess y=weight x=length;
   yaxis label='Weight' offsetmin=.05 offsetmax=.05;
   xaxis label='Length' offsetmin=.05 offsetmax=.05;
run;

The fit appears to be non-linear, but this may simply be an artifact of the influence of the two largest fish. The plot appears to be linear in the area of 30-35 mm in length. If you look at the plot carefully, the variance appears to be increasing with the length, with the spread noticeably wider at 35 mm than at 30 mm.

We will fit a model on the log-log scale. Note that there is some confusion in scientific papers about a “log” transform. In general, a log-transformation refers to taking natural logarithms (base e), and NOT the base-10 logarithm. This mathematical convention is often broken in scientific papers where authors use ln to represent natural logarithms, etc. It does not affect the analysis in any way which transformation is used, other than that values on the natural log scale are approximately 2.3 times larger than values on the log10 scale. Of course, the appropriate back transformation is required.

We use Proc Reg to fit the regression model:

ods graphics on;
proc reg data=wtlen plots=all;
   title2 'fit the model using all of the data';
   model log_weight = log_length / all;
   ods output OutputStatistics=Predictions;
   ods output ParameterEstimates=Estimates;
run;
ods graphics off;

The model statement is what tells SAS that the response variable is log_weight because it appears to the left of the equal sign, and that the predictor variable is log_length because it appears to the right of the equal sign. The all option requests much output, and part of the output will be discussed below. The ods statement requests that some output statistics be placed into a dataset called Predictions.

Here is the fit:

Residual plots (and other diagnostic plots to assess the fit of the model) are automatically produced by the ODS Graphics option prior to invoking Proc Reg:


The fit is not very satisfactory. The curve doesn't seem to fit the two “outlier” points very well, and at smaller lengths the curve seems to be underfitting the weight. The residual plot shows the two definite outliers and also shows some evidence of a poor fit, with positive residuals at lengths around 30 mm and negative residuals around 35 mm.

The fit was repeated dropping the two largest fish. We first delete all large fish:

data wtlen3;   /* drop the two largest fish */
   set wtlen;
   if length > 40 then delete;
run;

Part of the output includes the coefficients and their standard errors:


Variable      DF   Parameter   Standard   t Value   Pr > |t|   Lower 95%      Upper 95%
                   Estimate    Error                           CL Parameter   CL Parameter
Intercept      1   -3.55305    0.74431    -4.77     <.0001     -5.06735       -2.03875
log_length     1    2.76722    0.21319    12.98     <.0001      2.33348        3.20095

Here is a plot of the fitted line:

Now the fit appears to be much better. The relationship (on the log-scale) is linear, and the residual plot looks OK.

The estimated power coefficient is 2.76 (SE .21). The 95% confidence interval for the slope (the power coefficient) is read from the previous output.

The 95% confidence interval for the power coefficient is from (2.33 to 3.20), which includes the value of 3 – hence the growth could be isometric, i.e. a fish that is twice the length is also twice the width and twice the thickness. Of course, with this small sample size, it is difficult to say much more.
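Following the earlier recipe for testing a slope against a value other than zero, a formal test of H: β1 = 3 would use T = (2.76722 − 3)/0.21319 ≈ −1.09, i.e. the estimate is only about one standard error from 3, consistent with the confidence interval including 3.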

The actual model in the population is:

log(weight) = β0 + β1 × log(length) + ε

This implies that the “errors” in growth act on the LOG-scale. This seems reasonable.

For example, a regression on the original scale would make the assumption that a 20 g error in predicting weight is equally severe for a fish that (on average) weighs 200 or 400 grams, even though the “error” is 20/200 = 10% of the predicted value in the first case, while only 5% of the predicted value in the second case. On the log-scale, it is implicitly assumed that the “errors” operate on the log-scale, i.e. a 10% error in a 200 g fish is equally severe as a 10% error in a 400 g fish, even though the absolute errors of 20 g and 40 g are quite different.

Another assumption of regression analysis is that the population error variance is constant over the entire regression line, but the original plot shows that the standard deviation is increasing with length. On the log-scale, the standard deviation is roughly constant over the entire regression line.

A non-linear fit

It is also possible to do a direct non-linear least-squares fit. Here the objective is to find values of β0 and β1 that minimize

∑ (weight − β0 × length^β1)²

directly.

The Proc NLin procedure can be used to fit the non-linear least-squares line directly:

ods graphics on;
proc nlin data=wtlen3 plots=all;
   title2 'non-linear least squares';
   parameters b0=.03 b1=3;
   bounds b0 > 0;
   model weight = b0 * length**b1;
   ods output ParameterEstimates=NlinEstimates;
run;
ods graphics off;

Part of the output includes the coefficients and their standard errors:

Parameter   Estimate   Approx      Alpha   Lower     Upper    t Value   Approx
                       Std Error                                        Pr > |t|
b0          0.0323     0.0275      0.05    -0.0237   0.0883    1.17     0.2485
b1          2.7332     0.2427      0.05     2.2395   3.2270   11.26     <.0001

The estimated power coefficient from the non-linear fit is 2.73 with a standard error of .24. The estimated coefficient b0 is 0.0323 with an estimated standard error of .027. Both estimates are similar to those from the previous fit.

Which is the better method to fit this data? The non-linear fit assumes that errors are additive on the original scale. The consequences of this were discussed earlier, i.e. a 20 g error is equally serious for a 200 g fish as for a 400 g fish.


For this problem, both the non-linear fit and the fit on the log-scale gave essentially the same results, but this will not always be true. In particular, look at the large difference in estimates when the models were fit to all of the fish. The non-linear fit was more influenced by the two large fish - this is a consequence of minimizing the square of the absolute deviation (as opposed to the relative deviation) between the observed and predicted weights.

14.4.14 Power/Sample Size

A power analysis and sample size determination can also be done for regression problems, but it is more complicated than power analyses for simple experimental designs. This is for a number of reasons:

• The power depends not only on the total number of points collected, but also on the actual distribution of the X values.

For example, the power to detect a trend is different if the X values are evenly distributed over the range of predictors than if the X values are clustered at the ends of the range of the predictors. A regression analysis has the most power to detect a trend if half the observations are collected at a small X value and half of the observations are collected at a large X value. However, this type of data gives no information on the linearity (or lack thereof) between the two X values and is not recommended in practice. A less powerful design would have a range of X values collected, but this is often of more interest as lack-of-fit and non-linearity can be detected.

• Data collected for regression analysis is often opportunistic, with little chance of choosing the X values. Unless you have some prior information on the distribution of the X values, it is difficult to determine the power.

• The formulae are clumsy to compute by hand, and most power packages tend not to have modules for power analysis of regression. However, modern software should be able to deal with this issue.

For a power analysis, the information required is similar to that requested for ANOVA designs:

• α level. As in power analyses for ANOVA, this is traditionally set to α = 0.05.

• effect size. In ANOVA, power deals with detection of differences among means. In regressionanalysis, power deals with detection of slopes that are different from zero. Hence, the effect sizeis measured by the slope of the line, i.e. the rate of change in the mean of Y per unit change in X .

• sample size. Recall in ANOVA with more than two groups that the power depended not only on the sample size per group, but also on how the means are separated. In regression analysis, the power will depend upon the number of observations taken at each value of X and the spread of the X values. For example, the greatest power is obtained when half the sample is taken at the two extremes of the X space - but at a cost of not being able to detect non-linearity. It turns out that a simple summary of the distribution of the X values (the standard deviation of the X values) is all that is needed.

• standard deviation. As in ANOVA, the power will depend upon the variation of the individual objects around the regression line.

JMP (V.10) does not currently contain a module to do power analysis for regression. R also does not include a power computation module for regression analysis, but I have written a small function that is available in the SampleProgramLibrary. SAS (Version 9+) includes a power analysis module (GLMPOWER) for the power analysis. Russ Lenth also has a JAVA applet that can be used for determining power in a regression context http://homepage.stat.uiowa.edu/~rlenth/Power/.
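A minimal sketch of such a power computation in R (this is not the SampleProgramLibrary function itself; it assumes the standard noncentral-F formulation of the test of zero slope, which is the same approach as Stroup's method discussed below):

# Power of the test of slope = 0 in simple linear regression.
# x = planned X values; slope = effect size; sd = residual standard deviation.
slr.power <- function(x, slope, sd, alpha = 0.05) {
   n   <- length(x)
   ssx <- sum((x - mean(x))^2)     # spread of the X values drives the power
   ncp <- slope^2 * ssx / sd^2     # noncentrality parameter of the F test
   1 - pf(qf(1 - alpha, 1, n - 2), 1, n - 2, ncp)
}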


The problem simplifies considerably when the X variable is time, and interest lies in detecting a trend (increasing or decreasing) over time. A linear regression of the quantity of interest against time is commonly used to evaluate such a trend. For many monitoring designs, observations are taken on a yearly basis, so the question reduces to the number of years of monitoring required. The analysis of trend data and power/sample size computations is treated in a following chapter.

Let us return to the example of the yield of tomatoes vs. the amount of fertilizer. We wish to design an experiment to detect a slope of 1 (the effect size). From past data (on a different field), the standard deviation of values about the regression line is about 4 units (the standard deviation of the residuals).

We have enough money to plant 12 plots with levels of fertilizer ranging from 10 to 20. How does the power compare under different configurations of fertilizer levels? More specifically, how does the power compare between using fertilizer levels (10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20), i.e. an even distribution of levels of fertilizer, and (10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20), i.e. doing two replicates at each level of fertilizer but doing fewer distinct levels?

SAS can compute the power for regression analysis using the GlmPower procedure. The basic idea is that you first generate a fake dataset with the expected MEAN response (in this case the estimated points on the regression line). Then this fake dataset is 'analyzed' to determine the power using the analytical methods of Stroup (1999).¹¹

We first generate the MEAN response and put the values into a dataset.

%let beta1 = 1;
%let beta0 = 10;

/* Generate the mean at each X value */
data means;
   do x=10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20;
      mu = &beta0 + &beta1*x;
      output;
   end;
run;

This is then passed to the GlmPower procedure:

proc glmpower data=means;
   title2 'Example 2 Power';
   model mu = x;
   power
      stddev = 4
      alpha  = 0.05
      ntotal = 12   /* This needs to match the number of data points */
      power  = .;
   ods output Output=glmpower_output2;
run;

which computes the power. Most of the entries are self-explanatory.

¹¹ Stroup, W. W. (1999). Mixed model procedures to assess power, precision, and sample size in the design of experiments. Pages 15-24 in Proc. Biopharmaceutical Section. Am. Stat. Assoc., Baltimore, MD.


The output is:

Example 1: power values

Alpha  Std Dev  N Total  Power
0.05   4        12       0.65749

The power is computed to be 0.657.

The procedure can be re-run with the revised levels of fertilizer and the output is:

Example 2: power values

Alpha  Std Dev  N Total  Power
0.05   4        12       0.75991

The power has increased to 0.760.

This increase in power is intuitively correct – power increases in regression as the number of data points at each end of the X range increases, all else being equal.
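The two GLMPOWER values can be cross-checked with the slr.power() sketch given earlier (an illustrative check, not part of the original SAS run):

x.even  <- c(10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20)
x.pairs <- c(10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20)

slr.power(x.even,  slope = 1, sd = 4)   # approximately 0.657
slr.power(x.pairs, slope = 1, sd = 4)   # approximately 0.760

The paired design has a larger spread of X values (sum of squared deviations 140 vs. 110), which is exactly why its power is higher.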

The power to detect a range of slopes using the last set of X values was also computed (see the R and SAS code) and a plot of the power vs. the size of the slope can be made.


The power to detect smaller slopes is limited.

Russ Lenth's power modules¹² can be used to compute the power for these two cases. Here the modules require the standard deviation of the X values, but this needs to be computed using the n divisor rather than the n − 1 divisor, i.e.

\[ SD_{Lenth}(X) = \sqrt{\frac{\sum (X - \bar{X})^2}{n}} \]

For the two sets of fertilizer values the SDs are 3.02765 and 3.41565 respectively.
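A quick check of these values with a small R helper (note the n divisor):

sd.n <- function(x) sqrt(mean((x - mean(x))^2))   # n divisor, not n-1
sd.n(c(10, 11, 12, 13, 14, 15, 15, 16, 17, 18, 19, 20))   # 3.02765
sd.n(c(10, 10, 12, 12, 14, 14, 16, 16, 18, 18, 20, 20))   # 3.41565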

The output from Lenth's power analysis is not reproduced here.

¹² http://homepage.stat.uiowa.edu/~rlenth/Power/


The powers computed by Lenth's applet match the earlier results (as they must).

14.4.15 The perils of R²

R² is a “popular” measure of the fit of a regression model and is often quoted in research papers as evidence of a good fit etc. However, there are several fundamental problems of R² which, in my opinion, make it less desirable. A nice summary of these issues is presented in Draper and Smith (1998, Applied Regression Analysis, p. 245-246).

Before exploring this, how is R² computed and how is it interpreted?

While I haven't discussed the decomposition of the Error SS into Lack-of-Fit and Pure error, this can be done when there are replicated X values. A prototype ANOVA table would look something like:


Source           df            SS
Regression       p − 1         A
Lack-of-fit      n − p − ne    B
Pure error       ne            C
Corrected Total  n − 1         D

where there are n observations and the regression model has p parameters in total (the intercept plus p − 1 X variables).

R² is computed as

\[ R^2 = \frac{SS(\text{regression})}{SS(\text{total})} = \frac{A}{D} = 1 - \frac{B + C}{D} \]

where SS(·) represents the sum of squares for that term in the ANOVA table. At this point, rerun the three examples presented earlier to find the value of R².

For example, in the fertilizer example, the ANOVA table is:

Analysis of Variance
Source    DF  Sum of Squares  Mean Square  F Ratio  p-value
Model      1       225.18035      225.180  69.8800  <.0001
Error      9        29.00147        3.222
C. Total  10       254.18182

Here R² = 225.18035/254.18182 = .885 = 88.5%.

R² is interpreted as the proportion of variance in Y accounted for by the regression. In this case, almost 90% of the variation in Y is accounted for by the regression. The value of R² must range between 0 and 1.
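To see the arithmetic in action, a small R sketch with simulated data (illustrative only) verifies that R² is exactly SS(regression)/SS(total):

set.seed(1)
x <- seq(10, 20, length.out = 11)
y <- 10 + x + rnorm(11, sd = 4)
fit <- lm(y ~ x)

ss <- anova(fit)[, "Sum Sq"]   # SS(regression) and SS(error)
ss[1] / sum(ss)                # A / D from the prototype table
summary(fit)$r.squared         # the same number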

It is tempting to think that R² must be a measure of the “goodness of fit”. In a technical sense it is, but R² is not a very good measure of fit - other characteristics of the regression equation, in particular the estimate of the slope and its standard error, are much more informative.

Here are some reasons why I decline to use R² very much:

• Overfitting. If there are no replicate X points, then ne = 0, C = 0, and R² = 1 − B/D. B has n − p degrees of freedom. As more and more X variables are added to the model, n − p and B become smaller, and R² must increase even if the additional variables are useless (see the short sketch after this list).

• Outliers distort. Outliers produce Y values that are extreme relative to the fit. This can inflate the value of C (if the outlier occurs among the set of replicate X values), or B (if the outlier occurs at a singleton X value). In either case, they reduce R², so R² is not resistant to outliers.

• People misinterpret high R² as implying the regression line is useful. It is tempting to believe that a higher value of R² implies that a regression line is more useful. But consider the pair of plots below:


The graph on the left has a very high R², but the change in Y as X varies is negligible. The graph on the right has a lower R², but the average change in Y per unit change in X is considerable. R² measures the “tightness” of the points about the line – the higher value of R² on the left indicates that the points fit the line very well. The value of R² does NOT measure how much actual change occurs.

• Upper bound is not always 1. People often assume that a low R² implies a poor fitting line. If you have replicate X values, then C > 0. The maximum value of R² for this problem can be much less than 100% - it is mathematically impossible for R² to reach 100% with replicated X values. In the extreme case where the model “fits perfectly” (i.e. the lack-of-fit term is zero), R² can never exceed 1 − C/D.

• No-intercept models. If there is no intercept, then $D = \sum (Y_i - \bar{Y})^2$ does not exist, and R² is not really defined.

• R² gives no additional information. In actual fact, R² is a 1-1 transformation of the slope and its standard error, as is the p-value. So there is no new information in R².

• R² is not useful for non-linear fits. R² is really only useful for linear fits with the estimated regression line free to have a non-zero intercept. The reason is that R² is really a comparison between two types of models. For example, refer back to the length-weight relationship examined earlier.

In the linear fit case, the two models being compared are

log(weight) = log(b0) + error
vs.
log(weight) = log(b0) + b1 * log(length) + error

and so R² is a measure of the improvement with the regression line. [In actual fact, it is a 1-1 transform of the test that β₁ = 0, so why not use that statistic directly?] In the non-linear fit case, the two models being compared are:

weight = 0 + error
vs.
weight = b0 * length**b1 + error

The model weight = 0 is silly and so R² is silly.

Hence, the R² values reported are really all for linear fits - it is just that sometimes the actual linear fit is hidden.

• Not defined in generalized least squares. There are more complex fits that don't assume equal variance around the regression line. In these cases, R² is again not defined.


• Cannot be used with different transformations of Y. R² cannot be used to compare models that are fit to different transformations of the Y variable. For example, many people try fitting a model to Y and to log(Y) and choose the model with the highest R². This is not appropriate as the D terms are no longer comparable between the two models.

• Cannot be used for non-nested models. R² cannot be used to compare models with different sets of X variables unless one model is nested within another model (i.e. all of the X variables in the smaller model also appear in the larger model). So using R² to compare a model with X1, X3, and X5 to a model with X1, X2, and X4 is not appropriate as these two models are not nested. In these cases, AIC should be used to select among models.
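As promised in the Overfitting bullet above, a small R sketch (simulated data, illustrative only) shows R² creeping upward as pure-noise predictors are added:

set.seed(123)
n  <- 20
x1 <- runif(n, 0, 10)
y  <- 10 + x1 + rnorm(n, sd = 4)               # only x1 is real
junk1 <- rnorm(n); junk2 <- rnorm(n)           # pure noise predictors

summary(lm(y ~ x1))$r.squared                  # base model
summary(lm(y ~ x1 + junk1))$r.squared          # never decreases
summary(lm(y ~ x1 + junk1 + junk2))$r.squared  # ... and keeps creeping up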

14.5 A no-intercept model: Fulton’s Condition Factor K

It is possible to fit a regression line that has an intercept of 0, i.e., goes through the origin. Most computer packages have an option to suppress the fitting of the intercept.

The biggest 'problem' lies in interpreting some of the output – some of the statistics produced are misleading for these models. As this varies from package to package, please seek advice when fitting such models.

The following is an example of where such a model may be sensible.

Not all fish within a lake are identical. How can a single summary measure be developed to represent the condition of fish within a lake?

In general, the relationship between fish weight and length follows a power law:

\[ W = aL^b \]

where W is the observed weight; L is the observed length, and a and b are coefficients relating length to weight. The usual assumption is that heavier fish of a given length are in better condition than lighter fish. Condition indices are a popular summary measure of the condition of the population.

There are at least eight different measures of condition which can be found by a simple literature search. Conne (1989) raises some important questions about the use of a single index to represent the two-dimensional weight-length relationship.

One common measure is Fulton's¹³ K:

\[ K = \frac{\text{Weight}}{(\text{Length}/100)^3} \]

This index makes an implicit assumption of isometric growth, i.e. as the fish grows, its body proportions and specific gravity do not change.
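For example, using the first fish in the data listing below (length 360 mm, weight 686 g), K = 686/(360/100)³ = 686/46.656 ≈ 14.70.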

How can K be computed from a sample of fish, and how can K be compared among different subsets of fish from the same lake or across lakes?

The B.C. Ministry of Environment takes regular samples of rainbow trout using a floating and a sinking net. For each fish captured, the weight (g), length (mm), sex, and maturity of the fish were recorded.

¹³ There is some doubt about the first authorship of this condition factor. See Nash, R. D. M., Valencia, A. H., and Geffen, A. J. (2005). The Origin of Fulton's Condition Factor – Setting the Record Straight. Fisheries, 31, 236-238.


The data is available in the rainbow-condition.csv file in the Sample Program Library at http://www.stat.sfu.ca/~cschwarz/Stat-Ecology-Datasets.

The data are imported into SAS in the usual fashion:

data fish;
   infile 'rainbow-condition.csv' dlm=',' dsd missover firstobs=2;
   input net_type $ fish length weight species $ sex $ maturity $;
   lenmod = (length/100)**3;
   label lenmod = '(length/100)**3';
   condition_factor = weight / lenmod;
run;

The ordering of the rows is NOT important; however, it is often easier to find individual data points if the data is sorted by the X value and the rows for future predictions are placed at the end of the dataset. Part of the raw data are shown below:

Obs  net_type  fish  length  weight  species  sex  maturity  (length/100)**3  condition_factor  sexL
  1  Sinking      1     360     686  RB       F    MATURING          46.6560           14.7034  F.46.6560
  2  Sinking      2     385     758  RB       F    MATURING          57.0666           13.2827  F.57.0666
  3  Sinking      3     295     284  RB       M    MATURING          25.6724           11.0625  M.25.6724
  4  Sinking      4     285     292  RB       F    MATURING          23.1491           12.6139  F.23.1491
  5  Sinking      5     380     756  RB       F    MATURING          54.8720           13.7775  F.54.8720
  6  Sinking      6     350     598  RB       F    MATURING          42.8750           13.9475  F.42.8750
  7  Sinking      7     320     438  RB       M    MATURING          32.7680           13.3667  M.32.7680
  8  Sinking      8     250     250  RB       F    MATURING          15.6250           16.0000  F.15.6250
  9  Sinking      9     250     236  RB       M    MATURING          15.6250           15.1040  M.15.6250
 10  Sinking     10     360     600  RB       F    MATURING          46.6560           12.8601  F.46.6560

K was computed for each individual fish, and a histogram of the individual K values was examined (the histogram is not reproduced here).



There is a range of condition numbers among the individual fish with an average (among the fish caught) K of about 13.6.

Deriving a single summary measure to represent the entire population of fish in the lake depends heavily on the sampling design used to capture fish.

Some care must be taken to ensure that the fish collected are a simple random sample from the fish in the population. If a net of a single mesh size is used, then it has a selectivity curve and is typically more selective for fish of a certain size. In this experiment, several different mesh sizes were used to try and ensure that fish of all sizes have an equal chance of being selected.

As well, regression methods have an advantage in that a simple random sample from the population is no longer required to estimate the regression coefficients. As an analogy, suppose you are interested in the relationship between yield of plants and soil fertility. Such a study could be conducted by finding a random sample of soil plots, but this may lead to many plots with similar fertility and only a few plots with fertility at the tails of the relationship. An alternate scheme is to deliberately seek out soil plots with a range of fertilities or to purposely modify the fertility of soil plots by adding fertilizer, and then fit a regression curve to these selected data points.

Fulton’s index is often re-expressed for regression purposes as:

\[ W = K \left( \frac{L}{100} \right)^3 \]

This looks like a simple regression between W and (L/100)³ but with no intercept.

A plot of these two variables:


proc sgplot data=fish;
   title2 'Preliminary data plot';
   scatter y=weight x=lenmod / markerattrs=(symbol=circlefilled);
   yaxis label='Weight' offsetmin=.05 offsetmax=.05;
   xaxis label='(Length/100)**3' offsetmin=.05 offsetmax=.05;
run;

The plot shows a tight relationship among fish but with possible increasing variance with length.

There is some debate about the proper way to estimate the regression coefficient K. Classical regression methods (least squares) implicitly imply that all of the “error” in the regression is in the vertical direction, i.e. the analysis conditions on the observed lengths. However, in the structural relationship between weight and length, there is likely “error” in both variables. This leads to the error-in-variables problem in regression, which has a long history. Fortunately, the relationship between the two variables is often sufficiently tight that it really doesn't matter which method is used to find the estimates.

We use Proc Reg to fit the regression model:

proc reg data=fish plots=all;
   title2 'fit the model with NO intercept';
   model weight = lenmod / noint all;
   ods output OutputStatistics  = Predictions;
   ods output ParameterEstimates = Estimates;
run;


Part of the output includes the coefficients and their standard errors:

Variable  DF  Parameter Estimate  Standard Error  t Value  Pr > |t|  Lower 95% CL  Upper 95% CL
lenmod     1            13.72947         0.09878   138.98    <.0001      13.53391      13.92502

Note that R² really doesn't make sense in cases where the regression is forced through the origin because the null model to which it is being compared is the line Y = 0, which is silly.¹⁴

The estimated value of K is 13.72 (SE 0.099).
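For comparison, a roughly equivalent no-intercept fit in R might look like the following sketch (it assumes the CSV column headers match the SAS variable names used above):

fish <- read.csv("rainbow-condition.csv")
fish$lenmod <- (fish$length / 100)^3

fit.k <- lm(weight ~ lenmod - 1, data = fish)   # "-1" suppresses the intercept
summary(fit.k)$coefficients                     # estimate of K (near 13.73) and its SE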

The residual plot (not reproduced here)

shows clear evidence of increasing variation with the length variable. This usually implies that a weighted regression is needed, with weights proportional to 1/length². In this case, such a regression gives essentially the same estimate of the condition factor (K = 13.67, SE = 0.11).

¹⁴ Consult any of the standard references on regression, such as Draper and Smith, for more details.
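A corresponding weighted fit in R (again a sketch, continuing from the fish data frame above):

fit.kw <- lm(weight ~ lenmod - 1, data = fish, weights = 1 / length^2)
summary(fit.kw)$coefficients   # K about 13.67 with SE about 0.11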

Comparing condition factors

This dataset has a number of sub-groups – do all of the subgroups have the same condition factor? For example, suppose we wish to compare the K value for immature and mature fish. This is covered in more detail in the chapter on the Analysis of Covariance (ANCOVA).

14.6 Frequently Asked Questions - FAQ

14.6.1 Do I need a random sample; power analysis

A student wrote:

I am studying the hydraulic geometry of small, steep streams in Southwest BC (abstract attached). I would like to define a regional hydraulic geometry for a fairly small hydrologic/geologic homogeneous area in the coast mountains close to SFU. Hydraulic geometry is the study of how the primary flow variables (width, depth and velocity) change with discharge in a stream. Typically, a straight regression line is fitted to data plotted on a log-log plot. The equation is of the form w = aQ^b where a is the intercept, b is the slope, w is the water surface width, and Q is the stream discharge.

I am struggling with the last part of my research proposal, which is how do I select (randomly) my field sites and how many sites are required. My supervisor suggests that I select stream segments for study based on a-priori knowledge of my field area and select streams from across it. My argument is that to define a regionally applicable relationship (not just one that characterizes my chosen sites) I must randomly select the sites.

I think that GIS will help me select my sites but have the usual questions of how many sites are required to give me a certain level of confidence and whether or not I'm on the right track. As well, the primary controlling variables that I am looking at are discharge and stream slope. I will be plotting the flow variables against discharge directly but will deal with slope by breaking my stream segments into slope classes - I guess that the null hypothesis would be that there is no difference in the exponents and intercepts between slope classes.

You are both correct!

If you were doing a simple survey, then you are correct in that a random sample from the entire population must be selected - you can't deliberately choose streams.

However, because you are interested in a regression approach, the assumption can be relaxed a bit. You can deliberately choose values of the X variables, but must randomly select from streams with similar X values.

As an analogy, suppose you wanted to estimate the average length of male adult arms. You would need a random sample from the entire population. However, suppose that you were interested in the relationship between body height (X) and arm length (Y). You could deliberately choose which X values to measure - indeed it would be a good idea to get a good contrast among the X values, i.e. find people who are 4 ft tall, 5 ft tall, 6 ft tall, 7 ft tall and measure their height and arm length and then fit


the regression curve. However, at each height level, you must now choose randomly among those people that meet that criterion. Hence you could deliberately choose to have 1/4 of people who are 4 ft tall, 1/4 who are 5 feet tall, 1/4 who are 6 feet tall, 1/4 who are 7 feet tall, which is quite different from the proportions in the population, but at each height level you must choose people randomly, i.e. don't always choose skinny 4 ft people and over-weight 7 ft people.

Now sample size is a bit more difficult as the required sample size depends both on the number of streams selected and how they are scattered along the X axis. For example, the highest power occurs when observations are evenly divided between the very smallest X and very largest X value. However, without intermediate points, you can't assess linearity very well. So you will want points scattered around the range of X values.

If you have some preliminary data, a power/sample size analysis can be done using JMP, SAS, and other packages. If you do a google search for power analysis regression, there are several direct links to examples. Refer to the earlier section of the notes.

14.7 Summary of simple linear regression

Objective: Estimate the slope (the relationship) between two continuous variables.

Experimental Design

• A Y is observed for a given X. The X's can be randomly chosen from a population or set as part of an experiment.

• No pairing, blocking, or stratification.

• Ensure that the Experimental Unit (EU) is the same as the observational unit (OU) to avoid pseudoreplication. This is particularly true in trend analysis where you should generally have one number per year.

Data structure
Two columns in the dataset.

• X column for the predictor. This should be numeric.

• Y column for the response. This should be numeric.

The columns can be in any order. The rows can be in any order. The dataset is read in the usual ways.

Missing values / Unbalance
If there are missing values, ensure that they are Missing Completely at Random (MCAR). Any row with a missing value in Y or X will not be used in the regression.

Preliminary Plot
A scatterplot of Y vs. X. Check for outliers. Check for leverage points. Check for non-linear relationships.

Table 1
Compute a table of sample size, mean, and SD for each variable.


Analysis
Analysis code:

• JMP: Analyze->Fit Y-by-X. Choose Fit Line from the drop down menu. Or Analyze->Fit Model.

• SAS: Proc Reg data=blah; Model Y = X; run;

• R: lm(Y ~ X, data=blah)

Followup
Make predictions at new values of X. Be careful to distinguish between confidence intervals for the MEAN response and prediction intervals for the INDIVIDUAL response.

• R:

– newX <- data.frame(xvar=c(x1, x2, x3, ...))

– pred.avg <- predict(fit, newdata=newX, interval="confidence") # ci for mean

– pred.indiv <- predict(fit, newdata=newX, interval="prediction") # pi for individual values

Don't forget model assessment. Look at residual and other plots. If the X variable is time, look at autocorrelation using the Durbin-Watson statistic or similar diagnostic plots of lagged residuals.
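For example, a sketch using the dwtest() function from the lmtest package (Y, X, and blah are the placeholder names from the analysis code above):

library(lmtest)                # install.packages("lmtest") if needed
fit <- lm(Y ~ X, data = blah)
dwtest(fit)                    # Durbin-Watson test; a small p-value suggests autocorrelation

r <- residuals(fit)            # companion diagnostic: lagged residual plot
plot(head(r, -1), tail(r, -1),
     xlab = "residual at time t", ylab = "residual at time t+1")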

What to write

• Materials and Methods: A simple linear regression was used to examine the relationship between Y and X.

• Results: We found evidence of a relationship between Y and X (p = xxxx).

• Results: We found no evidence of a relationship between Y and X (p = xxxx).

• Results: The estimated slope is xxx (SE xxxx) units of Y per unit of X .

Report the actual p-value and not just whether it was < .05 or > 0.05. Even if you fail to detect an effect, still report the effect sizes and their standard errors so that the reader can determine if the effect is close to 0 (with a small se) or if the se is so large that nothing was learned from the study.

Power Analysis
Hard. Contact me.

Comments
The terms independent and dependent variable are old-fashioned and should not be used. The preferred terms are predictor and response variable.

A very common error is to confuse the confidence interval for the mean and the prediction interval for an individual response and use the wrong interval.

Be VERY AFRAID of pseudoreplication, especially if you are doing a regression against time (a trend analysis). Consult the relevant chapter for more details.


R² is not as useful as you might expect. R² measures the closeness of points to the line but not the size of the slope. So you can have all of the points close to the line (high R²) but a slope that is close to 0. Be cautious in using R².

It is VERY rare to do a regression with no intercept. Contact me.

A log(Y) transform is very common. It is usually NOT necessary to transform any X variable.
