Lab 4 (More) Linear Regression


Troubleshooting Tips: Remember that lm() takes a formula, lm(y ~ x), whereas plot() in the same script takes the variables in the opposite order, plot(x, y). The order of the variables in the arguments of these two commands is critically important; mixing them up causes programming problems and misleading results.
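As a minimal sketch (the x and y vectors here are made up just to show the two orders):

x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8)
model <- lm(y ~ x)   # lm(): response first, in a formula
plot(x, y)           # plot(): x first, then y
abline(model)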

When you use abline(), curve(), or lines() to put lines/curves on an existing plot, and the command executes without error but the line/curve does not appear, the line is probably being drawn on a part of the graph you have not included in your display. You can see where the line is plotted by extending the x or y axis with the additional arguments xlim = c(a, b) and ylim = c(e, f), which make the x axis run from a to b and the y axis from e to f.
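For example, continuing the sketch above (the limits 0 to 10 and 0 to 50 are illustrative only):

plot(x, y, xlim = c(0, 10), ylim = c(0, 50))   # widen both axes
abline(model)                                  # now the line is visible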

Transforming Variables:

When we look at scatter plots to find regression model fits for y on x, we would like to use a linear fit, if appropriate. We can judge whether a linear fit is appropriate by looking at the scatter plot or, better, at the residual plot. We can also superimpose a curve of best fit (such as a lowess line) to help confirm that a linear fit is at least not inappropriate.

When a linear fit is not appropriate, we can transform x, y, or both variables in order to get a line to fit the resulting transformed (x, y) points. Transforming does not change the data, as long as we use transformation functions that are mathematically "one-to-one and onto". It is like converting a distance from feet to meters: we have not changed the data, just the numbers describing it.

The original handout includes a nice chart of ideas for how to transform variables, depending upon what you originally have; the transform numbers below refer to that chart.

To do transform #1, use the following formulas and commands:

y = β0 + β1√x + ε: lm(y ~ sqrt(x)) and plot(sqrt(x), y)
y = β0 + β1 ln(x) + ε: lm(y ~ log(x)) and plot(log(x), y)
y = β0 + β1(1/x) + ε: lm(y ~ I(1/x)) and plot(1/x, y) — note the I() wrapper; a bare 1/x means something different in R's formula syntax
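Here is a minimal sketch of the reciprocal version, assuming x and y are numeric vectors with no zeros in x:

recipmodel <- lm(y ~ I(1/x))
plot(1/x, y)          # x axis is the transformed variable
abline(recipmodel)    # the fit is straight on this transformed plot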

To do transform #2, use the following formula and commands:

y = β0 + β1x + β2x² + ε: lm(y ~ x + I(x^2)) and plot(x, y)
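A sketch of the quadratic fit, again with made-up x and y; since the fit is curved on the original scale, curve() (not abline()) draws it on the scatter plot:

quadmodel <- lm(y ~ x + I(x^2))
plot(x, y)
b <- coef(quadmodel)                          # b[1], b[2], b[3] estimate β0, β1, β2
curve(b[1] + b[2]*x + b[3]*x^2, add = TRUE)   # superimpose the fitted curve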

To do transform #4, use the following formula and commands:

y² = β0 + β1x + ε: lm(I(y^2) ~ x) and plot(x, y^2)


To do transform #5, use the following formulas and commands:

ln(y) = β0 + β1x + ε: lm(log(y) ~ x) and plot(x, log(y))
ln(y) = β0 + β1 ln(x) + ε: lm(log(y) ~ log(x)) and plot(log(x), log(y))
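A sketch of the log-log version, assuming x and y are positive; the residual plot is the usual check for leftover pattern:

loglogmodel <- lm(log(y) ~ log(x))
plot(log(x), log(y))
abline(loglogmodel)
plot(log(x), resid(loglogmodel))   # look for no pattern
abline(h = 0)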

There are other ways to write these formulas, for example by making a new variable consisting of log(x), sqrt(x), etc., calling it var1, and using var1 in the formulas of the lm() and plot() commands as needed.
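For example (var1 is just a made-up name):

var1 <- log(x)            # or sqrt(x), 1/x, etc.
varmodel <- lm(y ~ var1)
plot(var1, y)
abline(varmodel)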

Predicting after Transforming: Once you have transformed, and you need to predict y from x, you must "un-transform" the variables to predict properly. For example, if you have an appropriate model of the form ln(y) = β0 + β1 ln(x), and you want to predict a specific y1 from a given x1 in the domain of the x data, you would find β0 + β1 ln(x1) and then take the exponential of that result to get y1-hat. In R this would be something like exp(predict(model1)) after creating your log-log linear model (model1).
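A sketch, where x1 = 4 is a hypothetical new x value inside the range of the data; predict() computes β0 + β1 ln(x1), and exp() un-transforms it:

model1 <- lm(log(y) ~ log(x))
yhat1 <- exp(predict(model1, newdata = data.frame(x = 4)))
yhat1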


Correlation: Once you have found an appropriate linear fit, possibly thanks to a good transformation, one of the statistics you want to look at is the correlation value, r. The original handout shows a gallery of r values with their corresponding scatter plots. Remember that you should not even attempt to use r unless linearity has been deemed an appropriate fit.

You can get this from cor(x, y) or cor(y, x), since they will both be the same! You can also obtain the correlation by taking the square root of r², which is listed as part of the output of the summary() command. Just remember that if you use the root of r² from summary(), you will always get a positive value; r takes the same sign as the slope of the line of best fit.
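A sketch of both routes, assuming linmodel1 <- lm(y ~ x) has already been fit:

r1 <- cor(x, y)                            # carries the correct sign
r2 <- sqrt(summary(linmodel1)$r.squared)   # always positive
r2 <- r2 * sign(coef(linmodel1)[2])        # attach the slope's sign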

Regression Example: We want to coordinate with the material Jon is presently covering, so we will take a set of two-variable quantitative data and investigate modeling the resulting scatter plot with a linear model. If that doesn't model well, then we will try transforming a variable (or variables) to find a model that works well "enough". I have a data set of 2006 airfares between various cities, covering various distances. A view of the Airfares.csv file is shown below.
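The screenshot of the file is not reproduced here, but head() gives the same quick view of the first rows:

data1 <- read.csv("Airfares.csv")
head(data1)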

We want to plot the scatter plot of AIRFARE on DISTANCE (y on x), along with a line of best fit, and also a residual plot. The residual plot is useful for judging whether a line is a good model fit (a good fit shows no pattern in the residuals) and whether the variation in y is fairly equal across the x values (which lets us use the regression approach with some confidence). In short, if the residual plot shows no pattern to speak of and no "coning" of the points as we look across the range of x values, we can feel a bit better that a linear model is appropriate.

Let us use the code shown below.

par(mfrow = c(2, 1))
data1 <- read.csv("Airfares.csv")
distance <- data1[, 2]
airfare <- data1[, 3]
linmodel1 <- lm(airfare ~ distance)
plot(distance, airfare, main = "Scatter Plot of y on x")
abline(linmodel1)
plot(distance, resid(linmodel1), main = "Residual Plot of y on x")
abline(h = 0)
cor1 <- cor(distance, airfare)
cat("correlation coefficient is", cor1, "\n")
par(mfrow = c(1, 1))

Notice in the code that the first thing we do is use the par() command to split the graphing area into two halves, a top half and a bottom half, to show both the scatter plot of the data imported as data1 and the residual plot of that data. We do that with the mfrow= argument, giving a display of two rows and one column for the two graphs. After reading in the data, I name the variables distance and airfare, respectively, so that I can easily refer to them in my linear model (called linmodel1) and in my plot() commands, and so that the plots have good x and y labels. When I am done with my plots, I find a correlation value of 0.795, which isn't too bad. See output below.


After printing out the correlation coefficient value on the output, I reset the graphing window to a single graph (one row and one column) with par(mfrow=c(1,1)), returning it to the way I found it. I possibly should have given each plot a more specific main title, but I was sort of in a hurry.

Upon first inspection of the graphs, I am not overly impressed with my line of best fit. So let me do another scatter plot where I superimpose a "lowess" line (which will show the smooth-curve "bearing" of the points), using the code shown below.

scatter.smooth(distance, airfare, main = "Scatter Plot with lowess curve")
abline(linmodel1, lty = 2)

Below is the resulting lowess curve on the scatter plot, with the dashed line of best fit drawn by abline() with lty=2.


Our smooth curve seems to agree with the line of best fit until we hit two outlying points, at x = 600 miles and x = 750 miles. If these points were gone, I bet the fit would be appropriate and r would be better. Normally, one should not even think of removing outliers from a plot, especially if the motive is just to make the regression "better". However, if you have a plausible reason to do so, then removal is worth investigating. I suspect there was some sort of advertised special rate for the two cities (Chicago and St. Louis) at 600 to 800 miles, so I will remove the two points from the data set and rerun the regression.
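Rather than editing the CSV by hand, the rows could also be dropped in R; a sketch, assuming the file has a CITY column naming each row (the column name is a guess):

keep <- !(data1$CITY %in% c("Chicago", "St. Louis"))
data3 <- data1[keep, ]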

Below is the code used for analyzing the airfare data without the 2 outliers.

# airfares_revised analysis
# -------------------------
data3 <- read.csv("airfares_revised.csv")
data3
dist <- data3[, 2]
fare <- data3[, 3]
plot(dist, fare)
linmodel3 <- lm(fare ~ dist)
linmodel3
abline(linmodel3)
cat("2nd corr coeff is", cor(dist, fare), "\n")

The scatter plot with Chicago and St. Louis removed shows a somewhat better-looking linear fit and a better r.


Again, let us remember that we cannot just arbitrarily remove outliers because doing so makes our model "fit better". We must investigate [1] why Chicago and St. Louis (and, I might add, New York) seem different, and [2] whether it is appropriate to remove the points. Also, remember to note the removal in any study you present, for honesty in reporting.

We still do not have great linearity with our new points, but since the number of points left to plot is very small, I won't try a transformation here. I would want at least twice as many points in order to productively fit a transformed model.

Homework [1]: Use the dataset cloudseed.csv to generate a scatter plot with a line of best fit, a residual plot from the lm() model, the model output, a smooth line through the points, and the correlation coefficient value for the response (SEEDED) cloud rainfall (units unknown) vs. the explanatory (UNSEEDED) rainfall. Then answer the question "Is a linear fit appropriate here?" The model being investigated is y = β0 + β1x + ε. Give evidence for your answer, including any mention of outliers.

Homework [2]: Possibly transform #5 might give a better model, using the model y = β0 + β1 ln(x) + ε. So do a log(x) transform, making a model of SEEDED vs. log(UNSEEDED) (note: natural log); your model will be lm(y ~ log(x)). Include the same plots and summaries as above. Does this model improve on the linear model above or not? Explain.

Homework [3]: Now do a log(x) transform with a response of log(y) on the cloudseed data, so that your model will be lm(log(y) ~ log(x)). Using the same graphs and information as above, explain why this is the best model. In short, explain why ln(y) = β0 + β1 ln(x) + ε is the best model.

Homework [4]: The data file rmr.csv contains the resting metabolic rate (RMR) of various primates of various weights (WEIGHT), collected in 1997. Make a scatter plot of the original data (y ~ x), then of the y data logged (log(y) ~ x), then of both x and y logged (log(y) ~ log(x)). Determine which transformation, if any, gives the best model, and produce all of the graphs and information requested in the prior homework problems.
