Linear Regression
Al Nosedal, University of Toronto
Summer 2019
Al Nosedal University of Toronto Linear Regression Summer 2019 1 / 118
My momma always said: "Life was like a box of chocolates. You never know what you're gonna get."
Forrest Gump.
Regression Line
A regression line is a straight line that describes how a response variable y changes as an explanatory variable x changes. We often use a regression line to predict the value of y for a given value of x.
Review of Straight Lines
Suppose that y is a response variable (plotted on the vertical axis) and x is an explanatory variable (plotted on the horizontal axis). A straight line relating y to x has an equation of the form
y = a + bx
In this equation, b is the slope, the amount by which y changes when x increases by one unit. The number a is the intercept, the value of y when x = 0.
City mileage, highway mileage
We expect a car's highway gas mileage (mpg) to be related to its city gas mileage. Data for all 1040 vehicles in the government's 2010 Fuel Economy Guide give the regression line

highway mpg = 6.554 + (1.016 × city mpg)

for predicting highway mileage from city mileage.
a) What is the slope of this line? Say in words what the numerical value of the slope tells you.
b) What is the intercept? Explain why the value of the intercept is not statistically meaningful.
c) Find the predicted highway mileage for a car that gets 16 miles per gallon in the city. Do the same for a car with a city mileage of 28 mpg.
Solutions
a) The slope is 1.016. On average, highway mileage increases by 1.016 mpg for each additional 1 mpg in city mileage.
b) The intercept is 6.554 mpg. This is the highway mileage for a nonexistent car that gets 0 mpg in the city. Although this interpretation is valid, such a prediction would be invalid because it involves considerable extrapolation.
c) For a car that gets 16 mpg in the city, we predict highway mileage to be:
6.554 + (1.016)(16) = 22.81 mpg.
For a car that gets 28 mpg in the city, we predict highway mileage to be:
6.554 + (1.016)(28) = 35.002 mpg.
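These predictions are simple plug-ins, so they are easy to verify. A minimal Python sketch (the slides otherwise use R; the helper name `predict_highway` is ours, not from the slides):

```python
# Regression line from the 2010 Fuel Economy Guide example:
# highway mpg = 6.554 + 1.016 * (city mpg)
def predict_highway(city_mpg):
    return 6.554 + 1.016 * city_mpg

print(round(predict_highway(16), 3))  # 22.81
print(round(predict_highway(28), 3))  # 35.002
```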
What’s the line?
You use the same bar of soap to shower each morning. The bar weighs 80 grams when it is new. Its weight goes down by 5 grams per day on the average. What is the equation of the regression line for predicting weight from days of use?
Solution
The equation is:
weight = 80 − 5 × days
The intercept is 80 grams (the initial weight), and the slope is −5 grams/day.
Least-Squares Regression Line
The least-squares regression line of y on x is the line that makes the sum of the squares of the vertical distances of the data points from the line as small as possible.
Equation of the Least-Squares Regression Line
We have data on an explanatory variable x and a response variable y for n individuals. From the data, calculate the means x̄ and ȳ and the standard deviations Sx and Sy of the two variables, and their correlation r. The least-squares regression line is the line

ŷ = a + bx

with slope

b = r (Sy / Sx)

and intercept

a = ȳ − b x̄
Coral reefs
We have previously discussed a study in which scientists examined data on mean sea surface temperatures (in degrees Celsius) and mean coral growth (in millimeters per year) over a several-year period at locations in the Red Sea. Here are the data:
| Sea Surface Temperature | Growth |
|---|---|
| 29.68 | 2.63 |
| 29.87 | 2.58 |
| 30.16 | 2.60 |
| 30.22 | 2.48 |
| 30.48 | 2.26 |
| 30.65 | 2.38 |
| 30.90 | 2.26 |
a) Use your calculator to find the mean and standard deviation of both sea surface temperature x and growth y and the correlation r between x and y. Use these basic measures to find the equation of the least-squares line for predicting y from x.
b) Enter the data into your software or calculator and use the regression function to find the least-squares line. The result should agree with your work in a) up to roundoff error.
Solutions
a) x̄ = 30.28, Sx = 0.4296, ȳ = 2.4557, Sy = 0.1578, r = −0.8914. Hence,

b = r (Sy / Sx) = (−0.8914)(0.1578 / 0.4296) = −0.3274

a = ȳ − b x̄ = 2.4557 − (−0.3274)(30.28) = 12.3693
b) Slope = −0.3276 and intercept = 12.3758.
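The same numbers can be recovered from the printed data with the formulas b = r (Sy / Sx) and a = ȳ − b x̄. A self-contained sketch in Python (used here only as a cross-check; the course does this in R):

```python
from math import sqrt

temp   = [29.68, 29.87, 30.16, 30.22, 30.48, 30.65, 30.90]  # x: sea surface temperature
growth = [2.63, 2.58, 2.60, 2.48, 2.26, 2.38, 2.26]          # y: coral growth

n = len(temp)
xbar = sum(temp) / n
ybar = sum(growth) / n
# sample standard deviations (divisor n - 1)
sx = sqrt(sum((x - xbar) ** 2 for x in temp) / (n - 1))
sy = sqrt(sum((y - ybar) ** 2 for y in growth) / (n - 1))
r = sum((x - xbar) * (y - ybar)
        for x, y in zip(temp, growth)) / ((n - 1) * sx * sy)

b = r * sy / sx          # slope ≈ -0.3276
a = ybar - b * xbar      # intercept ≈ 12.376
print(round(b, 4), round(a, 4))
```

Carrying full precision reproduces the part b) values; the hand calculation in a) differs slightly because it rounds the intermediate statistics first.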
Reading our data
```r
# Step 1. Entering data;
# url of coral growth data;
coral_url=
"https://mcs.utm.utoronto.ca/~nosedal/data/coral.txt"
# importing data into R;
data = read.table(coral_url, header = TRUE);
```
Least-squares Regression Line
```r
response=data$Coral_growth;
explanatory=data$Avg_summer;
coral.reg=lm(response~explanatory);
```
Means
```r
# Finding means;
mean(response);
## [1] 2.515714
mean(explanatory);
## [1] 30.28
```
Standard deviations and r
```r
# Finding standard deviations and r;
sd(response);
## [1] 0.15076
sd(explanatory);
## [1] 0.4296122
cor(explanatory, response);
## [1] -0.8635908
```
R code
```r
names(coral.reg);
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
```
a and b
```r
coral.reg$coef;
## (Intercept) explanatory
##  11.6921347  -0.3030522
```
Do heavier people burn more energy?
We have data on the lean body mass and resting metabolic rate for 12 women who are subjects in a study of dieting. Lean body mass, given in kilograms, is a person's weight leaving out all fat. Metabolic rate, in calories burned per 24 hours, is the rate at which the body consumes energy.
| Mass | Rate | Mass | Rate |
|---|---|---|---|
| 36.1 | 995 | 40.3 | 1189 |
| 54.6 | 1425 | 33.1 | 913 |
| 48.5 | 1396 | 42.4 | 1124 |
| 42.0 | 1418 | 34.5 | 1052 |
| 50.6 | 1502 | 51.1 | 1347 |
| 42.0 | 1256 | 41.2 | 1204 |
a) Make a scatterplot that shows how metabolic rate depends on body mass. There is a quite strong linear relationship, with correlation r = 0.876.
b) Find the least-squares regression line for predicting metabolic rate from body mass. Add this line to your scatterplot.
c) Explain in words what the slope of the regression line tells us.
d) Another woman has a lean body mass of 45 kilograms. What is her predicted metabolic rate?
Scatterplot
[Figure: scatterplot of Metabolic Rate (calories/day) against Lean Body Mass (kg).]
Solutions
b) The regression equation is

ŷ = 201.2 + 24.026x

where y = metabolic rate and x = body mass.
c) The slope tells us that, on the average, metabolic rate increases by about 24 calories per day for each additional kilogram of body mass.
d) For x = 45 kg, the predicted metabolic rate is ŷ = 1282.4 calories per day.
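As a cross-check on b) and d), a short Python sketch that fits the least-squares line from the 12 data points using the textbook formulas (the slides do the same fit with R's `lm`):

```python
mass = [36.1, 54.6, 48.5, 42.0, 50.6, 42.0, 40.3, 33.1, 42.4, 34.5, 51.1, 41.2]
rate = [995, 1425, 1396, 1418, 1502, 1256, 1189, 913, 1124, 1052, 1347, 1204]

n = len(mass)
xbar = sum(mass) / n
ybar = sum(rate) / n
sxy = sum((x - xbar) * (y - ybar) for x, y in zip(mass, rate))
sxx = sum((x - xbar) ** 2 for x in mass)

b = sxy / sxx        # slope ≈ 24.026 calories/day per kg
a = ybar - b * xbar  # intercept ≈ 201.16
print(a + b * 45)    # predicted rate for a 45 kg woman, ≈ 1282.3
```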
R code
```r
# Step 1. Entering data;
mass=c(36.1, 54.6, 48.5, 42.0, 50.6, 42.0,
40.3, 33.1, 42.4, 34.5, 51.1, 41.2);
rate=c(995, 1425, 1396, 1418, 1502, 1256,
1189, 913, 1124, 1052, 1347, 1204);
```
R code
```r
# Step 2. Making scatterplot;
plot(mass, rate, pch=19, col="blue",
xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)");
```
Scatterplot
[Figure: scatterplot of Metabolic Rate (calories/day) against Lean Body Mass (kg), produced by the R code above.]
Regression Equation (R Code)
```r
# Step 3. Finding Regression Equation;
metabolic.reg=lm(rate~mass);
```
a and b
```r
metabolic.reg$coef;
## (Intercept)      mass
##   201.16160  24.02607
```
Scatterplot with least-squares line
```r
plot(mass,rate,
pch=19,col="blue", xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)");
abline(metabolic.reg$coef, col="red");
```
Scatterplot with least-squares line
[Figure: scatterplot of Metabolic Rate against Lean Body Mass with the least-squares line added.]
Prediction
```r
new<-data.frame(mass=45);
predict(metabolic.reg,newdata=new);
##        1
## 1282.335
```
Facts about Least-Squares Regression
1. The distinction between explanatory and response variables is essential in regression.
2. The least-squares regression line always passes through the point (x̄, ȳ) on the graph of y against x.
3. The square of the correlation, r², is the fraction of the variation in the values of y that is explained by the least-squares regression of y on x.
What’s my grade?
In Professor Krugman's economics course the correlation between the students' total scores prior to the final examination and their final-examination scores is r = 0.5. The pre-exam totals for all students in the course have mean 280 and standard deviation 40. The final-exam scores have mean 75 and standard deviation 8. Professor Krugman has lost Julie's final exam but knows that her total before the exam was 300. He decides to predict her final-exam score from her pre-exam total.
a) What is the slope of the least-squares regression line of final-exam scores on pre-exam total scores in this course? What is the intercept?
b) Use the regression line to predict Julie's final-exam score.
c) Julie doesn't think this method accurately predicts how well she did on the final exam. Use r² to argue that her actual score could have been much higher (or much lower) than the predicted value.
Solutions
a) b = r (Sy / Sx) = (0.5)(8/40) = 0.1

a = ȳ − b x̄ = 75 − (0.1)(280) = 47. Hence, the regression equation is ŷ = 47 + 0.1x.
b) Julie's pre-final exam total was 300, so we would predict a final-exam score of

ŷ = 47 + (0.1)(300) = 77.
c) Julie is right: with a correlation of r = 0.5, r² = 0.25, so the regression line accounts for only 25% of the variability in student final-exam scores. That is, the regression line doesn't predict final-exam scores very well. Julie's score could, indeed, be much higher or lower than the predicted 77.
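The arithmetic above can be sketched directly from the summary statistics; a minimal Python check (no raw data needed, since the formulas use only r, the means, and the standard deviations):

```python
r, sx, sy = 0.5, 40, 8     # correlation and standard deviations
xbar, ybar = 280, 75       # pre-exam and final-exam means

b = r * sy / sx            # slope: 0.1
a = ybar - b * xbar        # intercept: 47
prediction = a + b * 300   # Julie's predicted final-exam score: 77
r_squared = r ** 2         # 0.25: only 25% of variation explained
print(b, a, prediction, r_squared)
```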
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is, a residual is the prediction error that remains after we have chosen the regression line:

residual = observed y − predicted y

residual = y − ŷ.
Residual Plots
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess how well a regression line fits the data.
Residuals by hand
You have already found the equation of the least-squares line for predicting coral growth y from mean sea surface temperature x.
a) Use the equation to obtain the 7 residuals step-by-step. That is, find the prediction ŷ for each observation and then find the residual y − ŷ.
b) Check that (up to roundoff error) the residuals add to 0.
c) The residuals are the part of the response y left over after the straight-line tie between y and x is removed. Show that the correlation between the residuals and x is 0 (up to roundoff error). That this correlation is always 0 is another special property of least-squares regression.
```r
coral.coeff=coral.reg$coeff;
coral.coeff;
## (Intercept) explanatory
##  11.6921347  -0.3030522
coral.residuals=coral.reg$residuals;
coral.residuals[1];
##          1
## -0.0675456
coral.residuals[2];
##           2
## -0.05996569
```
Solutions (residuals by hand)
a) The residuals are computed in the table below using ŷ = −0.3030522x + 11.6921347.

| xi | yi | ŷi | yi − ŷi |
|---|---|---|---|
| 29.68 | 2.63 | 2.6975456 | −0.0675456 |
| 29.87 | 2.58 | 2.6399657 | −0.0599657 |
| 30.16 | 2.60 | 2.5520805 | 0.1279195 |
| 30.22 | 2.48 | 2.5338974 | 0.0661026 |
| 30.48 | 2.26 | 2.4551038 | 0.0248962 |
| 30.65 | 2.38 | 2.403585 | −0.023585 |
| 30.90 | 2.26 | 2.3278219 | −0.0678219 |

b) Σ(yi − ŷi) = 5.5511151 × 10⁻¹⁷ (they sum to zero, except for rounding error).
c) From software, the correlation between xi and yi − ŷi is −0.0000854, which is zero except for rounding.
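Both properties in b) and c) hold for any least-squares fit with an intercept. A self-contained Python sketch, refitting the line from the printed coral data so the check does not depend on rounded coefficients:

```python
temp   = [29.68, 29.87, 30.16, 30.22, 30.48, 30.65, 30.90]
growth = [2.63, 2.58, 2.60, 2.48, 2.26, 2.38, 2.26]

n = len(temp)
xbar = sum(temp) / n
ybar = sum(growth) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(temp, growth)) / \
    sum((x - xbar) ** 2 for x in temp)
a = ybar - b * xbar

residuals = [y - (a + b * x) for x, y in zip(temp, growth)]
print(sum(residuals))  # essentially 0 (rounding error only)
# numerator of the correlation between x and the residuals:
print(sum((x - xbar) * e for x, e in zip(temp, residuals)))  # also essentially 0
```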
Do heavier people burn more energy?
Return to the example about lean body mass and metabolic rate. We will use these data to illustrate influence.
a) Make a scatterplot of the data that is suitable for predicting metabolic rate from body mass, with two new points added. Point A: mass 42 kilograms, metabolic rate 1500 calories. Point B: mass 70 kilograms, metabolic rate 1400 calories. In which direction is each of these points an outlier?
b) Add three least-squares regression lines to your plot: for the original 12 women, for the original women plus Point A, and for the original women plus Point B. Which new point is more influential for the regression line? Explain in simple language why each new point moves the line in the way your graph shows.
Reading our data
```r
# Step 1. Entering data;
# url of metabolic rate data;
meta_url=
"https://mcs.utm.utoronto.ca/~nosedal/data/metabolic.txt"
# importing data into R;
data = read.table(meta_url, header = TRUE);
```
Scatterplot
```r
plot(data,pch=19,col="blue",
xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)");
```
Scatterplot
[Figure: scatterplot of Metabolic Rate (calories/day) against Lean Body Mass (kg).]
Least-Squares Regression Line
```r
# Step 3. Finding L-S Regression Line;
mod=lm(data$Rate~data$Mass);
```
Scatterplot + L-S Regression Line
```r
plot(data,pch=19,col="blue",
xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)");
abline(mod$coeff,col="red",lty=2);
# abline tells R to add a line to your
# scatterplot;
# lty= 2 is used to draw a dashed-line;
```
Scatterplot + L-S Regression Line
[Figure: scatterplot of Metabolic Rate against Lean Body Mass with the least-squares line shown dashed.]
Scatterplot + A +B
```r
plot(data,pch=19,col="blue",
xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)",
xlim=c(30,70),ylim=c(850,1600));
points(42,1500,pch="A",col="red");
#point A;
points(70,1400,pch="B",col="green");
#point B;
```
[Figure: scatterplot of Metabolic Rate against Lean Body Mass with Points A and B added.]
Least-Squares Regression Lines
```r
# Step 3. Finding L-S Regression Line;
mod=lm(data$Rate~data$Mass);
# original;
modA=lm(c(data$Rate,1500)~c(data$Mass,42));
# point A;
modB=lm(c(data$Rate,1400)~c(data$Mass,70));
# point B;
```
Scatterplot + A +B + L-S Regression Lines
```r
plot(data,pch=19,col="blue",
xlab="Lean Body Mass (kg)",
ylab="Metabolic Rate (calories/day)",
xlim=c(30,70),ylim=c(850,1600));
points(42,1500,pch="A",col="red");
points(70,1400,pch="B",col="green");
abline(mod$coeff,col="blue",lty=2);
abline(modA$coeff,col="red",lty=2);
abline(modB$coeff,col="green",lty=2);
```
[Figure: scatterplot with Points A and B and the three least-squares lines.]
Adding a legend
```r
legend("bottomright",
c("original","original + A","original + B"),
col=c("blue","red","green"),
lty=c(2,2,2),bty="n");
```
[Figure: scatterplot with Points A and B, the three least-squares lines, and a legend (original, original + A, original + B).]
Solutions
a) Point A lies above the other points; that is, the metabolic rate is higher than we expect for the given body mass. Point B lies to the right of the other points; that is, it is an outlier in the x (mass) direction, and the metabolic rate is lower than we would expect.
b) In the plot, the dashed blue line is the regression line for the original data. The dashed red line slightly above that includes Point A; it has a very similar slope to the original line, but a slightly higher intercept, because Point A pulls the line up. The third line includes Point B, the more influential point; because Point B is an outlier in the x direction, it "pulls" the line down so that it is less steep.
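The same comparison can be run numerically. A Python sketch refitting the line with each added point (mirroring the `mod`, `modA`, `modB` fits done in R above, but with the textbook formulas):

```python
mass = [36.1, 54.6, 48.5, 42.0, 50.6, 42.0, 40.3, 33.1, 42.4, 34.5, 51.1, 41.2]
rate = [995, 1425, 1396, 1418, 1502, 1256, 1189, 913, 1124, 1052, 1347, 1204]

def ls_fit(x, y):
    """Least-squares slope and intercept."""
    n = len(x)
    xbar, ybar = sum(x) / n, sum(y) / n
    b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / \
        sum((xi - xbar) ** 2 for xi in x)
    return b, ybar - b * xbar

b0, a0 = ls_fit(mass, rate)                  # original 12 women
bA, aA = ls_fit(mass + [42], rate + [1500])  # with Point A
bB, aB = ls_fit(mass + [70], rate + [1400])  # with Point B

# Point A (near the mean mass) barely changes the slope but raises the line;
# Point B (an outlier in x) pulls the slope down much more.
print(b0, bA, bB)
```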
Influential observations
An observation is influential for a statistical calculation if removing it would markedly change the result of the calculation. The result of a statistical calculation may be of little practical use if it depends strongly on a few influential observations.
Points that are outliers in either the x or the y direction of a scatterplot are often influential for the correlation. Points that are outliers in the x direction are often influential for the least-squares regression line.
Example
The number of people living on American farms declined steadily during the last century. Here are data on the farm population (millions of persons) from 1935 to 1980:
| Year | Population |
|---|---|
| 1935 | 32.11 |
| 1940 | 30.5 |
| 1945 | 24.4 |
| 1950 | 23.0 |
| 1955 | 19.1 |
| 1960 | 15.6 |
| 1965 | 12.4 |
| 1970 | 9.7 |
| 1975 | 8.9 |
| 1980 | 7.2 |
Example
a) Make a scatterplot of these data and find the least-squares regression line of farm population on year.
b) According to the regression line, how much did the farm population decline each year on the average during this period? What percent of the observed variation in farm population is accounted for by linear change over time?
c) Use the regression equation (trendline) to predict the number of people living on farms in 1990. Is this result reasonable? Why?
R Code
```r
# Step 1. Entering Data;
year=seq(1935,1980,by=5);
population=c(32.11,30.5,24.4,23.0,19.1,
15.6,12.4,9.7,8.9,7.2);
# seq creates a sequence of numbers;
# which starts at 1935 and ends at 1980;
# we want a distance of 5 between numbers;
```
R Code, L-S Line
```r
least.squares=lm(population~year);
least.squares
##
## Call:
## lm(formula = population ~ year)
##
## Coefficients:
## (Intercept)        year
##   1167.1418     -0.5869
```
R Code, L-S Line
```r
cor(year,population);
## [1] -0.9884489
```
Scatterplot
```r
plot(year,population,pch=19);
abline(least.squares$coeff,col="red");
# pch=19 tells R to draw solid circles;
# abline tells R to add trendline;
```
Scatterplot
[Figure: scatterplot of population against year with the trendline added.]
Solution
a) The scatterplot shows a strong negative association with a straight-line pattern. The regression line (trendline) is ŷ = 1167.14 − 0.587x.
b) The slope: the farm population declined by about 0.587 million (587,000) people per year, on the average, during this period. Because r ≈ −0.9884, the regression line explains r² ≈ 97.7% of the variation in population.
c) Substituting x = 1990 gives ŷ = 1167.14 − 0.587(1990) = −0.99, an impossible result, because a population must be greater than or equal to 0. The rate of decrease in the farm population dropped in the 1980s. Beware of extrapolation.
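A Python cross-check of the slope, the intercept, and the impossible 1990 extrapolation (the slides obtain the same coefficients with R's `lm`):

```python
year = list(range(1935, 1985, 5))  # 1935, 1940, ..., 1980
population = [32.11, 30.5, 24.4, 23.0, 19.1, 15.6, 12.4, 9.7, 8.9, 7.2]

n = len(year)
xbar = sum(year) / n
ybar = sum(population) / n
b = sum((x - xbar) * (y - ybar) for x, y in zip(year, population)) / \
    sum((x - xbar) ** 2 for x in year)
a = ybar - b * xbar   # slope ≈ -0.587, intercept ≈ 1167.14

print(a + b * 1990)   # negative: an impossible farm population
```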
The endangered manatee
The table shown below gives 33 years of data on boats registered in Florida and manatees killed by boats. If we made a scatterplot for this data set, it would show a strong positive linear relationship. The correlation is r = 0.951.
a) Find the equation of the least-squares line for predicting manatees killed from thousands of boats registered. Because the linear pattern is so strong, we expect predictions from this line to be quite accurate, but only if conditions in Florida remain similar to those of the past 33 years.
b) In 2009, experts predicted that the number of boats registered in Florida would be 975,000 in 2010. How many manatees do you predict would be killed by boats if there are 975,000 boats registered? Explain why we can trust this prediction.
c) Predict manatee deaths if there were no boats registered in Florida. Explain why the predicted count of deaths is impossible.
Table
Year  Boats  Manatees      Year  Boats  Manatees
1977   447     13          1988   675     43
1978   460     21          1989   711     50
1979   481     24          1990   719     47
1980   498     16          1991   681     53
1981   513     24          1992   679     38
1982   512     20          1993   678     35
1983   526     15          1994   696     49
1984   559     34          1995   713     42
1985   585     33          1996   732     60
1986   614     33          1997   755     54
1987   645     39          1998   809     66
Table (cont.)
Year  Boats  Manatees
1999   830     82
2000   880     78
2001   944     81
2002   962     95
2003   978     73
2004   983     69
2005  1010     79
2006  1024     92
2007  1027     73
2008  1010     90
2009   982     97
Solutions
a) The regression line is y = −43.172 + 0.129x.
b) If 975,000 boats are registered, then by our scale, x = 975, and y = −43.172 + (0.129)(975) = 82.6 manatees killed. The prediction seems reasonable, as long as conditions remain the same, because 975 is within the range of observed values of x on which the regression line was based. That is, this is not extrapolation.
c) If x = 0 (corresponding to no registered boats), then we would "predict" −43.172 manatees to be killed by boats. This is absurd, because it is clearly impossible for fewer than 0 manatees to be killed. Note that x = 0 is well outside the range of observed values of x on which the regression line was based.
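The arithmetic in parts (b) and (c) can be reproduced with a few lines of R (a sketch using the coefficients stated above):

```r
# Least-squares line from part (a): manatees-hat = -43.172 + 0.129 * boats (thousands)
a <- -43.172
b <- 0.129
pred.975 <- a + b * 975  # interpolation: 82.603, about 82.6 manatees killed
pred.0   <- a + b * 0    # x = 0 is far outside the data: an absurd -43.172 deaths
```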
Extrapolation
Extrapolation is the use of a regression line for prediction far outside the range of values of the explanatory variable x that you used to obtain the line. Such predictions are often not accurate.
Association does not imply causation
An association between an explanatory variable x and a response variable y, even if it is very strong, is not by itself good evidence that changes in x actually cause changes in y.
Example
Measure the number of television sets per person x and the average life expectancy y for the world's nations. There is a high positive correlation: nations with many TV sets have higher life expectancies.
The basic meaning of causation is that by changing x we can bring about a change in y. Could we lengthen the lives of people in Rwanda by shipping them TV sets? No. Rich nations have more TV sets than poor nations. Rich nations also have longer life expectancies because they offer better nutrition, clean water, and better health care. There is no cause-and-effect tie between TV sets and length of life.
Is math the key to success in college?
A College Board study of 15,941 high school graduates found a strong correlation between how much math minority students took in high school and their later success in college. News articles quoted the head of the College Board as saying that "Math is the gatekeeper for success in college." Maybe so, but we should also think about lurking variables. What might lead minority students to take more or fewer high school math courses? Would these same factors influence success in college?
Solution
A student's intelligence may be a lurking variable: stronger students (who are more likely to succeed when they get to college) are more likely to choose to take these math courses, while weaker students may avoid them. Other possible answers are variations on this idea; for example, success in college may depend on a student's self-confidence, and confident students may be more likely to choose math courses.
Lurking Variable
A lurking variable is a variable that is not among the explanatory or response variables in a study and yet may influence the interpretation of relationships among those variables.
Another example
There is some evidence that drinking moderate amounts of wine helps prevent heart attacks. The table shown below gives data on yearly wine consumption (liters of alcohol from drinking wine, per person) and yearly deaths from heart disease (deaths per 100,000 people) in 19 developed nations.
Another example
a) Make a scatterplot that shows how national wine consumption helps explain heart disease death rates.
b) Describe the form of the relationship. Is there a linear pattern? How strong is the relationship?
c) Is the direction of the association positive or negative? Explain in simple language what this says about wine and heart disease. Do you think these data give good evidence that drinking wine causes a reduction in heart disease deaths? Why?
Table
Country    Alcohol   Heart       Country           Alcohol   Heart
           from      disease                       from      disease
           wine      deaths                        wine      deaths
Australia    2.5      211        Netherlands         1.8      167
Austria      3.9      167        New Zealand         1.8      266
Belgium      2.9      131        Norway              0.8      227
Canada       2.4      191        Spain               6.5       86
Denmark      2.9      220        Sweden              1.6      207
Finland      0.8      297        Switzerland         5.8      115
France       9.1       71        United Kingdom      1.3      285
Iceland      0.8      211        United States       1.2      199
Ireland      0.7      300        West Germany        2.7      172
Italy        7.9      107
Solution (Entering data)
# Step 1. Entering data;
consumption=c(2.5, 3.9, 2.9, 2.4, 2.9, 0.8, 9.1,
0.8, 0.7, 7.9, 1.8, 1.9, 0.8, 6.5, 1.6, 5.8, 1.3, 1.2, 2.7);
death.rates=c(211, 167, 131, 191, 220, 297, 71,
211, 300, 107,167, 266, 227, 86, 207, 115, 285, 199, 172);
Scatterplot (R code)
plot(consumption,death.rates);
Scatterplot (R code)
[Scatterplot of death.rates vs. consumption.]
Another example (cont.)
Our table gives data on wine consumption and heart disease death rates in 19 countries. A scatterplot shows a moderately strong relationship.
a) The correlation for these variables is r = −0.843. What does a negative correlation say about wine consumption and heart disease deaths?
b) The least-squares regression line for predicting heart disease death rate from wine consumption is

y = 260.56 − 22.969x

Verify this using R. Then use this equation to predict the heart disease death rate in another country where adults average 4 liters of alcohol from wine each year.
a) Finding correlation
cor(consumption,death.rates);
## [1] -0.8428127
Least-squares Regression Line
explanatory<-consumption;
response<-death.rates;
wine.reg<-lm(response~explanatory);
R code
names(wine.reg);
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
a and b
wine.reg$coef;
## (Intercept) explanatory
## 260.56338 -22.96877
Prediction
wine.reg$coef[1]+wine.reg$coef[2]*4;
## (Intercept)
## 168.6883
Prediction (again...)
new=data.frame(explanatory=4);
predict(wine.reg,newdata=new);
## 1
## 168.6883
c) The association is negative: countries with high wine consumption have fewer heart disease deaths, while low wine consumption tends to go with more deaths from heart disease. This does not prove causation; there may be some other reason for the link.
Our main example
One effect of global warming is to increase the flow of water into the Arctic Ocean from rivers. Such an increase might have major effects on the world's climate. Six rivers (Yenisey, Lena, Ob, Pechora, Kolyma, and Severnaya Dvina) drain two-thirds of the Arctic in Europe and Asia. Several of these are among the largest rivers on earth. The file arctic-rivers.txt contains the total discharge from these rivers each year from 1936 to 1999. Discharge is measured in cubic kilometers of water.
Reading our data
# url of arctic rivers data;
riv_url=
"https://mcs.utm.utoronto.ca/~nosedal/data/arctic-rivers.txt"
# importing data into R;
arctic_rivers = read.table(riv_url, header = TRUE);
Scatterplot (R code)
plot(arctic_rivers$Year,arctic_rivers$Discharge);
Scatterplot (R code)
[Scatterplot of arctic_rivers$Discharge vs. arctic_rivers$Year.]
Scatterplot (R code)
plot(arctic_rivers$Year,arctic_rivers$Discharge,
pch=19,col="blue");
Scatterplot (R code)
[Same scatterplot with solid blue points (pch=19, col="blue").]
Scatterplot (R code)
plot(arctic_rivers$Year,arctic_rivers$Discharge,
pch=19,col="blue", xlab="Year",
ylab="Discharge");
Scatterplot (R code)
[Scatterplot of Discharge vs. Year with relabeled axes.]
The scatterplot shows a weak positive, linear relationship.
Our main example
r=cor(arctic_rivers$Year,arctic_rivers$Discharge);
r;
## [1] 0.3343926
The scatterplot shows a weak positive, linear relationship, which is confirmed by r = 0.3343926.
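Squaring r (a side calculation, not on the slide) makes the weakness concrete: year explains only about 11% of the variation in discharge.

```r
r <- 0.3343926   # correlation between Year and Discharge reported above
r.squared <- r^2 # fraction of variation in discharge explained by year
r.squared        # about 0.112, i.e. roughly 11%
```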
R code
explanatory=arctic_rivers$Year;
response=arctic_rivers$Discharge;
rivers.reg=lm(response~explanatory);
R code
names(rivers.reg);
## [1] "coefficients" "residuals" "effects" "rank"
## [5] "fitted.values" "assign" "qr" "df.residual"
## [9] "xlevels" "call" "terms" "model"
a and b
rivers.reg$coef;
## (Intercept) explanatory
## -2056.769460 1.966163
Scatterplot with least-squares line
plot(explanatory,response,
pch=19,col="blue", xlab="Year",
ylab="Discharge");
abline(rivers.reg$coef, col="red");
Scatterplot with least-squares line
[Scatterplot of Discharge vs. Year with the least-squares line in red.]
Residuals
A residual is the difference between an observed value of the response variable and the value predicted by the regression line. That is,

residual = observed y − predicted y = y − ŷ.
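A small self-contained check (with made-up data, not the slides' data) that R's resid() matches this definition:

```r
# Illustrative data; any x-y pairs would do
x <- c(1, 2, 3, 4, 5)
y <- c(2.1, 3.9, 6.2, 7.8, 10.1)
fit <- lm(y ~ x)
by.hand <- y - fitted(fit)   # residual = observed y - predicted y
all.equal(as.numeric(by.hand), as.numeric(resid(fit)))  # TRUE
```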
Scatterplot with residual line segments
plot(explanatory,response,
pch=19,col="blue", xlab="Year",
ylab="Discharge");
abline(rivers.reg$coef, col="red");
segments(explanatory, fitted(rivers.reg),
explanatory,response, lty=2, col="black");
Scatterplot with residual line segments
[Scatterplot with dashed segments from each point to the least-squares line, showing the residuals.]
Residual Plots
A residual plot is a scatterplot of the regression residuals against the explanatory variable. Residual plots help us assess the fit of a regression line.
A residual plot magnifies the deviations of the points from the line and makes it easier to see unusual observations and patterns.
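One reason the horizontal line at 0 is the natural reference: least-squares residuals from a model with an intercept always average to zero. A quick illustration with made-up data:

```r
# Illustrative data (not from the slides)
x <- c(1, 2, 3, 4, 5)
y <- c(1.8, 4.2, 5.9, 8.3, 9.8)
fit <- lm(y ~ x)
mean(resid(fit))  # essentially 0, up to rounding error
```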
Residual plot
plot(explanatory,resid(rivers.reg),
pch=19,col="blue", xlab="Year",
ylab="Residual");
abline(h=0, col="red",lty=2);
Residual plot
[Residual plot: residuals vs. Year, with a dashed horizontal line at 0.]
Example: Counting carnivores
Ecologists look at data to learn about nature's patterns. One pattern they have found relates the size of a carnivore (body mass in kilograms) to how many of those carnivores there are in an area. The right measure of "how many" is to count carnivores per 10,000 kilograms of their prey in the area. Below we show a table that gives data for 25 carnivore species. To see the pattern, plot carnivore abundance against body mass. Biologists often find that patterns involving sizes and counts are simpler when we plot the logarithms of the data.
Table: Size and abundance of carnivores
Carnivore species       Body mass (kg)   Abundance
Least weasel                 0.14         1656.49
Ermine                       0.16          406.66
Small Indian mongoose        0.55          514.84
Pine marten                  1.3            31.84
Kit fox                      2.02           15.96
Channel Island fox           2.16          145.94
Arctic fox                   3.19           21.63
Red fox                      4.6            32.21
Bobcat                      10               9.75
Canadian lynx               11.2             4.79
European badger             13               7.35
Coyote                      13              11.65
Ethiopian wolf              14.5             2.7
Table: Size and abundance of carnivores
Carnivore species       Body mass (kg)   Abundance
Eurasian lynx               20               0.46
Wild dog                    25               1.61
Dhole                       25               0.81
Snow leopard                40               1.89
Wolf                        46               0.62
Leopard                     46.5             6.17
Cheetah                     50               2.29
Puma                        51.9             0.94
Spotted hyena               58.6             0.68
Lion                       142               3.4
Tiger                      181               0.33
Polar bear                 310               0.6
Reading our data
# Step 1. Entering data;
# url of carnivores;
carnivores_url=
"https://mcs.utm.utoronto.ca/~nosedal/data/carnivores.txt"
# importing data into R;
carnivores = read.table(carnivores_url, header = TRUE);
Abundance vs Body Mass
# Step 2. Making scatterplot;
plot(carnivores$B.mass,carnivores$Abundance,pch=19,
col="blue",xlab="Carnivore body mass(kgs)",ylab="Abundance",
main=" ");
# main adds title to graph;
Abundance vs Body Mass
[Scatterplot of Abundance vs. carnivore body mass (kg).]
log(Abundance) vs log(Body Mass)
# Step 2. Making scatterplot;
plot(log10(carnivores$B.mass),log10(carnivores$Abundance),
pch=19, col="blue",xlab="log(Body mass)",
ylab="log(Abundance)", main=" ");
# main adds title to graph;
log(Abundance) vs log(Body Mass)
[Scatterplot of log(Abundance) vs. log(Body mass).]
This scatterplot shows a moderately strong negative association. Bigger carnivores are less abundant. The form of the association is linear. It is striking that animals from many different parts of the world should fit so simple a pattern. We could use the straight-line pattern to predict the abundance of another carnivore species from its body mass (Homework?).
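A sketch of how that prediction might go, assuming we enter the tables' values by hand: fit the least-squares line on the log10 scale, then back-transform. The 30-kg body mass below is a hypothetical example, not a value from the data.

```r
# Body mass (kg) and abundance for the 25 species in the tables above
b.mass <- c(0.14, 0.16, 0.55, 1.3, 2.02, 2.16, 3.19, 4.6, 10, 11.2, 13, 13,
            14.5, 20, 25, 25, 40, 46, 46.5, 50, 51.9, 58.6, 142, 181, 310)
abundance <- c(1656.49, 406.66, 514.84, 31.84, 15.96, 145.94, 21.63, 32.21,
               9.75, 4.79, 7.35, 11.65, 2.7, 0.46, 1.61, 0.81, 1.89, 0.62,
               6.17, 2.29, 0.94, 0.68, 3.4, 0.33, 0.6)
# Least-squares line on the log10-log10 scale
log.fit <- lm(log10(abundance) ~ log10(b.mass))
log.fit$coef
# Predicted abundance for a hypothetical 30-kg carnivore, back-transformed
10^(log.fit$coef[1] + log.fit$coef[2] * log10(30))
```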