© buddy freeman, 2015 simple linear regression and correlation
TRANSCRIPT
© Buddy Freeman, 2015
Simple Linear Regression
and
Correlation
Tennessee Energy Consumption Estimates (taken from http://www.eia.doe.gov/emeu/states/sep_use/tra/use_tra_tn.html)

Year    Motor Gasoline           Ethanol
        (in 1000s of barrels)    (in 1000s of barrels)
1983    53,310                     281
1984    56,348                     592
1985    57,068                     686
1986    59,317                     857
1987    56,506                   1,277
1988    58,224                   1,410
1989    58,937                   1,079
1990    56,954                     583
1991    55,187                     426
1992    57,667                     516
1993    60,286                     593
1994    62,062                     841
1995    63,907                     358
1996    63,928                       7
1997    65,162                       7
1998    66,842                       8
1999    69,151                       0
2000    68,252                       0
2001    67,385                       0

Correlation question: From 1983 to 2001 in the state of Tennessee, were motor gasoline consumption and ethanol consumption significantly related to each other?
In a correlation problem, one is interested in measuring the strength of the relationship between variables.
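As a rough illustration of "measuring the strength of the relationship", the sketch below computes a sample correlation coefficient for the two Tennessee series. The choice of the Pearson coefficient and the helper name `pearson_r` are assumptions made for this example; the slides do not name a specific measure at this point.

```python
from math import sqrt

# Tennessee estimates from the table above, 1983 through 2001 (1000s of barrels).
gasoline = [53310, 56348, 57068, 59317, 56506, 58224, 58937, 56954, 55187, 57667,
            60286, 62062, 63907, 63928, 65162, 66842, 69151, 68252, 67385]
ethanol = [281, 592, 686, 857, 1277, 1410, 1079, 583, 426, 516,
           593, 841, 358, 7, 7, 8, 0, 0, 0]

def pearson_r(xs, ys):
    """Hypothetical helper: covariation / sqrt(variation in x * variation in y)."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    ss_x = sum((x - x_bar) ** 2 for x in xs)
    ss_y = sum((y - y_bar) ** 2 for y in ys)
    return cov_xy / sqrt(ss_x * ss_y)

r = pearson_r(gasoline, ethanol)
print(f"r = {r:.3f}")
```

A value of r near -1 or +1 would suggest a strong linear relationship. Whether it is statistically significant is a separate question that needs a test.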
Regression question: From 1983 to 2001 in the state of Tennessee, could the ethanol consumption in one year have been used to predict motor gasoline consumption in the following year?
In a regression problem, one is interested in predicting one variable (called the dependent variable) based on another variable (called the independent variable).
Simple Linear Regression and Correlation
The Key Word: Linear
A Straight Line
What is the equation for a straight line?
Do you recall $y = mx + b$?
What is $m$? Answer: the slope
What is $b$? Answer: the y-intercept
x is the independent variable, and y is the dependent variable.
In the text, the equation is given by:
$\hat{y} = b_0 + b_1 x$
The General Simple Linear Regression Problem
Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the “best” fit to these points.
$\hat{Y}_i = b_0 + b_1 x_i$

i    X_i    Y_i
1    X_1    Y_1
2    X_2    Y_2
.    .      .
.    .      .
.    .      .
n    X_n    Y_n
Visually
[Figure: scatter plot of the sample points, X on the horizontal axis and Y on the vertical axis, with the line $\hat{Y}_i = b_0 + b_1 x_i$]
What does “best” mean? By “best” we mean the smallest error in prediction.
Error Defined
[Figure: scatter plot of Y against X with the line $\hat{Y}_i = b_0 + b_1 x_i$; a brace marks the vertical distance between one point and the line]
If one picks an arbitrary point in the random sample, $(X_i, Y_i)$, how “far” is the point from the line $\hat{Y}_i = b_0 + b_1 x_i$?
$Y_i$ is the actual y-value.
$\hat{Y}_i = b_0 + b_1 x_i$ is the predicted y-value (the value on the line).
The error is the difference between $Y_i$ and $\hat{Y}_i$.
Error = $Y_i - \hat{Y}_i$
General Problem Restated
[Figure: scatter plot of Y against X with the line; braces mark the errors $Y_i - \hat{Y}_i$ for points above and below the line]
Error = $Y_i - \hat{Y}_i$
Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the smallest error over all the sample.
What would you want $\sum (Y_i - \hat{Y}_i)$ to be?
The errors for the points above the line should balance the errors for the points below the line, resulting in a sum of zero.
Unfortunately, there are an infinite number of lines possessing this property. Any line that passes through the point $(\bar{X}, \bar{Y})$ will have this property, because it is a property of the mean.
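The claim that every line through the point of means has errors summing to zero can be checked numerically. The sketch below uses a small made-up sample (not data from these slides) and a hypothetical helper `sum_of_errors` that builds a line of any slope through $(\bar{X}, \bar{Y})$.

```python
# Hypothetical sample, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 10]

x_bar = sum(xs) / len(xs)
y_bar = sum(ys) / len(ys)

def sum_of_errors(slope):
    """Sum of (Y_i - Y_hat_i) for the line with this slope through (x_bar, y_bar)."""
    intercept = y_bar - slope * x_bar
    return sum(y - (intercept + slope * x) for x, y in zip(xs, ys))

# Very different slopes, yet the errors cancel (up to rounding) every time.
for slope in (-2.0, 0.0, 1.5, 10.0):
    print(slope, round(sum_of_errors(slope), 9))
```

Because a zero sum cannot distinguish among these lines, a different criterion is needed to pick a single "best" one.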
General Problem Restated in Terms of Least Squares
[Figure: scatter plot of Y against X with the line; braces mark the errors $Y_i - \hat{Y}_i$ for points above and below the line]
Error = $Y_i - \hat{Y}_i$
Given a random sample of the related x and y values, find the value of the slope and the value of the y-intercept that yields the smallest sum of the squares of the errors (SSE) over all the sample.
Find the value of $b_0$ and the value of $b_1$ that will minimize
$SSE = \sum (Y_i - \hat{Y}_i)^2$
where $\hat{Y}_i = b_0 + b_1 x_i$.
Solution of the Least Squares Problem
[Figure: scatter plot of Y against X with the line; braces mark the errors $Y_i - \hat{Y}_i$ for points above and below the line]
Error = $Y_i - \hat{Y}_i$
Find the value of $b_0$ and the value of $b_1$ that will minimize
$SSE = \sum (Y_i - \hat{Y}_i)^2$
where $\hat{Y}_i = b_0 + b_1 x_i$, so that
$SSE = \sum (Y_i - b_0 - b_1 x_i)^2$
Noting that SSE is a function of two variables, we can restate the problem once again.
Find the value of $b_0$ and the value of $b_1$ that will minimize
$f(b_0, b_1) = SSE = \sum (Y_i - b_0 - b_1 x_i)^2$
Finding the values of variables that will maximize/minimize a function is a calculus problem. Because calculus is not a prerequisite to this course, the details are omitted, but the process results in two equations and two unknowns.
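To make "SSE is a function of two variables" concrete without calculus, the sketch below evaluates $f(b_0, b_1)$ over a coarse grid of candidate values and keeps the pair with the smallest SSE. The brute-force grid search is a stand-in chosen for illustration, not the method the slides go on to use, and the sample is made up.

```python
# Hypothetical sample, made up for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 10]

def sse(b0, b1):
    """f(b0, b1): sum of squared errors for the line Y_hat = b0 + b1 * x."""
    return sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))

# Coarse grid: b0 from -2.0 to 2.0 and b1 from 0.0 to 3.0, both in steps of 0.1.
candidates = [(i / 10, j / 10) for i in range(-20, 21) for j in range(0, 31)]
best_b0, best_b1 = min(candidates, key=lambda c: sse(*c))

print(best_b0, best_b1, sse(best_b0, best_b1))
```

A grid search only finds the best pair among the candidates it tries. The two equations obtained from calculus give the exact minimizer directly.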
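The Normal Equations
matrix form:
$\begin{bmatrix} n & \sum x \\ \sum x & \sum x^2 \end{bmatrix} \begin{bmatrix} b_0 \\ b_1 \end{bmatrix} = \begin{bmatrix} \sum y \\ \sum xy \end{bmatrix}$
or
algebraic form:
$n b_0 + b_1 \sum x = \sum y$
$b_0 \sum x + b_1 \sum x^2 = \sum xy$
There are many ways to solve a system of two equations and two unknowns. If you have a favorite, feel free to use it.
Two relationships that I expect you to know are:
$b_1 = \dfrac{\text{covariation in X \& Y}}{\text{variation in X}}$ and $b_0 = \dfrac{\sum y}{n} - b_1 \dfrac{\sum x}{n}$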
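One of the "many ways" to solve the two normal equations is Cramer's rule. The sketch below is a minimal version under stated assumptions: the function name `solve_normal_equations` is hypothetical, the inputs are integers so that exact fractions can be used, and the three sample points are made up to lie exactly on y = 1 + 2x.

```python
from fractions import Fraction

def solve_normal_equations(xs, ys):
    """Solve n*b0 + b1*Sx = Sy and b0*Sx + b1*Sxx = Sxy by Cramer's rule."""
    n = len(xs)
    sx, sy = sum(xs), sum(ys)
    sxx = sum(x * x for x in xs)
    sxy = sum(x * y for x, y in zip(xs, ys))
    det = n * sxx - sx * sx               # determinant of the coefficient matrix
    b0 = Fraction(sy * sxx - sx * sxy, det)
    b1 = Fraction(n * sxy - sx * sy, det)
    return b0, b1

# Hypothetical points lying exactly on y = 1 + 2x.
b0, b1 = solve_normal_equations([1, 2, 3], [3, 5, 7])
print(b0, b1)
```

Dividing the numerator and denominator of `b1` by n gives the "covariation over variation" form quoted above.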
Now the specifics are introduced with an example.
The Random Sample

Day    X = Number of Commercials    Y = Water Consumption (gallons)
1      11                            8,000
2       7                            5,000
3      12                            9,000
4       8                            4,000
5      10                            8,000
6      13                           10,000
7       8                            5,000
8      10                            7,000
9      14                           10,000
10      7                            4,000
Generate Graph
First, graph the data. The scatter plot of the data may indicate that a linear model is totally inappropriate and a waste of time.
The following three slides give some examples of nonlinear patterns.
Following the nonlinear examples, the graph of the data in the random sample is constructed.
Example of a Nonlinear Pattern
[Figure: scatter plot following a quadratic curve, $f(x) = ax^2 + bx + c$]
Example of a Nonlinear Pattern
[Figure: scatter plot following a cubic curve, $f(x) = ax^3 + bx^2 + cx + d$]
Example of a Nonlinear Pattern
[Figure: scatter plot following an exponential curve, $f(x) = b e^{ax}$]
Transformed to a linear pattern:
$\ln f(x) = \ln\left(b e^{ax}\right) = \ln b + ax$
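The log transformation can be tried on synthetic data. In the sketch below the constants `a_true` and `b_true` are made up, the points are generated exactly from $f(x) = b e^{ax}$, and a straight line is fitted to (x, ln y). The slope of that line should recover a and the exponential of its intercept should recover b.

```python
from math import exp, log

# Hypothetical curve f(x) = b * e^(a * x) with made-up constants.
a_true, b_true = 0.2, 3.0
xs = [1, 2, 3, 4, 5, 6]
ys = [b_true * exp(a_true * x) for x in xs]

# ln f(x) = ln b + a * x, so ln(y) against x is a straight line.
log_ys = [log(y) for y in ys]

n = len(xs)
x_bar = sum(xs) / n
ly_bar = sum(log_ys) / n
slope = (sum((x - x_bar) * (ly - ly_bar) for x, ly in zip(xs, log_ys))
         / sum((x - x_bar) ** 2 for x in xs))
intercept = ly_bar - slope * x_bar

a_hat, b_hat = slope, exp(intercept)
print(a_hat, b_hat)
```

With real, noisy data the recovery would only be approximate, and the fit minimizes squared error on the log scale rather than on the original scale.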
The Scatter Graph
[Figure: scatter plot of H2O Consumption (vertical axis) against Number of Commercials (horizontal axis), with each of the ten sample points labeled by its (X, Y) coordinates]
The Scatter Graph (with “guesstimated” line)
[Figure: the same scatter plot of H2O Consumption against Number of Commercials, with a “guesstimated” line drawn through the points]
Find the slope and the y-intercept of the line that is the “best” fit to these points.
$\hat{Y}_i = b_0 + b_1 x_i$
The Initial Calculations
Worksheet:

Day       X      X^2      XY         Y         Y^2
1         11     121       88,000     8,000     64,000,000
2          7      49       35,000     5,000     25,000,000
3         12     144      108,000     9,000     81,000,000
4          8      64       32,000     4,000     16,000,000
5         10     100       80,000     8,000     64,000,000
6         13     169      130,000    10,000    100,000,000
7          8      64       40,000     5,000     25,000,000
8         10     100       70,000     7,000     49,000,000
9         14     196      140,000    10,000    100,000,000
10         7      49       28,000     4,000     16,000,000
TOTALS   100   1,056      751,000    70,000    540,000,000
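The worksheet totals can be reproduced in a few lines. The sketch below uses the ten sample observations; the variable names are chosen for this example only.

```python
# The random sample: commercials aired (X) and water consumption in gallons (Y).
commercials = [11, 7, 12, 8, 10, 13, 8, 10, 14, 7]
gallons = [8000, 5000, 9000, 4000, 8000, 10000, 5000, 7000, 10000, 4000]

n = len(commercials)
sum_x = sum(commercials)
sum_x2 = sum(x * x for x in commercials)
sum_xy = sum(x * y for x, y in zip(commercials, gallons))
sum_y = sum(gallons)
sum_y2 = sum(y * y for y in gallons)

print(n, sum_x, sum_x2, sum_xy, sum_y, sum_y2)
```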
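Some Basic Formulas

Quantity, Definitional Formula, Computational Formula:
$SSX = \sum (X_i - \bar{X})^2 = \sum X^2 - \dfrac{(\sum X)^2}{n}$
$SSY = SST = \sum (Y_i - \bar{Y})^2 = \sum Y^2 - \dfrac{(\sum Y)^2}{n}$
$\text{cov}(X \& Y) = \sum (X_i - \bar{X})(Y_i - \bar{Y}) = \sum XY - \dfrac{(\sum X)(\sum Y)}{n}$
$SSE = \sum (Y_i - \hat{Y}_i)^2 = SST - SSR$
$SSR = \sum (\hat{Y}_i - \bar{Y})^2 = \dfrac{[\text{cov}(X \& Y)]^2}{SSX}$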
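X = Number of Commercials; Y = Water Consumption (gallons)
$\sum X = 100$, $\sum X^2 = 1{,}056$, $\sum XY = 751{,}000$, $\sum Y = 70{,}000$, $\sum Y^2 = 540{,}000{,}000$, $n = 10$
Sample regression model: $\hat{Y}_i = b_0 + b_1 x_i$, the predicted value.
slope $= b_1 = \dfrac{\text{covariation in X and Y}}{\text{variation in X}} = \dfrac{\text{cov}(X \& Y)}{SSX} = \dfrac{751{,}000 - \frac{(100)(70{,}000)}{10}}{1{,}056 - \frac{(100)^2}{10}} = \dfrac{51{,}000}{56} = 910.714$
Y-intercept $= b_0 = \bar{Y} - b_1 \bar{X} = \dfrac{\sum Y}{n} - b_1 \dfrac{\sum X}{n} = \dfrac{70{,}000}{10} - 910.714 \left(\dfrac{100}{10}\right) = -2{,}107.14$
$\hat{Y}_i = -2{,}107.14 + 910.714\, x_i$, the predicted value.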
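The slope and intercept arithmetic can be checked directly from the worksheet totals. This is a minimal sketch using the computational formulas; the variable names are illustrative.

```python
# Worksheet totals for the commercials / water consumption sample.
n, sum_x, sum_x2, sum_xy, sum_y = 10, 100, 1056, 751000, 70000

cov_xy = sum_xy - sum_x * sum_y / n   # covariation in X and Y
ssx = sum_x2 - sum_x ** 2 / n         # variation in X

b1 = cov_xy / ssx                     # slope
b0 = sum_y / n - b1 * sum_x / n       # y-intercept

print(round(b1, 3), round(b0, 2))
```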
Interpretation of the Slope and the Y-intercept
$\hat{Y}_i = -2{,}107.14 + 910.714\, x_i$, the predicted value.
X = Number of Commercials; Y = Water Consumption (gallons)
Interpret the slope. (What does the slope mean in terms of the problem?)
For each additional commercial, we expect the water consumption to increase by 910.714 gallons.
Interpret the y-intercept. (What does the y-intercept mean in terms of the problem?)
If there are no commercials, we expect the water consumption to be a negative 2,107.14 gallons.
?????????? Think about it. ??????????
[Figure: “Welcome to Mulvany, Tennessee” clip-art diagram labeled City, Water Plant, Reservoir, Sensor line, and River]
If the water consumption is a negative 2,107.14 gallons, which way is the water flowing in the pipe from the reservoir to the city?
We know that the water does not flow back into the reservoir.
Does this result mean that the regression model is worthless?
Interpolation versus Extrapolation

Day    X = Number of Commercials    Y = Water Consumption (gallons)
1      11                            8,000
2       7                            5,000    (smallest X)
3      12                            9,000
4       8                            4,000
5      10                            8,000
6      13                           10,000
7       8                            5,000
8      10                            7,000
9      14                           10,000    (largest X)
10      7                            4,000    (smallest X)
Interpolation versus Extrapolation
[Figure: scatter plot of H2O Consumption against Number of Commercials with the line $\hat{Y}_i = b_0 + b_1 x_i$. The relevant range runs from the smallest X (7), at the points (7, 4000) and (7, 5000), to the largest X (14), at the point (14, 10000). Inside the relevant range is interpolation; on either side of it is extrapolation.]
Between the smallest (7) and the largest (14) values of X used to compute the sample regression model, we may interpolate with statistical significance.
To determine if the model has statistical significance, we still have to perform some more calculations.
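One way to keep the relevant range in view when predicting is to label every prediction as interpolation or extrapolation. The helper `predict` below is hypothetical and simply applies the fitted line with the rounded coefficients from the slides.

```python
# Fitted coefficients and the relevant range of X from the sample.
B0, B1 = -2107.14, 910.714
X_MIN, X_MAX = 7, 14

def predict(x):
    """Return (predicted gallons, 'interpolation' or 'extrapolation')."""
    kind = "interpolation" if X_MIN <= x <= X_MAX else "extrapolation"
    return B0 + B1 * x, kind

print(predict(10))   # inside the relevant range
print(predict(0))    # outside the relevant range: the negative-gallons case
```

The negative prediction at zero commercials is an extrapolation, which is why it need not discredit the model inside the relevant range.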
Calculation of SSE by Definition
First, you insert the $X_i$ values into the sample regression equation, $\hat{Y}_i = -2{,}107.14 + 910.714\, X_i$, to calculate the predicted values.
Second, you calculate the deviations of the points from the line.
Finally, you calculate the squares of the deviations of the points from the line and sum them to obtain SSE.

i      X_i    Y_i       Predicted     Deviation       Squared deviation
1      11      8,000     7,910.714        89.286           7,971.990
2       7      5,000     4,267.858       732.142         536,031.908
3      12      9,000     8,821.428       178.572          31,887.959
4       8      4,000     5,178.572    -1,178.572       1,389,031.959
5      10      8,000     7,000.000     1,000.000       1,000,000.000
6      13     10,000     9,732.142       267.858          71,747.908
7       8      5,000     5,178.572      -178.572          31,887.959
8      10      7,000     7,000.000         0.000               0.000
9      14     10,000    10,642.856      -642.856         413,263.837
10      7      4,000     4,267.858      -267.858          71,747.908
Sum                                        0.000    SSE = 3,553,571.429

Here Predicted is $\hat{Y}_i$, Deviation is $Y_i - \hat{Y}_i$, and Squared deviation is $(Y_i - \hat{Y}_i)^2$.
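The three steps translate directly into code. The sketch below uses unrounded coefficients (b1 = 51,000/56), so individual predicted values can differ from the slide's table in the last decimal place, while the total agrees to the precision shown.

```python
commercials = [11, 7, 12, 8, 10, 13, 8, 10, 14, 7]
gallons = [8000, 5000, 9000, 4000, 8000, 10000, 5000, 7000, 10000, 4000]

b1 = 51000 / 56          # slope, unrounded
b0 = 7000 - b1 * 10      # y-intercept: Y-bar minus b1 times X-bar

# Step 1: predicted values from the sample regression equation.
predicted = [b0 + b1 * x for x in commercials]
# Step 2: deviations of the points from the line.
deviations = [y - y_hat for y, y_hat in zip(gallons, predicted)]
# Step 3: square the deviations and sum them.
sse = sum(d ** 2 for d in deviations)

print(round(sum(deviations), 6), round(sse, 3))
```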
Calculation of SSE by “Backing” into It
$SST = SSR + SSE$
where SSR = variation explained by regression and SSE = variation not explained by regression.
Therefore,
$SSE = SST - SSR$.
However,
$SST = SSY = \sum Y^2 - \dfrac{(\sum Y)^2}{n} = 540{,}000{,}000 - \dfrac{(70{,}000)^2}{10} = 50{,}000{,}000$
and
$SSR = \dfrac{[\text{cov}(X \& Y)]^2}{SSX} = \dfrac{(51{,}000)^2}{56} = 46{,}446{,}428.5714$.
Hence,
$SSE = SST - SSR = 50{,}000{,}000 - 46{,}446{,}428.5714 = 3{,}553{,}571.4286$
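The "backing in" route needs only the worksheet totals. A minimal sketch of the same arithmetic, with illustrative variable names:

```python
# Worksheet totals and the quantities already computed for the slope.
n, sum_y, sum_y2 = 10, 70000, 540000000
cov_xy, ssx = 51000, 56

sst = sum_y2 - sum_y ** 2 / n   # total variation in Y
ssr = cov_xy ** 2 / ssx         # variation explained by regression
sse = sst - ssr                 # variation not explained by regression

print(sst, round(ssr, 4), round(sse, 4))
```

This avoids computing ten predicted values and ten squared deviations, which is why it is the quicker route by hand.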
Calculation of the Standard Error of the Estimate
error variance $= S_e^2 = MSE = \dfrac{SSE}{n-2} = \dfrac{3{,}553{,}571.4286}{8} = 444{,}196.42857$
standard error of the estimate $= S_e = \sqrt{S_e^2} = \sqrt{444{,}196.42857} = 666.48$ gallons
Interpretation: The “typical” error made when predicting the number of gallons of water consumed based on the number of commercials is about 666.48 gallons.
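The same two steps in code, as a small check of the arithmetic (variable names are illustrative):

```python
from math import sqrt

sse = 3553571.4286   # sum of squared errors from the previous step
n = 10               # sample size

mse = sse / (n - 2)  # error variance, S_e squared
s_e = sqrt(mse)      # standard error of the estimate, in gallons

print(round(mse, 5), round(s_e, 2))
```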
The Question
At the .05 level of significance, is there evidence that a linear relationship exists between the number of commercials and water consumption?
We have almost enough calculated to be able to answer the question.
Just one more...
Calculation of the Standard Error of the Slope
standard error of the slope $= S_{b_1}$
(also called the standard error of the regression coefficient, $b_1$)
$S_{b_1} = \dfrac{S_e}{\sqrt{SSX}} = \dfrac{666.48}{\sqrt{56}} = 89.062$
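A one-line check of this division, using the rounded standard error of the estimate from the slides:

```python
from math import sqrt

s_e = 666.48   # standard error of the estimate (gallons)
ssx = 56       # variation in X

s_b1 = s_e / sqrt(ssx)   # standard error of the slope
print(round(s_b1, 3))
```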
Test Statistics for Regression
$t_{n-2} = \dfrac{b_1 - \beta_1}{S_{b_1}}$
or
$F_{1,\,n-2} = \dfrac{S_R^2}{S_e^2} = \dfrac{MSR}{MSE}$
Now, what is $\beta_1$?
Well, that’s another story.
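The slides leave the value of $\beta_1$ for later. Purely as an illustration, the sketch below assumes the hypothesized value $\beta_1 = 0$ (an assumption of this example, not something stated above) and takes MSR = SSR / 1, since simple regression has one regression degree of freedom. Under that assumption the two statistics are linked: the square of t equals F.

```python
from math import sqrt

n = 10
cov_xy, ssx, sst = 51000, 56, 50000000

b1 = cov_xy / ssx
ssr = cov_xy ** 2 / ssx
sse = sst - ssr

msr = ssr / 1            # one regression degree of freedom
mse = sse / (n - 2)      # error variance
s_b1 = sqrt(mse / ssx)   # standard error of the slope

beta1_h0 = 0.0           # ASSUMPTION: hypothesized slope, not given in the slides
t_stat = (b1 - beta1_h0) / s_b1
f_stat = msr / mse

print(round(t_stat, 2), round(f_stat, 2))
```

Comparing either statistic with a critical value at the .05 level requires a t or F table, which is outside this sketch.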
The End