Post on 21-Dec-2015
Simple Linear Regression
Chapter 16
2
Introduction
• In this chapter we examine the relationship among interval variables via a mathematical equation.
• The motivation for using the technique:
– Forecast the value of a dependent variable (y) from the values of independent variables (x1, x2, …, xk).
– Analyze the specific relationships between the independent variables and the dependent variable.
3
16.1 Simple Linear Regression Model
The model has a deterministic component and a probabilistic component.
4
The deterministic part of the model
Most lots sell for $25,000.
Building a house costs about $75 per square foot.
House cost = 25000 + 75(Size)
[Figure: house cost plotted against house size]
5
House cost = 25000 + 75(Size)
However, house cost may vary even among houses of the same size!
The Model
6
The Model
House cost = 25000 + 75(Size)
[Figure: house cost plotted against house size]
Since cost behaves unpredictably, we add a random component.
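As a minimal sketch of the two components, the Python snippet below adds a random term to the deterministic cost line. The function names, the noise standard deviation of $5,000, and the seed are illustrative assumptions, not values from the slides.

```python
import random

LOT_PRICE = 25_000      # from the slide: most lots sell for $25,000
COST_PER_SQFT = 75      # from the slide: about $75 per square foot


def deterministic_cost(size_sqft):
    """Deterministic part of the model: 25000 + 75 * size."""
    return LOT_PRICE + COST_PER_SQFT * size_sqft


def simulated_cost(size_sqft, rng, noise_sd=5_000):
    """Deterministic part plus a random error (noise_sd is an assumed value)."""
    return deterministic_cost(size_sqft) + rng.gauss(0, noise_sd)


rng = random.Random(0)  # fixed seed so the run is repeatable
same_size_costs = [simulated_cost(2_000, rng) for _ in range(5)]

print(deterministic_cost(2_000))  # value on the line for a 2,000 sq ft house
print(same_size_costs)            # same size, different simulated costs
```

Houses of the same size share one value on the line, but each simulated cost differs because of the random term.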
7
The Model
• The simple first-order (linear) regression model is:
$$y = \beta_0 + \beta_1 x + \varepsilon$$
y = dependent variable
x = independent variable
$\beta_0$ = y-intercept
$\beta_1$ = slope of the line (Rise/Run)
$\varepsilon$ = error variable
$\beta_0$ and $\beta_1$ are unknown population parameters and are therefore estimated from the data.
8
16.2 The Least Squares Method
• The estimates are determined by
– drawing a sample from the population of interest,
– calculating sample statistics,
– producing a straight line that cuts through the data.
Question: What is the best line to describe the specific linear relationship?
[Figure: scatter plot of y against x]
9
The Least Squares (Regression) Line
A good line is one that minimizes the sum of squared errors.
[Figure: at a given x, the error is the vertical difference between the actual value of y and the value of y given by the equation]
10
The Least Squares (Regression) Line
Here is a short comparison of two possible lines drawn over 4 data points, (1, 2), (2, 4), (3, 1.5), and (4, 3.2):
1. Horizontal line (y = 2.5)
2. Positively increasing line
Which line is better?
[Figure: the four data points with both candidate lines]
11
The Least Squares (Regression) Line
[Figure: the data points (1, 2), (2, 4), (3, 1.5), (4, 3.2) with the vertical differences to each line]
Increasing line: sum of squared differences = $(2 - 1)^2 + (4 - 2)^2 + (1.5 - 3)^2 + (3.2 - 4)^2 = 7.89$
Horizontal line (y = 2.5): sum of squared differences = $(2 - 2.5)^2 + (4 - 2.5)^2 + (1.5 - 2.5)^2 + (3.2 - 2.5)^2 = 3.99$
The smaller the sum of squared differences, the better the fit of the line to the data.
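The comparison can be checked with a few lines of Python. This is a sketch only: the increasing line is taken to be y = x, which is inferred from the line values 1, 2, 3, 4 used in the sum, and the function names are made up for illustration.

```python
points = [(1, 2), (2, 4), (3, 1.5), (4, 3.2)]


def sum_squared_differences(points, line):
    """Sum of squared vertical differences between the points and a line."""
    return sum((y - line(x)) ** 2 for x, y in points)


def increasing_line(x):
    return x      # assumed form: passes through (1, 1), (2, 2), (3, 3), (4, 4)


def horizontal_line(x):
    return 2.5


sse_increasing = sum_squared_differences(points, increasing_line)
sse_horizontal = sum_squared_differences(points, horizontal_line)
print(round(sse_increasing, 2), round(sse_horizontal, 2))
```

The horizontal line gives the smaller sum, so of these two candidates it fits the four points better.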
12
The Estimated Coefficients
To calculate the estimates of $\beta_0$ and $\beta_1$ that minimize the sum of squared differences between the data points and the line, use the formulas shown below (alternative formulae are suggested later):
$$b_1 = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{\sum (x_i - \bar{x})^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
The regression equation that estimates the equation of the first-order linear model is:
$$\hat{y} = b_0 + b_1 x$$
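A minimal Python sketch of these two formulas, applied to the four data points from the line-comparison slides. The function name `least_squares` is illustrative and does not come from the slides.

```python
def least_squares(xs, ys):
    """Return (b0, b1) for the least squares line y_hat = b0 + b1 * x."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    # Numerator and denominator of the formula for b1
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar
    return b0, b1


xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
b0, b1 = least_squares(xs, ys)
print(round(b0, 3), round(b1, 3))
```

For these four points the fitted line is nearly flat, which is consistent with the horizontal line beating the steeper candidate in the earlier comparison.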
13
The Simple Linear Regression Line
• Example 2
– A car dealer wants to find the relationship between the odometer reading and the selling price of 3-year-old Tauruses.
– A random sample of 100 cars is selected, and the data recorded.
– Find the regression line.

Car   Price   Odometer
1     14.6    37.4
2     14.1    44.8
3     14.0    45.8
4     15.6    30.9
5     15.6    31.7
6     14.7    34.0
.     .       .
.     .       .

Price is the dependent variable (y); Odometer is the independent variable (x).
14
The Simple Linear Regression Line
• In order to more easily use Excel results, we'll use another version of the formula for b1:
$$b_1 = \frac{\text{cov}(x,y)}{s_x^2}, \qquad b_0 = \bar{y} - b_1 \bar{x}$$
cov(x, y), the covariance of x and y, is a measure of the common 'movement' of x and y (do both generally increase together, or do they move in opposite directions?).
cov(x, y) is also denoted by Sxy.
15
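The Simple Linear Regression Line
• Solution
– Solving by hand: calculate the statistics below (see data in Car Price), where n = 100.
$$\bar{x} = 36.011; \quad \bar{y} = 14.841$$
$$s_x^2 = \frac{\sum (x_i - \bar{x})^2}{n-1} = 43.509; \quad \text{cov}(x,y) = \frac{\sum (x_i - \bar{x})(y_i - \bar{y})}{n-1} = -2.909$$
Excel provides a population covariance of -2.879. The sample covariance is -2.879 n/(n-1) = -2.879(100/99) ≈ -2.909 (computed from the unrounded Excel value).
$$b_1 = \frac{\text{cov}(x,y)}{s_x^2} = \frac{-2.909}{43.509} = -.0669$$
$$b_0 = \bar{y} - b_1 \bar{x} = 14.841 - (-.0669)(36.011) = 17.250$$
With the unrounded slope, -.066861, the intercept is 17.248.
$$\hat{y} = b_0 + b_1 x = 17.248 - .0669x$$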
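The hand calculation can be reproduced from the four summary statistics alone. The sketch below uses the values quoted on the slide; the variable names are illustrative.

```python
# Summary statistics for Example 2, as quoted on the slide
x_bar = 36.011    # mean odometer reading
y_bar = 14.841    # mean price
var_x = 43.509    # sample variance of the odometer readings
cov_xy = -2.909   # sample covariance of odometer and price

b1 = cov_xy / var_x          # slope
b0 = y_bar - b1 * x_bar      # intercept, using the unrounded slope

print(round(b1, 4), round(b0, 3))
```

Keeping the slope unrounded until the end is what brings the intercept into line with the value Excel reports.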
16
The Simple Linear Regression Line
• Solution – continued
– Using the computer (data in Car Price):
Tools > Data Analysis > Regression > [Shade the y range and the x range] > OK
17
The Simple Linear Regression Line

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.805168
R Square             0.648295
Adjusted R Square    0.644707
Standard Error       0.326489
Observations         100

ANOVA
             df   SS          MS          F             Significance F
Regression   1    19.255607   19.255607   180.6429887   5.75078E-24
Residual     98   10.446293   0.1065948
Total        99   29.7019

             Coefficients   Standard Error   t Stat      P-value       Lower 95%      Upper 95%
Intercept    17.24873       0.1820926        94.725045   3.57186E-98   16.88737056    17.61008
Odometer     -0.066861      0.0049746        -13.44035   5.75078E-24   -0.076732894   -0.056989

$$\hat{y} = 17.248 - 0.0669x$$
18
Interpreting the Linear Regression Equation
[Figure: Odometer Line Fit Plot, Price against Odometer, with the fitted line $\hat{y} = 17{,}248 - .0669x$; there are no data near an odometer reading of 0]
The intercept is b0 = $17,248.
• The regression equation describes the linear relationship within the range covered by the sample only.
• Thus, do not interpret the intercept as the "price of cars that have not been driven".
The slope of the line is b1 = -.0669. For each additional mile on the odometer, the price decreases by an average of $0.0669.
19
Interpreting the Linear Regression Equation
• Remember: the regression equation pertains to the sample only!
• To generalize the results to the population, we next apply statistical inference techniques.
20
16.3 Model Assumptions
• The error $\varepsilon$ is a critical part of the regression model.
• Four requirements involving $\varepsilon$ must be satisfied.
– The probability distribution of $\varepsilon$ is normal.
– The mean of $\varepsilon$ is zero: $E(\varepsilon) = 0$.
– The standard deviation of $\varepsilon$ is $\sigma_\varepsilon$ for all values of x.
– The errors associated with different values of y are all independent.
21
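The Normality of $\varepsilon$
Recall: $y = \beta_0 + \beta_1 x + \varepsilon$. Since $\beta_0 + \beta_1 x$ is deterministic and $\varepsilon$ is normally distributed, y is also normally distributed.
The mean value of y for a given value of x is $E(y|x) = \beta_0 + \beta_1 x + E(\varepsilon) = \beta_0 + \beta_1 x$, since $E(\varepsilon) = 0$.
The standard deviation of y is $\sigma_\varepsilon$ for all values of x.
The standard deviation remains constant, but the mean value changes with x.
[Figure: normal distributions of y at x1, x2, x3, centered at $E(y|x_1) = \beta_0 + \beta_1 x_1$, $E(y|x_2) = \beta_0 + \beta_1 x_2$, and $E(y|x_3) = \beta_0 + \beta_1 x_3$]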
22
The independence of the errors
Here is a case where linear regression is not the right model to apply.
[Figure: scatter plot of data points around a fitted straight line]
Notice that for small values of y and for large values of y the errors are mostly negative, while for midrange values of y the errors are positive. The errors are not independent. Consequently, linear regression is not the correct model to work with here. One can also question the assumption that the mean error is zero.
23
16.4 Assessing the Model
• The least squares method produces a regression line whether or not there is a linear relationship between x and y.
• Consequently, it is important to assess how well the linear model fits the data (how strong the linear relationship is).
• Several methods are used to assess the model. All are based on the sum of squares for errors, SSE.
24
Sum of Squares for Errors
• SSE is the sum of the squared vertical differences between the points and the regression line. It is the function we minimized when constructing the regression equation.
• It can serve as a measure of how well the line fits the data. SSE is defined by
$$SSE = \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$$
• A shortcut formula:
$$SSE = (n-1)\left[s_Y^2 - \frac{[\text{cov}(X,Y)]^2}{s_x^2}\right]$$
[Figure: the vertical difference between a point $(x_i, y_i)$ and the fitted value $\hat{y}_i$]
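As a quick check that the definition and the shortcut formula agree, the sketch below evaluates both on the four data points from the line-comparison slides. The variable names are illustrative.

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Sample variances and sample covariance (divisor n - 1)
var_x = sum((x - x_bar) ** 2 for x in xs) / (n - 1)
var_y = sum((y - y_bar) ** 2 for y in ys) / (n - 1)
cov_xy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys)) / (n - 1)

# Least squares coefficients
b1 = cov_xy / var_x
b0 = y_bar - b1 * x_bar

# SSE from the definition and from the shortcut formula
sse_direct = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs, ys))
sse_shortcut = (n - 1) * (var_y - cov_xy ** 2 / var_x)
print(round(sse_direct, 4), round(sse_shortcut, 4))
```

The shortcut needs only the two variances and the covariance, which is why it is convenient when those come from a spreadsheet's descriptive statistics.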
25
Standard Error of Estimate
If $\sigma_\varepsilon$ is small, the errors tend to be close to their mean. Since the mean error is equal to zero, the model fits the data well when $\sigma_\varepsilon$ is close to zero.
Therefore, we can use $\sigma_\varepsilon$ as a measure of the suitability of using a linear model.
To do this we need to estimate $\sigma_\varepsilon$. An unbiased estimator of $\sigma_\varepsilon^2$ is given by $s_\varepsilon^2$:
$$s_\varepsilon^2 = \frac{SSE}{n-2}$$
The standard error of estimate is defined by
$$s_\varepsilon = \sqrt{\frac{SSE}{n-2}}$$
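For Example 2, the standard error of estimate can be recomputed from the residual sum of squares in the Excel ANOVA table. This is a small sketch; the variable names are illustrative.

```python
import math

sse = 10.446293   # Residual SS from the Excel ANOVA table for Example 2
n = 100           # number of observations

s_eps_squared = sse / (n - 2)        # unbiased estimate of the error variance
s_eps = math.sqrt(s_eps_squared)     # standard error of estimate

print(round(s_eps_squared, 7), round(s_eps, 6))
```

The two results correspond to the Residual MS and the Standard Error entries of the regression output.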
26
Testing the Slope
The slope is not equal to zero: linear relationship. Different inputs (x) yield different outputs (y).
The slope is equal to zero: no linear relationship. Different inputs (x) yield the same output (y).
27
Testing the Slope
• We can draw inference about $\beta_1$ from $b_1$ by testing
H0: $\beta_1 = 0$
H1: $\beta_1 \neq 0$ (or < 0, or > 0)
– The test statistic is
$$t = \frac{b_1 - \beta_1}{s_{b_1}}$$
where $s_{b_1}$, the standard error of $b_1$, is
$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}}$$
– If the error variable is normally distributed, the statistic has a Student t distribution with d.f. = n-2.
28
Testing the Slope, Example
• Example 4
– Test to determine whether there is enough evidence to infer that there is a linear relationship between the car auction price and the odometer reading for all three-year-old Tauruses in Example 2. Use $\alpha$ = 5%.
– Solution. The alternative hypothesis here (H1) is of the form 'not equal to zero', because we try to show that there is a linear relationship, which can be established by either a positive or a negative value of $\beta_1$.
29
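Testing the Slope, Example
• Solving by hand
– To compute t we need the values of $b_1$ and $s_{b_1}$ (sample statistics obtained from Descriptive Statistics in Excel).
$$b_1 = -.0669$$
$$s_{b_1} = \frac{s_\varepsilon}{\sqrt{(n-1)s_x^2}} = \frac{.3265}{\sqrt{(99)(43.509)}} = .00497$$
$$t = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{-.0669 - 0}{.00497} \approx -13.44$$
– For the two-tail rejection region, $t > t_{.025}$ or $t < -t_{.025}$ with $\nu$ = n-2 = 98. From the t-table, $t_{.025}$ = 1.984 approximately, so t = -13.44 falls in the rejection region.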
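The hand calculation can be sketched in Python as follows. The critical value 1.984 is taken from the slide rather than computed, and the variable names are illustrative.

```python
import math

n = 100
b1 = -0.0669        # estimated slope
s_eps = 0.3265      # standard error of estimate
var_x = 43.509      # sample variance of the odometer readings
t_critical = 1.984  # t_.025 with 98 degrees of freedom, from the t-table

s_b1 = s_eps / math.sqrt((n - 1) * var_x)   # standard error of b1
t_stat = (b1 - 0) / s_b1                    # beta_1 = 0 under H0
reject_h0 = abs(t_stat) > t_critical        # two-tail test

print(round(s_b1, 5), round(t_stat, 2), reject_h0)
```

Because the inputs are rounded, the statistic differs slightly in the second decimal from the t Stat that Excel reports, but the conclusion is the same.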
30
Testing the Slope, Example
• Using the computer

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.805168
R Square             0.648295
Adjusted R Square    0.644707
Standard Error       0.326489
Observations         100

ANOVA
             df   SS           MS           F             Significance F
Regression   1    19.2556074   19.2556074   180.6429887   5.75078E-24
Residual     98   10.4462926   0.10659482
Total        99   29.7019

             Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept    17.24873       0.18209257       94.7250453    3.57186E-98   16.88737056    17.61008
Odometer     -0.066861      0.00497464       -13.4403493   5.75078E-24   -0.076732894   -0.056989

There is overwhelming evidence to infer that the odometer reading affects the auction selling price.
31
Coefficient of Determination $R^2$
– To measure the strength of the linear relationship we use the coefficient of determination.
[Figure: a case where the line explains all the variation among the different y values]
[Figure: a case where the line explains only some of the variation among the y values (there are errors)]
32
Coefficient of determination
• The mathematical formulation of $R^2$ is based on the following property (stated without proof):
Overall variation in y = the variation of the regression model around the mean + the variation of the actual points around the line.
That is: SST = SSR + SSE
Observe next
33
Coefficient of Determination $R^2$
Coefficient of determination = [Variation in y explained by the linear relationship] / [Total variation in y]
$$R^2 = \frac{SSR}{SST}, \quad \text{where } SST = \sum (y_i - \bar{y})^2 \text{ and } SSR = \sum (\hat{y}_i - \bar{y})^2$$
34
Coefficient of determination - insight
SST = SSR + SSE
• When the relationship between x and y is perfectly linear, there is no deviation of the actual points from the line, so SSE = 0. Therefore, SSR = SST and $R^2$ = SSR/SST = 1.
• When the relationship between x and y is not perfectly linear, there are deviations of the actual points from the regression line, so SSE > 0. Therefore, SSR < SST and $R^2$ = SSR/SST < 1.
• When no linear relationship exists between x and y, none of the total variation among the actual points is explained by a linear relationship, so SST = SSE. Therefore, $R^2$ = 0.
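The decomposition SST = SSR + SSE can be verified numerically. The sketch below fits the least squares line to the four data points from the line-comparison slides and computes the three sums; the variable names are illustrative.

```python
xs = [1, 2, 3, 4]
ys = [2, 4, 1.5, 3.2]
n = len(xs)
x_bar, y_bar = sum(xs) / n, sum(ys) / n

# Least squares coefficients
b1 = (sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
      / sum((x - x_bar) ** 2 for x in xs))
b0 = y_bar - b1 * x_bar
y_hat = [b0 + b1 * x for x in xs]

sst = sum((y - y_bar) ** 2 for y in ys)               # total variation
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)          # explained variation
sse = sum((y - yh) ** 2 for y, yh in zip(ys, y_hat))  # unexplained variation

r_squared = ssr / sst
print(round(sst, 4), round(ssr, 4), round(sse, 4), round(r_squared, 4))
```

For these points almost all of the variation is left unexplained, so the coefficient of determination is close to zero.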
35
Coefficient of Determination, Example
• Example 3
– Find the coefficient of determination for Example 2; what does this statistic tell you about the model?
• Solution
– Solving by hand: we use an alternative form of the $R^2$ formula that makes it easier to use Excel.
$$R^2 = \frac{[\text{cov}(X,Y)]^2}{s_x^2 s_y^2} = \frac{[-2.909]^2}{(43.509)(.3000)} = .6483$$
36
Coefficient of Determination, Example
– Using the computer: from the regression output we have

SUMMARY OUTPUT

Regression Statistics
Multiple R           0.805168
R Square             0.648295
Adjusted R Square    0.644707
Standard Error       0.326489
Observations         100

ANOVA
             df   SS           MS           F             Significance F
Regression   1    19.2556074   19.2556074   180.6429887   5.75078E-24
Residual     98   10.4462926   0.10659482
Total        99   29.7019

             Coefficients   Standard Error   t Stat        P-value       Lower 95%      Upper 95%
Intercept    17.24873       0.18209257       94.7250453    3.57186E-98   16.88737056    17.61008
Odometer     -0.066861      0.00497464       -13.4403493   5.75078E-24   -0.076732894   -0.056989

65% of the variation in the auction selling price is explained by the variation in odometer reading. The rest (35%) remains unexplained by this model.
37
Coefficient of Correlation
• The coefficient of correlation is used to measure the strength of association between two variables.
• The coefficient values range between -1 and 1.
– If r = -1 (negative association) or r = +1 (positive association), every point falls on the regression line.
– If r = 0 there is no linear pattern.
• The coefficient can be used to test for a linear relationship between two variables.
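As an illustration, the coefficient of correlation for Example 2 can be computed from the summary statistics quoted in the earlier slides. The formula r = cov(x, y) / (s_x s_y) is the standard definition and is an addition here; it is not written out on the slide.

```python
import math

cov_xy = -2.909   # sample covariance of odometer and price
var_x = 43.509    # sample variance of the odometer readings
var_y = 0.3000    # sample variance of the prices

# Standard definition of the sample coefficient of correlation
r = cov_xy / math.sqrt(var_x * var_y)

print(round(r, 4), round(r ** 2, 4))
```

The sign of r matches the negative slope, and its square agrees with the coefficient of determination found for the same data.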
38
16.6 Using the Regression Equation
• Before using the regression model, we need to assess how well it fits the data.
• If we are satisfied with how well the model fits the data, we can use it to predict the values of y.
• To make a prediction we use
– Point prediction, and
– Interval prediction
39
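Point Prediction
• Example 5
– Predict the selling price of a three-year-old Taurus with 40,000 miles on the odometer (Example 2).
A point prediction, using the coefficients from the regression output for Example 2, with x in thousands of miles and $\hat{y}$ in thousands of dollars:
$$\hat{y} = 17.24873 - .066861x = 17.24873 - .066861(40) = 14.574$$
– It is predicted that a car with 40,000 miles would sell for about $14,574.
– How close is this prediction to the real price?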
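A point prediction is a single evaluation of the fitted line. The sketch below uses the coefficients from the Excel regression output for Example 2 and assumes, based on the magnitudes in the data table, that price is recorded in thousands of dollars and odometer readings in thousands of miles. The function name is illustrative.

```python
B0 = 17.24873    # intercept from the Excel output (assumed unit: $1,000s)
B1 = -0.066861   # slope from the Excel output (assumed unit: per 1,000 miles)


def predict_price(odometer_thousands):
    """Point prediction y_hat = b0 + b1 * x for a given odometer reading."""
    return B0 + B1 * odometer_thousands


point_prediction = predict_price(40)   # a car with 40,000 miles
print(round(point_prediction, 3))
```

The result is the center of the interval estimates discussed next; on its own it says nothing about how far the real price may lie from it.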
40
Interval Estimates
• Two intervals can be used to discover how closely the predicted value will match the true value of y.
– Prediction interval: predicts y for a given value of x.
– Confidence interval: estimates the average y for a given x.
– The prediction interval, where $x_g$ is the given value of x and $t_{\alpha/2}$ has n-2 degrees of freedom:
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
– The confidence interval:
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
41
Interval Estimates, Example
• Example 5 - continued
– Provide an interval estimate for the bidding price on a Ford Taurus with 40,000 miles on the odometer.
– Two types of predictions are required:
• A prediction for a specific car
• A prediction for the mean price per car
42
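Interval Estimates, Example
• Solution
– A prediction interval provides the price estimate for a single car:
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{1 + \frac{1}{n} + \frac{(x_g - \bar{x})^2}{(n-1)s_x^2}}$$
$$[17.250 - .0669(40)] \pm 1.984(.3265)\sqrt{1 + \frac{1}{100} + \frac{(40 - 36.011)^2}{(100-1)(43.509)}} = 14.574 \pm .652$$
Here $b_0$ = 17.250, $b_1$ = -.0669, $x_g$ = 40, $\bar{x}$ = 36.011, $s_\varepsilon$ = .3265, $s_x^2$ = 43.509, and $t_{.025,98}$ = 1.984 approximately.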
43
Interval Estimates, Example
• Solution – continued
– A confidence interval provides the estimate of the mean price per car for a Ford Taurus with a 40,000-mile reading on the odometer.
• The confidence interval (95%) =
$$\hat{y} \pm t_{\alpha/2}\, s_\varepsilon \sqrt{\frac{1}{n} + \frac{(x_g - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$
$$14.574 \pm 1.984(.3265)\sqrt{\frac{1}{100} + \frac{(40 - 36.011)^2}{(100-1)(43.509)}} = 14.574 \pm .076$$
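Both intervals for Example 5 can be recomputed with the sketch below. It uses the summary statistics and the approximate critical value 1.984 quoted on the slides; the variable names are illustrative.

```python
import math

n = 100
x_bar = 36.011     # mean odometer reading
var_x = 43.509     # sample variance of the odometer readings
s_eps = 0.3265     # standard error of estimate
t_crit = 1.984     # t_.025 with 98 degrees of freedom, from the t-table
y_hat = 14.574     # point prediction at x_g = 40
x_g = 40

# Term shared by both intervals: 1/n + (x_g - x_bar)^2 / ((n - 1) * s_x^2)
shared = 1 / n + (x_g - x_bar) ** 2 / ((n - 1) * var_x)

pi_half_width = t_crit * s_eps * math.sqrt(1 + shared)  # single car
ci_half_width = t_crit * s_eps * math.sqrt(shared)      # mean price

prediction_interval = (y_hat - pi_half_width, y_hat + pi_half_width)
confidence_interval = (y_hat - ci_half_width, y_hat + ci_half_width)
print(round(pi_half_width, 3), round(ci_half_width, 3))
```

The prediction interval is much wider because it also covers the variation of an individual car around the mean price, which is the extra 1 under the square root.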