10 - regression 1
TRANSCRIPT
-
7/29/2019 10 - Regression 1
1/58
Simple Linear Regression
Simple Linear Regression Model
Least Squares Method
Coefficient of Determination
Model Assumptions
Testing for Significance
Using the Estimated Regression Equation
for Estimation and Prediction
Residual Analysis: Validating Model Assumptions
Outliers and Influential Observations
Simple Linear Regression
Managerial decisions often are based on the relationship between two or more variables.
Regression analysis can be used to develop an equation showing how the variables are related.
The variable being predicted is called the dependent variable and is denoted by y.
The variables being used to predict the value of the dependent variable are called the independent variables and are denoted by x.
Simple Linear Regression
Simple linear regression involves one independent variable and one dependent variable.
The relationship between the two variables is approximated by a straight line.
Regression analysis involving two or more independent variables is called multiple regression.
Simple Linear Regression Model
The equation that describes how y is related to x and an error term is called the regression model.
The simple linear regression model is:
  y = β0 + β1x + ε
where:
  β0 and β1 are called parameters of the model,
  ε is a random variable called the error term.
Simple Linear Regression Equation
The simple linear regression equation is:
  E(y) = β0 + β1x
The graph of the regression equation is a straight line.
β0 is the y intercept of the regression line.
β1 is the slope of the regression line.
E(y) is the expected value of y for a given x value.
Simple Linear Regression Equation
Positive Linear Relationship
[Figure: E(y) vs. x; regression line with positive slope β1 and y intercept β0]
Simple Linear Regression Equation
Negative Linear Relationship
[Figure: E(y) vs. x; regression line with negative slope β1 and y intercept β0]
Simple Linear Regression Equation
No Relationship
[Figure: E(y) vs. x; horizontal regression line with slope β1 = 0 and y intercept β0]
Estimated Simple Linear Regression Equation
The estimated simple linear regression equation is:
  ŷ = b0 + b1x
ŷ is the estimated value of y for a given x value.
b1 is the slope of the line.
b0 is the y intercept of the line.
The graph is called the estimated regression line.
Estimation Process
Regression Model:  y = β0 + β1x + ε
Regression Equation:  E(y) = β0 + β1x
Unknown Parameters:  β0, β1
Sample Data:  (x1, y1), ..., (xn, yn)
Sample Statistics:  b0, b1
Estimated Regression Equation:  ŷ = b0 + b1x
b0 and b1 provide estimates of β0 and β1.
Least Squares Method
Least Squares Criterion
  min Σ(yi - ŷi)²
where:
  yi = observed value of the dependent variable for the ith observation
  ŷi = estimated value of the dependent variable for the ith observation
Least Squares Method
Slope for the Estimated Regression Equation
  b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)²
where:
  xi = value of independent variable for ith observation
  yi = value of dependent variable for ith observation
  x̄ = mean value for independent variable
  ȳ = mean value for dependent variable
Least Squares Method
y-Intercept for the Estimated Regression Equation
  b0 = ȳ - b1x̄
Simple Linear Regression
Example: Reed Auto Sales
Reed Auto periodically has a special week-long sale. As part of the advertising campaign Reed runs one or more television commercials during the weekend preceding the sale. Data from a sample of 5 previous sales are shown on the next slide.
Simple Linear Regression
Example: Reed Auto Sales

Number of TV Ads (x)   Number of Cars Sold (y)
1                      14
3                      24
2                      18
1                      17
3                      27

Σx = 10   Σy = 100   x̄ = 2   ȳ = 20
Estimated Regression Equation
Slope for the Estimated Regression Equation
  b1 = Σ(xi - x̄)(yi - ȳ) / Σ(xi - x̄)² = 20/4 = 5
y-Intercept for the Estimated Regression Equation
  b0 = ȳ - b1x̄ = 20 - 5(2) = 10
Estimated Regression Equation
  ŷ = 10 + 5x
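The fitted coefficients can be checked directly from the data. The following is a minimal Python sketch (variable names are my own, not from the slides) applying the least squares formulas:

```python
# Reed Auto sample data from the slides:
# x = number of TV ads, y = number of cars sold
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
n = len(x)

x_bar = sum(x) / n   # mean of x: 2.0
y_bar = sum(y) / n   # mean of y: 20.0

# b1 = sum((xi - x_bar)(yi - y_bar)) / sum((xi - x_bar)^2)
sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sxy / sxx            # 20/4 = 5.0
b0 = y_bar - b1 * x_bar   # 20 - 5(2) = 10.0

print(b0, b1)  # 10.0 5.0, i.e. y_hat = 10 + 5x
```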
Scatter Diagram and Trend Line
[Figure: scatter diagram of Cars Sold (0 to 30) vs. TV Ads (0 to 4) with trend line y = 5x + 10]
Coefficient of Determination
Relationship Among SST, SSR, SSE
  SST = SSR + SSE
  Σ(yi - ȳ)² = Σ(ŷi - ȳ)² + Σ(yi - ŷi)²
where:
  SST = total sum of squares
  SSR = sum of squares due to regression
  SSE = sum of squares due to error
Coefficient of Determination
The coefficient of determination is:
  r² = SSR/SST
where:
  SSR = sum of squares due to regression
  SST = total sum of squares
Coefficient of Determination
  r² = SSR/SST = 100/114 = .8772
The regression relationship is very strong; 87.7% of the variability in the number of cars sold can be explained by the linear relationship between the number of TV ads and the number of cars sold.
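The sums of squares behind r² can be verified from the data and the fitted equation; a short Python sketch (names are illustrative):

```python
x = [1, 3, 2, 1, 3]
y = [14, 24, 18, 17, 27]
y_hat = [10 + 5 * xi for xi in x]   # fitted values from y_hat = 10 + 5x
y_bar = sum(y) / len(y)             # 20.0

sst = sum((yi - y_bar) ** 2 for yi in y)                # total sum of squares
ssr = sum((yh - y_bar) ** 2 for yh in y_hat)            # regression sum of squares
sse = sum((yi - yh) ** 2 for yi, yh in zip(y, y_hat))   # error sum of squares

r2 = ssr / sst
print(sst, ssr, sse, round(r2, 4))  # 114.0 100.0 14.0 0.8772
```

Note that SST = SSR + SSE (114 = 100 + 14), as the identity on the earlier slide requires.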
Sample Correlation Coefficient
  rxy = (sign of b1) √(Coefficient of Determination)
  rxy = (sign of b1) √r²
where:
  b1 = the slope of the estimated regression equation ŷ = b0 + b1x
Sample Correlation Coefficient
  rxy = (sign of b1) √r²
The sign of b1 in the equation ŷ = 10 + 5x is +.
  rxy = +√.8772 = +.9366
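In code the conversion from r² to rxy is a one-liner (a sketch; names are illustrative):

```python
import math

r2 = 100 / 114              # coefficient of determination
b1 = 5                      # slope of y_hat = 10 + 5x, so the sign is +
sign = 1 if b1 > 0 else -1
r_xy = sign * math.sqrt(r2)
print(round(r_xy, 4))       # 0.9366
```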
Assumptions About the Error Term ε
1. The error ε is a random variable with mean of zero.
2. The variance of ε, denoted by σ², is the same for all values of the independent variable.
3. The values of ε are independent.
4. The error ε is a normally distributed random variable.
Testing for Significance
To test for a significant regression relationship, we must conduct a hypothesis test to determine whether the value of β1 is zero.
Two tests are commonly used: t Test and F Test.
Both the t test and F test require an estimate of σ², the variance of ε in the regression model.
Testing for Significance
An Estimate of σ²
  SSE = Σ(yi - ŷi)² = Σ(yi - b0 - b1xi)²
The mean square error (MSE) provides the estimate of σ², and the notation s² is also used.
  s² = MSE = SSE/(n - 2)
Testing for Significance
An Estimate of σ
To estimate σ we take the square root of s² = MSE:
  s = √MSE = √(SSE/(n - 2))
The resulting s is called the standard error of the estimate.
Testing for Significance: t Test
Hypotheses
  H0: β1 = 0
  Ha: β1 ≠ 0
Test Statistic
  t = b1 / s_b1
where
  s_b1 = s / √Σ(xi - x̄)²
Testing for Significance: t Test
Rejection Rule
  Reject H0 if p-value < α, or t < -t(α/2) or t > t(α/2)
where:
  t(α/2) is based on a t distribution with n - 2 degrees of freedom
Testing for Significance: t Test
1. Determine the hypotheses.  H0: β1 = 0,  Ha: β1 ≠ 0
2. Specify the level of significance.  α = .05
3. Select the test statistic.  t = b1 / s_b1
4. State the rejection rule.  Reject H0 if p-value < .05 or |t| > 3.182 (with 3 degrees of freedom)
Testing for Significance: t Test
5. Compute the value of the test statistic.
  t = b1 / s_b1 = 5 / 1.08 = 4.63
6. Determine whether to reject H0.
  t = 4.541 provides an area of .01 in the upper tail. Hence, the p-value is less than .02. (Also, t = 4.63 > 3.182.) We can reject H0.
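The test statistic can be reproduced from the sample quantities (a sketch; the critical value 3.182 is the tabulated t value with 3 degrees of freedom, as on the slide):

```python
import math

x = [1, 3, 2, 1, 3]
b1 = 5.0
s = math.sqrt(14 / 3)                      # standard error of the estimate
x_bar = sum(x) / len(x)
sxx = sum((xi - x_bar) ** 2 for xi in x)   # 4.0
s_b1 = s / math.sqrt(sxx)                  # estimated std. dev. of b1
t = b1 / s_b1
print(round(s_b1, 2), round(t, 2))  # 1.08 4.63
# |t| = 4.63 > 3.182 (t with 3 df, alpha = .05), so reject H0
```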
Confidence Interval for β1
We can use a 95% confidence interval for β1 to test the hypotheses just used in the t test.
H0 is rejected if the hypothesized value of β1 is not included in the confidence interval for β1.
Confidence Interval for β1
The form of a confidence interval for β1 is:
  b1 ± t(α/2) s_b1
where b1 is the point estimator and t(α/2) s_b1 is the margin of error; t(α/2) is the t value providing an area of α/2 in the upper tail of a t distribution with n - 2 degrees of freedom.
Confidence Interval for β1
Rejection Rule
  Reject H0 if 0 is not included in the confidence interval for β1.
95% Confidence Interval for β1
  b1 ± t(α/2) s_b1 = 5 ± 3.182(1.08) = 5 ± 3.44, or 1.56 to 8.44
Conclusion
  0 is not included in the confidence interval. Reject H0.
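The interval arithmetic is a short sketch (3.182 is the tabulated t value with 3 degrees of freedom):

```python
b1, s_b1 = 5.0, 1.08
t_crit = 3.182                 # t_{.025} with 3 df, from a t table
margin = t_crit * s_b1
lo, hi = b1 - margin, b1 + margin
print(round(lo, 2), round(hi, 2))  # 1.56 8.44, an interval that excludes 0
```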
Testing for Significance: F Test
Hypotheses
  H0: β1 = 0
  Ha: β1 ≠ 0
Test Statistic
  F = MSR/MSE
Testing for Significance: F Test
Rejection Rule
  Reject H0 if p-value < α or F > F(α)
where:
  F(α) is based on an F distribution with 1 degree of freedom in the numerator and n - 2 degrees of freedom in the denominator
Testing for Significance: F Test
1. Determine the hypotheses.  H0: β1 = 0,  Ha: β1 ≠ 0
2. Specify the level of significance.  α = .05
3. Select the test statistic.  F = MSR/MSE
4. State the rejection rule.  Reject H0 if p-value < .05 or F > 10.13 (with 1 d.f. in the numerator and 3 d.f. in the denominator)
Testing for Significance: F Test
5. Compute the value of the test statistic.
  F = MSR/MSE = 100/4.667 = 21.43
6. Determine whether to reject H0.
  F = 17.44 provides an area of .025 in the upper tail. Thus, the p-value corresponding to F = 21.43 is less than 2(.025) = .05. Hence, we reject H0.
The statistical evidence is sufficient to conclude that we have a significant relationship between the number of TV ads aired and the number of cars sold.
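The F statistic is a one-line ratio of the mean squares (sketch; the critical value 10.13 is from an F table):

```python
msr = 100 / 1   # MSR = SSR / number of independent variables
mse = 14 / 3    # MSE = SSE / (n - 2)
f = msr / mse
print(round(f, 2))  # 21.43
# 21.43 > 10.13 (F with 1 and 3 df, alpha = .05), so reject H0
```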
Some Cautions about the Interpretation of Significance Tests
Just because we are able to reject H0: β1 = 0 and demonstrate statistical significance does not enable us to conclude that there is a linear relationship between x and y.
Rejecting H0: β1 = 0 and concluding that the relationship between x and y is significant does not enable us to conclude that a cause-and-effect relationship is present between x and y.
Using the Estimated Regression Equation for Estimation and Prediction
Confidence Interval Estimate of E(yp)
  ŷp ± t(α/2) s_ŷp
Prediction Interval Estimate of yp
  ŷp ± t(α/2) s_ind
where:
  the confidence coefficient is 1 - α and t(α/2) is based on a t distribution with n - 2 degrees of freedom
Point Estimation
If 3 TV ads are run prior to a sale, we expect the mean number of cars sold to be:
  ŷ = 10 + 5(3) = 25 cars
Confidence Interval for E(yp)
Estimate of the Standard Deviation of ŷp
  s_ŷp = s √(1/n + (xp - x̄)² / Σ(xi - x̄)²)
  s_ŷp = 2.16025 √(1/5 + (3 - 2)² / [(1-2)² + (3-2)² + (2-2)² + (1-2)² + (3-2)²])
  s_ŷp = 2.16025 √(1/5 + 1/4) = 1.4491
Confidence Interval for E(yp)
The 95% confidence interval estimate of the mean number of cars sold when 3 TV ads are run is:
  ŷp ± t(α/2) s_ŷp = 25 ± 3.1824(1.4491) = 25 ± 4.61
  = 20.39 to 29.61 cars
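A sketch reproducing the standard deviation estimate and the interval (3.1824 is the tabulated t value with 3 degrees of freedom):

```python
import math

x = [1, 3, 2, 1, 3]
n = len(x)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)   # 4.0
s = math.sqrt(14 / 3)                      # standard error of the estimate

xp = 3
y_p = 10 + 5 * xp                          # point estimate: 25
s_yp = s * math.sqrt(1 / n + (xp - x_bar) ** 2 / sxx)   # ~1.4491
margin = 3.1824 * s_yp
print(round(y_p - margin, 2), round(y_p + margin, 2))   # 20.39 29.61
```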
Prediction Interval for yp
Estimate of the Standard Deviation of an Individual Value of yp
  s_ind = s √(1 + 1/n + (xp - x̄)² / Σ(xi - x̄)²)
  s_ind = 2.16025 √(1 + 1/5 + 1/4)
  s_ind = 2.16025(1.20416) = 2.6013
Prediction Interval for yp
The 95% prediction interval estimate of the number of cars sold in one particular week when 3 TV ads are run is:
  ŷp ± t(α/2) s_ind = 25 ± 3.1824(2.6013) = 25 ± 8.28
  = 16.72 to 33.28 cars
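The prediction interval differs from the confidence interval only in the extra 1 under the square root (sketch):

```python
import math

n, x_bar, sxx = 5, 2.0, 4.0   # from the Reed Auto data
s = math.sqrt(14 / 3)
xp = 3
y_p = 10 + 5 * xp
s_ind = s * math.sqrt(1 + 1 / n + (xp - x_bar) ** 2 / sxx)   # ~2.6013
margin = 3.1824 * s_ind
print(round(y_p - margin, 2), round(y_p + margin, 2))  # 16.72 33.28
```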
Residual Analysis
If the assumptions about the error term ε appear questionable, the hypothesis tests about the significance of the regression relationship and the interval estimation results may not be valid.
The residuals provide the best information about ε.
Much of the residual analysis is based on an examination of graphical plots.
Residual for Observation i
  yi - ŷi
Residual Plot Against x
If the assumption that the variance of ε is the same for all values of x is valid, and the assumed regression model is an adequate representation of the relationship between the variables, then the residual plot should give an overall impression of a horizontal band of points.
Residual Plot Against x
[Figure: residuals (y - ŷ) plotted against x, forming a horizontal band around 0: Good Pattern]
Residual Plot Against x
[Figure: residuals (y - ŷ) plotted against x, with spread that changes with x: Nonconstant Variance]
Residual Plot Against x
[Figure: residuals (y - ŷ) plotted against x, showing a curved pattern: Model Form Not Adequate]
Residuals
Residual Plot Against x
Observation Predicted Cars Sold Residuals
1 15 -1
2 25 -1
3 20 -2
4 15 2
5 25 2
Residual Plot Against x
[Figure: TV Ads Residual Plot, residuals plotted against TV Ads (0 to 4), all values between -2 and 2]
Standardized Residuals
Standardized Residual for Observation i
  (yi - ŷi) / s(yi - ŷi)
where:
  s(yi - ŷi) = s √(1 - hi)
  hi = 1/n + (xi - x̄)² / Σ(xi - x̄)²
Standardized Residual Plot
The standardized residual plot can provide insight about the assumption that the error term ε has a normal distribution.
If this assumption is satisfied, the distribution of the standardized residuals should appear to come from a standard normal probability distribution.
Standardized Residuals
Standardized Residual Plot

Observation   Predicted Y   Residuals   Standard Residuals
1             15            -1          -0.535
2             25            -1          -0.535
3             20            -2          -1.069
4             15             2           1.069
5             25             2           1.069
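The table's values can be reproduced in Python. Note (an assumption worth flagging, not stated on the slides): these numbers match the simpler convention residual / sqrt(SSE/(n - 1)) that Excel labels "Standard Residuals", rather than the s √(1 - hi) formula on the previous slide, which yields somewhat different values.

```python
import math

y = [14, 24, 18, 17, 27]
y_hat = [15, 25, 20, 15, 25]                        # predicted cars sold
residuals = [yi - yh for yi, yh in zip(y, y_hat)]   # [-1, -1, -2, 2, 2]

n = len(y)
sse = sum(r ** 2 for r in residuals)   # 14
# Excel-style "Standard Residuals": residual / sqrt(SSE / (n - 1))
scale = math.sqrt(sse / (n - 1))
std_res = [round(r / scale, 3) for r in residuals]
print(std_res)  # [-0.535, -0.535, -1.069, 1.069, 1.069]
```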
Standardized Residual Plot

RESIDUAL OUTPUT
Observation   Predicted Y   Residuals   Standard Residuals
1             15            -1          -0.534522
2             25            -1          -0.534522
3             20            -2          -1.069045
4             15             2           1.069045
5             25             2           1.069045

[Figure: standard residuals plotted against Cars Sold; all values fall between -1.5 and +1.5]
Standardized Residual Plot
All of the standardized residuals are between -1.5 and +1.5, indicating that there is no reason to question the assumption that ε has a normal distribution.
Outliers and Influential Observations
Detecting Outliers
An outlier is an observation that is unusual in comparison with the other data.
Minitab classifies an observation as an outlier if its standardized residual value is < -2 or > +2.
This standardized residual rule sometimes fails to identify an unusually large observation as being an outlier.
This rule's shortcoming can be circumvented by using studentized deleted residuals.
The |ith studentized deleted residual| will be larger than the |ith standardized residual|.