8/17/2019 10 Inference for Regression Part 2
INFERENCE FOR REGRESSION – PART 2
Topics Outline
Analysis of Variance (ANOVA)
F Test for the Slope
Calculating Confidence and Prediction Intervals
Normal Probability Plots
Analysis of Variance (ANOVA)
Question: If the least squares fit is the best fit, how good is it?
The answer to this question depends on the variability in the values of the response variable, that is, on the deviations of the observed y's from their mean ȳ:
Figure 1 Decomposition of total variation
The total deviation y_i − ȳ for an observed y_i can be decomposed into two parts:

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i)

total variation = variation explained by the regression line + unexplained variation

If we square these deviations and add them up, we obtain the following three sources of variability:
Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²

SST = SSR + SSE
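The decomposition SST = SSR + SSE can be checked numerically. The sketch below (in Python, with a small made-up data set) fits the least squares line and computes the three sums of squares:

```python
# Verify the ANOVA decomposition SST = SSR + SSE on a small sample.
# The (x, y) values below are made up for illustration; any sample works.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least squares slope and intercept
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = ybar - b * xbar

yhat = [a + b * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total variation
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # explained by the line
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

print(SST, SSR + SSE)  # the two numbers agree up to rounding error
```
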
If we divide SST by n − 1, we get the sample variance of the observations y₁, y₂, …, y_n:

s_y² = (1 / (n − 1)) Σ_{i=1}^{n} (y_i − ȳ)² = SST / (n − 1)

So, the total sum of squares SST really is a measure of total variation. It has n − 1 degrees of freedom.
The regression sum of squares SSR represents variation due to the relationship between x and y.
It has 1 degree of freedom.
The error sum of squares SSE measures the amount of variability in the response variable due to factors other than the relationship between x and y. It has n − 2 degrees of freedom.
For the car plant electricity usage example (see Excel output on pages 5 and 6),
SST = SSR + SSE
1.5115 = 1.2124 + 0.2991
By themselves, SSR, SSE, and SST provide little that can be directly interpreted. However, a simple ratio of the regression sum of squares SSR to the total sum of squares SST provides a measure of the goodness of fit for the estimated regression equation. This ratio is the
coefficient of determination:

r² = SSR / SST = 1 − SSE / SST
The coefficient of determination is the proportion of the total sum of squares that can be explained by the sum of squares due to regression. In other words, r² measures the proportion of variation in the response variable y that can be explained by y's linear dependence on x in the regression model.
For our data,

r² = SSR / SST = 1.2124 / 1.5115 = 0.802
This high proportion of explained variation indicates that the estimated regression equation provides a good fit and can be very useful for predictions of electricity usage.
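This ratio is a one-line calculation; a minimal sketch using the sums of squares reported above:

```python
# Coefficient of determination for the car plant example,
# using the sums of squares reported in the Excel output.
SSR = 1.2124
SST = 1.5115

r_squared = SSR / SST
print(round(r_squared, 3))  # 0.802
```
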
If we divide the sum of squares for regression and error by their degrees of freedom, we obtain the regression mean square MSR and the error mean square MSE:

MSR = SSR / 1 = SSR    (variance due to regression)

MSE = SSE / (n − 2)    (variance due to error)

Taking a square root of the variance due to error, we obtain the regression standard error or standard error of estimate:

s_e = √MSE = √( SSE / (n − 2) )
Note: Recall that s_e = √( Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2) ) is the standard deviation of the residuals and it estimates the standard deviation of the errors in the model.
For our example,

MSR = SSR / 1 = 1.2124

MSE = SSE / (n − 2) = 0.2991 / (12 − 2) = 0.02991
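These mean squares and the standard error of estimate can be sketched directly from the sums of squares:

```python
import math

# Mean squares and regression standard error for the car plant example
SSR = 1.2124
SSE = 0.2991
n = 12

MSR = SSR / 1        # regression mean square (1 degree of freedom)
MSE = SSE / (n - 2)  # error mean square (n - 2 degrees of freedom)
s_e = math.sqrt(MSE) # standard error of estimate

print(MSR, round(MSE, 5), round(s_e, 3))
```
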
F Test for the Slope
The ratio of the mean squares provides the F statistic

F = MSR / MSE

For our example,

F = MSR / MSE = 1.2124 / 0.02991 = 40.53

It can be proved that the F statistic follows an F distribution with 1 and (n − 2) degrees of freedom and can be used to test the hypothesis of a linear relationship between x and y.
Recall that β represents the slope of the true unknown regression line

E(y) = α + βx

The null and alternative hypotheses for the slope are stated as follows:

H₀: β = 0        Hₐ: β ≠ 0
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
For our example, the corresponding F distribution has df1 = 1 and df2 = n − 2 = 12 − 2 = 10 degrees of freedom. To calculate the P-value associated with the value F = 40.53 of the test statistic, Excel can be used as follows:

P-value = FDIST(test statistic, df1, df2) = FDIST(40.53, 1, 10) = 0.000082
Figure 2 Testing for significance of slope using F distribution with 1 and 10 degrees of freedom
With a P-value this small, we reject the null hypothesis and conclude that there is a significant linear relationship between the electricity usage and the production levels.
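Excel's FDIST returns the upper-tail probability of the F distribution; the same P-value can be sketched in Python, assuming the scipy library is available:

```python
from scipy import stats

# P-value for the F test of the slope: Excel's FDIST(F, df1, df2)
# corresponds to the upper tail of the F distribution.
F = 40.53
df1, df2 = 1, 10

p_value = stats.f.sf(F, df1, df2)  # upper-tail (survival) probability
print(p_value)  # about 0.000082
```
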
Notice that the P-value = 0.000082 for the F test of the slope is the same as the P-value for the t test of the slope performed earlier. Moreover, it can be shown that the square of a t distribution with n − 2 degrees of freedom equals the F distribution with 1 and n − 2 degrees of freedom:

t²_{n−2} = F_{1, n−2}

For our example,

t² = (b / SE_b)² = (0.498830 / 0.078352)² = 6.3665² = 40.53 = F
With only one explanatory variable, the F test will provide the same conclusion as the t test. But with more than one explanatory variable, only the F test can be used to test for an overall significant relationship.
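The t² = F identity above can be verified with the numbers from the regression output:

```python
# With one explanatory variable, t^2 = F: the square of the slope's
# t statistic equals the ANOVA F statistic (car plant numbers).
b = 0.498830     # estimated slope
SE_b = 0.078352  # its standard error

t = b / SE_b
F = 40.532970    # F statistic from the ANOVA table

print(round(t ** 2, 2), round(F, 2))  # both about 40.53
```
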
Example 1 (Car plant electricity usage)
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production, based on the data for each month of the previous year:

Month        Production (x)   Electricity usage (y)
January          4.51             2.48
February         3.58             2.26
March            4.31             2.47
April            5.06             2.77
May              5.64             2.99
June             4.99             3.05
July             5.29             3.18
August           5.83             3.46
September        4.70             3.03
October          5.61             3.26
November         4.90             2.67
December         4.20             2.53
df SS MS F Significance F
Regression 1 1.212382 1.212382 40.532970 0.000082
Residual 10 0.299110 0.029911
Total 11 1.511492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.409048 0.385991 1.059736 0.314190 -0.450992 1.269089
Production 0.498830 0.078352 6.366551 0.000082 0.324252 0.673409
Scatter plot "Car Plant Electricity Usage" (x-axis: Production ($ million); y-axis: Electricity usage (million kWh)) with fitted trendline:

y = 0.4988x + 0.409,    R² = 0.8021
Regression Statistics
Multiple R 0.895606
R Square 0.802109
Adjusted R Square 0.782320
Standard Error 0.172948
Observations 12
Interpretation of Excel’s Regression Output
Regression Statistics

Multiple R          0.895606    (sign of b) · √r²
R Square            0.802109    r² = SSR / SST = 1 − SSE / SST
Adjusted R Square   0.782320    r²_adj = 1 − (SSE / (n − 2)) / (SST / (n − 1))
Standard Error      0.172948    s_e = √MSE = √( SSE / (n − 2) )
Observations        12          n
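The adjusted R Square entry follows from the ANOVA sums of squares; a quick check with the car plant numbers:

```python
# Adjusted R Square from the ANOVA sums of squares (car plant example):
# it compares SSE and SST after dividing each by its degrees of freedom.
SSE = 0.299110
SST = 1.511492
n = 12

r2_adj = 1 - (SSE / (n - 2)) / (SST / (n - 1))
print(round(r2_adj, 6))  # about 0.78232, matching the Excel output
```
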
ANOVA

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square (Variance)   F-statistic     P-value
Regression            1                    SSR              MSR = SSR / 1            F = MSR / MSE   Prob > F
Error                 n − 2                SSE              MSE = SSE / (n − 2)
Total                 n − 1                SST
df SS MS F Significance F
Regression 1 1.212382 1.212382 40.532970 0.000082
Residual 10 0.299110 0.029911
Total 11 1.511492
              Coefficients   Std Error   t Stat     P-value      Lower 95%     Upper 95%
Intercept     a              SE_a        a / SE_a   Prob > |t|   a − t*·SE_a   a + t*·SE_a
x             b              SE_b        b / SE_b   Prob > |t|   b − t*·SE_b   b + t*·SE_b
Coefficients Std Error t Stat P-value Lower 95% Upper 95%
Intercept 0.409048 0.385991 1.059736 0.314190 -0.450992 1.269089
Production 0.498830 0.078352 6.366551 0.000082 0.324252 0.673409
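The main entries of this output can be reproduced from the ordinary least squares formulas alone. The sketch below uses the twelve monthly data pairs from Example 1 and no libraries beyond the standard math module:

```python
import math

# Reproducing the Excel regression output for the car plant data
# with ordinary least squares formulas.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx        # slope ("Production" coefficient)
a = ybar - b * xbar  # intercept

SST = sum((yi - ybar) ** 2 for yi in y)
SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
SSR = SST - SSE

MSE = SSE / (n - 2)
s_e = math.sqrt(MSE)         # Standard Error
SE_b = s_e / math.sqrt(Sxx)  # standard error of the slope
F = (SSR / 1) / MSE          # ANOVA F statistic

print(round(b, 6), round(a, 6))  # about 0.498830 and 0.409048
print(round(F, 2))               # about 40.53
```
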
Calculating Confidence and Prediction Intervals
StatTools provides prediction intervals for individual values, but it does not provide confidence intervals for the mean of y, given a set of x's.

The Excel file ConfPredInt.xlsx (located in the Materials folder on Blackboard) computes the confidence interval and the prediction interval for the Car Plant Electricity Usage data. The same file can be used to calculate the confidence and prediction intervals for any other data after providing the information in the cells filled with yellow color.

Reminder: The prediction interval is always wider than the confidence interval because there is much more variation in predicting an individual value than in estimating a mean value.
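The formulas behind both intervals can be sketched directly. The example below refits the car plant data and computes the two intervals at an illustrative production level x0 = 5 (chosen here for illustration; it is not a value used in the text). It assumes the scipy library for the t critical value:

```python
import math
from scipy import stats

# Confidence interval for the mean of y and prediction interval for an
# individual y at x0, using the car plant fit.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
a = ybar - b * xbar
s_e = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 5.0         # production level to predict at (illustrative)
y0 = a + b * x0  # point estimate
t_star = stats.t.ppf(0.975, n - 2)

# Half-widths: the prediction interval adds "+1" under the square root
# for the extra variability of a single observation around the mean.
hw_conf = t_star * s_e * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
hw_pred = t_star * s_e * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)

print((y0 - hw_conf, y0 + hw_conf))  # 95% confidence interval for the mean
print((y0 - hw_pred, y0 + hw_pred))  # 95% prediction interval (always wider)
```
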
Normal Probability Plots
The third regression assumption states that the error terms are normally distributed. You can evaluate this assumption using a histogram or a normal probability plot of the residuals.

One common normal probability plot is called the quantile-quantile or Q-Q plot. It is a scatter plot of the standardized values from the data set being evaluated versus the values that would be expected if the data were perfectly normally distributed (with the same mean and standard deviation as in the data set). If the data are, in fact, normally distributed, the points in this plot tend to cluster around a 45° line. Any large deviation from a 45° line signals some type of non-normality.
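One way to see how the Q-Q plot coordinates are built is to compute them by hand. The sketch below uses a made-up sample and one common plotting-position convention, (i − 0.5)/n, assuming the scipy library for the normal quantile function:

```python
import math
from scipy import stats

# Building Q-Q plot coordinates: sorted standardized data versus the
# normal quantiles expected at the same plotting positions.
data = [2.3, 1.9, 2.7, 2.1, 2.5, 1.6, 2.9, 2.2]  # made-up sample
n = len(data)

mean = sum(data) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in data) / (n - 1))
standardized = sorted((d - mean) / sd for d in data)

# Expected normal quantiles at plotting positions (i - 0.5)/n
z_expected = [stats.norm.ppf((i - 0.5) / n) for i in range(1, n + 1)]

pairs = list(zip(z_expected, standardized))  # (x, y) points of the Q-Q plot
for zx, zy in pairs:
    print(round(zx, 3), round(zy, 3))
```

Points near the 45° line through the origin suggest the sample is roughly normal.
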
The following figure illustrates the typical shape of normal probability plots. If the data are left-skewed, the curve will rise more rapidly at first and then level off. If the data are normally distributed, the points will plot along an approximately straight line. If the data are right-skewed, the data will rise more slowly at first and then rise at a faster rate for higher values of the variable being plotted.
Figure 3 Normal probability plots for (a) left-skewed,(b) normal, and (c) right-skewed distributions.
Regression analysis procedures are very robust to modest departures from normality. Unless the distribution of the residuals is severely non-normal, the inferences made from the regression output are still approximately valid. In addition, some forms of non-normality can often be remedied by transformations of the response variable.
Although the Regression tool in Excel offers a normal probability plot, this is not the appropriate probability plot for the residuals, so do not use this option. With StatTools, select Q-Q Normal Plot from the Normality Tests dropdown list and check both options at the bottom of the dialog box.
Example 2
Sunflowers Apparel
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years as the chain has expanded the number of stores. Until now, Sunflowers managers selected sites based on subjective factors, such as the availability of a good lease or the perception that a location seemed ideal for an apparel store. The new director of planning wants to develop a systematic approach that will lead to making better decisions during the site selection process. He believes that the size of the store significantly contributes to store sales, and wants to use this relationship in the decision-making process.

To examine the relationship between the store size and its annual sales, data were collected from a sample of 14 stores. The data are stored in Sunflowers_Apparel.xlsx.

(a) Use Excel's Regression tool or StatTools to run a linear regression. Does a straight line provide a useful mathematical model for this relationship?
Regression Statistics
Multiple R 0.950883
R Square 0.904179
Adjusted R Square 0.896194
Standard Error 0.966380
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.747610 105.747610 113.233513 0.00000018
Residual 12 11.206676 0.933890
Total 13 116.954286
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.964474 0.526193 1.832927 0.091727 -0.182003 2.11095
Square Feet 1.669862 0.156925 10.641124 0.000000 1.327951 2.01177
Scatter plot "Square Feet Line Fit Plot" (x-axis: Square Feet (thousands); y-axis: Annual Sales ($ million)) with fitted trendline:

y = 1.6699x + 0.9645,    R² = 0.9042
As the scatter plot and the high r = 0.95 show, as the size of the store increases, annual sales increase approximately as a straight line. Thus, we can assume that a straight line provides a useful mathematical model for this relationship.
(b) Can you safely predict the annual sales for a store whose size is 7 thousand square feet?

The square footage varies from 1.1 to 5.8 thousand square feet. Therefore, annual sales should be predicted only for stores whose size is between 1.1 and 5.8 thousand square feet. It would be improper to use the prediction line to forecast the sales for a new store containing 7,000 square feet because the relationship between sales and store size may have a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales becomes smaller and smaller.
(c) What are the values of SST, SSR, and SSE? Please verify that SST = SSR + SSE.
SST = 116.954286
SSR = 105.747610
SSE = 11.206676

116.954286 = 105.747610 + 11.206676
(d) Use the above sums to calculate the coefficient of determination and interpret it.
r² = SSR / SST = 105.747610 / 116.954286 = 0.904179
Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of the store as measured by the square footage. This large r² indicates a strong linear relationship between these two variables because the use of a regression model has reduced the variability in predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses square footage.
(e) Interpret the standard error of the estimate.
s_e = 0.966380

Recall that the standard error of the estimate represents a measure of the variation around the prediction line. It is measured in the same units as the dependent variable y. Here, the typical difference between actual annual sales at a store and the predicted annual sales using the regression equation is approximately 0.966380 million dollars, or $966,380.
(f) What are the expected annual sales and the residual value for the last data pair (x = 3, y = 4.1)? Interpret both of these values in business terms.

From the regression output, the expected (or predicted) annual sales equal 5.974061, indicating that we expect annual sales to be $5,974,061, on average, for a store with size of 3,000 square feet. The residual equals −1.874061, indicating that for the store corresponding to the last pair in the data set the actual annual sales were $1,874,061 lower than expected.
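Both values follow from plugging x = 3 into the fitted equation. A quick sketch with the rounded coefficients from the output (the Excel output itself uses unrounded coefficients, hence its last digit of 5.974061):

```python
# Expected sales and residual for the last data pair (x = 3, y = 4.1),
# using the fitted Sunflowers equation y-hat = 1.669862x + 0.964474.
b, a = 1.669862, 0.964474
x_last, y_last = 3, 4.1

y_hat = a + b * x_last     # predicted annual sales ($ million)
residual = y_last - y_hat  # actual minus predicted

print(round(y_hat, 6), round(residual, 6))  # 5.97406 -1.87406
```
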
(g) Construct a residual plot.
(h) Use the residual plot to evaluate the regression model assumptions about linearity (mean of zero), independence, and equal spread of the residuals.

Linearity
Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and the x_i's. The residuals appear to be evenly spread above and below 0 for different values of x.

Independence
You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. If the values of y are part of a time series, one residual may sometimes be related to the previous residual. If this relationship exists between consecutive residuals (which violates the assumption of independence), the plot of the residuals versus the time in which the data were collected will often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the same time period, you can assume that the independence assumption is satisfied for these data.

Equal spread
There do not appear to be major differences in the variability of the residuals for different x_i values. Thus, you can conclude that there is no apparent violation in the assumption of equal spread at each level of x.
Residual plot "Square Feet Residual Plot" (x-axis: Square Feet; y-axis: Residuals).
(i) Use StatTools to construct a normal probability plot and evaluate the regression model assumption about normality of the residuals.

From the Q-Q plot of the residuals, the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.

Note: Excel does not readily provide a normal probability plot of residuals, but could be used to get it in the following way. Run a regression with y = Residuals, x = any numbers (for example x = 1, 2, 3, ...) and check the Normal Probability Plots box. Here is the result for our example.
(j) Find the standard error of the slope coefficient. What does this number indicate?

The standard error of the slope coefficient indicates the uncertainty in the estimated slope. It measures about how far the estimated slope (the regression coefficient computed from the sample) differs from the (idealized) true population slope β, due to the randomness of sampling. Here, the estimated slope b = 1.669862, or $1,669,862, typically differs from the population slope by about SE_b = 0.156925, or $156,925.
Q-Q Normal Plot of Residual / Data Set #2 (x-axis: Z-Value; y-axis: Standardized Q-Value).

Normal Probability Plot (x-axis: Sample Percentile; y-axis: Residuals).
(k) Find and interpret the 95% confidence interval for the slope coefficient. (Note: In economics language for the slope, this question sounds like the following. Find the 95% confidence interval for the expected marginal value of an additional 1,000 square feet to Sunflowers Apparel.)

The 95% confidence interval extends from 1.327951 to 2.011773, or $1,327,951 to $2,011,773. We are 95% confident that the additional annual sales, for an additional 1,000 square feet in store size, are between $1,327,951 and $2,011,773, on average. (In economics language: We are 95% sure that the expected marginal value of an additional 1,000 square feet is between $1,327,951 and $2,011,773.)
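This interval follows the template b ± t*·SE_b from the coefficients table. A sketch, assuming the scipy library for the t critical value with n − 2 = 12 degrees of freedom:

```python
from scipy import stats

# 95% confidence interval for the slope: b +/- t* x SE_b
b = 1.669862
SE_b = 0.156925
df = 12  # n - 2 for the Sunflowers sample of 14 stores

t_star = stats.t.ppf(0.975, df)  # two-sided 95% critical value
lower, upper = b - t_star * SE_b, b + t_star * SE_b

print(round(lower, 4), round(upper, 4))  # about 1.328 and 2.0118
```
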
(l) Use the F statistic from the regression output to determine whether the slope is statistically significant.

The ANOVA table shows that the computed test statistic is F = 113.2335 and the P-value is approximately 0. Therefore, you reject the null hypothesis H₀: β = 0 and conclude that the size of the store is significantly related to annual sales.
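As with the car plant example, the "Significance F" entry is the upper-tail F probability; a sketch assuming scipy:

```python
from scipy import stats

# P-value for the overall F test in the Sunflowers output.
F = 113.233513
df1, df2 = 1, 12

p_value = stats.f.sf(F, df1, df2)  # upper-tail probability
print(p_value)  # a very small number, about 1.8e-07
```
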
(m) Use the t statistic from the regression output to determine whether the slope is statistically significant.

The slope is significantly different from 0. This can be seen either from the P-value of the test statistic (for t = 10.6411, P-value ≈ 0) or from the confidence interval for the slope (1.33 to 2.01), which does not include 0. This says that the size of the store significantly contributes to store annual sales. Stores with larger size make larger annual sales, on average.
(n) Use the ConfPredInt.xlsx to construct a 95% confidence interval of the mean annual sales for the entire population of stores that contain 4,000 square feet (x = 4).

The confidence interval is 6.971119 to 8.316727. Therefore, the mean annual sales are between $6,971,119 and $8,316,727 for the population of stores with 4,000 square feet.

(o) Use the ConfPredInt.xlsx to construct a 95% prediction interval of the annual sales for an individual store that contains 4,000 square feet (x = 4).

The prediction interval is 5.433482 to 9.854364. Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet are between $5,433,482 and $9,854,364.
(p) Compare the intervals constructed in (n) and (o).

If you compare the results of the confidence interval estimate and the prediction interval estimate, you see that the prediction interval for an individual store is much wider than the confidence interval estimate for the mean.