8/17/2019 10 Inference for Regression Part 2
INFERENCE FOR REGRESSION – PART 2
Topics Outline
Analysis of Variance (ANOVA)
F Test for the Slope
Calculating Confidence and Prediction Intervals
Normal Probability Plots
Analysis of Variance (ANOVA)
Question: If the least squares fit is the best fit, how good is it?
The answer to this question depends on the variability in the values of the response variable, that is, on the deviations of the observed y's from their mean ȳ:
Figure 1 Decomposition of total variation
The total deviation y_i − ȳ for an observed y_i can be decomposed into two parts:

y_i − ȳ = (ŷ_i − ȳ) + (y_i − ŷ_i)

total variation = variation explained by the regression line + unexplained variation

If we square these deviations and add them up, we obtain the following three sources of variability:
Total Sum of Squares = Regression Sum of Squares + Error Sum of Squares

Σ_{i=1}^{n} (y_i − ȳ)² = Σ_{i=1}^{n} (ŷ_i − ȳ)² + Σ_{i=1}^{n} (y_i − ŷ_i)²

SST = SSR + SSE
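The decomposition SST = SSR + SSE can be checked numerically. The sketch below (in Python, with a small made-up data set) fits the least squares line and computes the three sums of squares:

```python
# Verify the ANOVA decomposition SST = SSR + SSE on a small sample.
# The (x, y) values below are made up for illustration; any sample works.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 2.9, 4.2, 4.8, 6.0]
n = len(x)

xbar = sum(x) / n
ybar = sum(y) / n

# Least squares slope and intercept
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))
b = Sxy / Sxx
a = ybar - b * xbar

yhat = [a + b * xi for xi in x]

SST = sum((yi - ybar) ** 2 for yi in y)               # total variation
SSR = sum((yh - ybar) ** 2 for yh in yhat)            # explained by the line
SSE = sum((yi - yh) ** 2 for yi, yh in zip(y, yhat))  # unexplained

print(SST, SSR + SSE)  # the two numbers agree up to rounding error
```
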
If we divide SST by n − 1, we get the sample variance of the observations y₁, y₂, …, y_n:

s_y² = (1 / (n − 1)) Σ_{i=1}^{n} (y_i − ȳ)² = SST / (n − 1)

So, the total sum of squares SST really is a measure of total variation. It has n − 1 degrees of freedom.
The regression sum of squares SSR represents variation due to the relationship between x and y.
It has 1 degree of freedom.
The error sum of squares SSE measures the amount of variability in the response variable due to factors other than the relationship between x and y. It has n − 2 degrees of freedom.
For the car plant electricity usage example (see Excel output on pages 5 and 6),
SST = SSR + SSE
1.5115 = 1.2124 + 0.2991
By themselves, SSR, SSE, and SST provide little that can be directly interpreted. However, a simple ratio of the regression sum of squares SSR to the total sum of squares SST provides a measure of the goodness of fit for the estimated regression equation. This ratio is the
coefficient of determination:

r² = SSR / SST = 1 − SSE / SST
The coefficient of determination is the proportion of the total sum of squares that can be explained by the sum of squares due to regression. In other words, r² measures the proportion of variation in the response variable y that can be explained by y's linear dependence on x in the regression model.
For our data,

r² = SSR / SST = 1.2124 / 1.5115 = 0.802
This high proportion of explained variation indicates that the estimated regression equation provides a good fit and can be very useful for predictions of electricity usage.
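This ratio is a one-line calculation; a minimal sketch using the sums of squares reported above:

```python
# Coefficient of determination for the car plant example,
# using the sums of squares reported in the Excel output.
SSR = 1.2124
SST = 1.5115

r_squared = SSR / SST
print(round(r_squared, 3))  # 0.802
```
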
If we divide the sum of squares for regression and error by their degrees of freedom, we obtain the regression mean square MSR and the error mean square MSE:

MSR = SSR / 1 = SSR    (variance due to regression)

MSE = SSE / (n − 2)    (variance due to error)

Taking a square root of the variance due to error, we obtain the regression standard error or standard error of estimate:

s_e = √MSE = √( SSE / (n − 2) )
Note: Recall that s_e = √( Σ_{i=1}^{n} (y_i − ŷ_i)² / (n − 2) ) is the standard deviation of the residuals and it estimates the standard deviation of the errors in the model.
For our example,

MSR = SSR / 1 = 1.2124

MSE = SSE / (n − 2) = 0.2991 / (12 − 2) = 0.02991
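These mean squares and the standard error of estimate can be sketched directly from the sums of squares:

```python
import math

# Mean squares and regression standard error for the car plant example
SSR = 1.2124
SSE = 0.2991
n = 12

MSR = SSR / 1        # regression mean square (1 degree of freedom)
MSE = SSE / (n - 2)  # error mean square (n - 2 degrees of freedom)
s_e = math.sqrt(MSE) # standard error of estimate

print(MSR, round(MSE, 5), round(s_e, 3))
```
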
F Test for the Slope
The ratio of the mean squares provides the F statistic

F = MSR / MSE

For our example,

F = MSR / MSE = 1.2124 / 0.02991 = 40.53

It can be proved that the F statistic follows an F distribution with 1 and (n − 2) degrees of freedom and can be used to test the hypothesis of a linear relationship between x and y.
Recall that β represents the slope of the true unknown regression line

E(y) = α + βx

The null and alternative hypotheses for the slope are stated as follows:

H₀: β = 0        Hₐ: β ≠ 0
If the null hypothesis is rejected, we would conclude that there is evidence of a linear relationship.
For our example, the corresponding F distribution has df1 = 1 and df2 = n − 2 = 12 − 2 = 10 degrees of freedom. To calculate the P-value associated with the value F = 40.53 of the test statistic, Excel can be used as follows:

P-value = FDIST(test statistic, df1, df2) = FDIST(40.53, 1, 10) = 0.000082
Figure 2 Testing for significance of slope using F distribution with 1 and 10 degrees of freedom
With a P-value this small, we reject the null hypothesis and conclude that there is a significant linear relationship between the electricity usage and the production levels.
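Excel's FDIST returns the upper-tail probability of the F distribution; the same P-value can be sketched in Python, assuming the scipy library is available:

```python
from scipy import stats

# P-value for the F test of the slope: Excel's FDIST(F, df1, df2)
# corresponds to the upper tail of the F distribution.
F = 40.53
df1, df2 = 1, 10

p_value = stats.f.sf(F, df1, df2)  # upper-tail (survival) probability
print(p_value)  # about 0.000082
```
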
Notice that the P-value = 0.000082 for the F test of the slope is the same as the P-value for the t test of the slope performed earlier. Moreover, it can be shown that the square of a t distribution with n − 2 degrees of freedom equals the F distribution with 1 and n − 2 degrees of freedom:

t²_{n−2} = F_{1, n−2}

For our example,

t² = (b / SE_b)² = (0.498830 / 0.078352)² = 6.3665² = 40.53 = F
With only one explanatory variable, the F test will provide the same conclusion as the t test. But with more than one explanatory variable, only the F test can be used to test for an overall significant relationship.
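The t² = F identity above can be verified with the numbers from the regression output:

```python
# With one explanatory variable, t^2 = F: the square of the slope's
# t statistic equals the ANOVA F statistic (car plant numbers).
b = 0.498830     # estimated slope
SE_b = 0.078352  # its standard error

t = b / SE_b
F = 40.532970    # F statistic from the ANOVA table

print(round(t ** 2, 2), round(F, 2))  # both about 40.53
```
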
Example 1 (Car plant electricity usage)
The manager of a car plant wishes to investigate how the plant's electricity usage depends upon the plant's production, based on the data for each month of the previous year:

Month        Production (x)   Electricity usage (y)
January          4.51             2.48
February         3.58             2.26
March            4.31             2.47
April            5.06             2.77
May              5.64             2.99
June             4.99             3.05
July             5.29             3.18
August           5.83             3.46
September        4.70             3.03
October          5.61             3.26
November         4.90             2.67
December         4.20             2.53
df SS MS F Significance F
Regression 1 1.212382 1.212382 40.532970 0.000082
Residual 10 0.299110 0.029911
Total 11 1.511492
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.409048 0.385991 1.059736 0.314190 -0.450992 1.269089
Production 0.498830 0.078352 6.366551 0.000082 0.324252 0.673409
Scatter plot "Car Plant Electricity Usage" (x-axis: Production ($ million); y-axis: Electricity usage (million kWh)) with fitted trendline:

y = 0.4988x + 0.409,    R² = 0.8021
Regression Statistics
Multiple R 0.895606
R Square 0.802109
Adjusted R Square 0.782320
Standard Error 0.172948
Observations 12
Interpretation of Excel’s Regression Output
Regression Statistics

Multiple R          0.895606    (sign of b) · √r²
R Square            0.802109    r² = SSR / SST = 1 − SSE / SST
Adjusted R Square   0.782320    r²_adj = 1 − (SSE / (n − 2)) / (SST / (n − 1))
Standard Error      0.172948    s_e = √MSE = √( SSE / (n − 2) )
Observations        12          n
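The adjusted R Square entry follows from the ANOVA sums of squares; a quick check with the car plant numbers:

```python
# Adjusted R Square from the ANOVA sums of squares (car plant example):
# it compares SSE and SST after dividing each by its degrees of freedom.
SSE = 0.299110
SST = 1.511492
n = 12

r2_adj = 1 - (SSE / (n - 2)) / (SST / (n - 1))
print(round(r2_adj, 6))  # about 0.78232, matching the Excel output
```
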
ANOVA

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Square (Variance)   F-statistic     P-value
Regression            1                    SSR              MSR = SSR / 1            F = MSR / MSE   Prob > F
Error                 n − 2                SSE              MSE = SSE / (n − 2)
Total                 n − 1                SST
df SS MS F Significance F
Regression 1 1.212382 1.212382 40.532970 0.000082
Residual 10 0.299110 0.029911
Total 11 1.511492
              Coefficients   Std Error   t Stat     P-value      Lower 95%     Upper 95%
Intercept     a              SE_a        a / SE_a   Prob > |t|   a − t*·SE_a   a + t*·SE_a
x             b              SE_b        b / SE_b   Prob > |t|   b − t*·SE_b   b + t*·SE_b
Coefficients Std Error t Stat P-value Lower 95% Upper 95%
Intercept 0.409048 0.385991 1.059736 0.314190 -0.450992 1.269089
Production 0.498830 0.078352 6.366551 0.000082 0.324252 0.673409
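The main entries of this output can be reproduced from the ordinary least squares formulas alone. The sketch below uses the twelve monthly data pairs from Example 1 and no libraries beyond the standard math module:

```python
import math

# Reproducing the Excel regression output for the car plant data
# with ordinary least squares formulas.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
Sxy = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y))

b = Sxy / Sxx        # slope ("Production" coefficient)
a = ybar - b * xbar  # intercept

SST = sum((yi - ybar) ** 2 for yi in y)
SSE = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))
SSR = SST - SSE

MSE = SSE / (n - 2)
s_e = math.sqrt(MSE)         # Standard Error
SE_b = s_e / math.sqrt(Sxx)  # standard error of the slope
F = (SSR / 1) / MSE          # ANOVA F statistic

print(round(b, 6), round(a, 6))  # about 0.498830 and 0.409048
print(round(F, 2))               # about 40.53
```
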
Calculating Confidence and Prediction Intervals
StatTools provides prediction intervals for individual values, but it does not provide confidence intervals for the mean of y, given a set of x's.

The Excel file ConfPredInt.xlsx (located in the Materials folder on Blackboard) computes the confidence interval and the prediction interval for the Car Plant Electricity Usage data. The same file can be used to calculate the confidence and prediction intervals for any other data after providing the information in the cells filled with yellow color.

Reminder: The prediction interval is always wider than the confidence interval because there is much more variation in predicting an individual value than in estimating a mean value.
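The formulas behind both intervals can be sketched directly. The example below refits the car plant data and computes the two intervals at an illustrative production level x0 = 5 (chosen here for illustration; it is not a value used in the text). It assumes the scipy library for the t critical value:

```python
import math
from scipy import stats

# Confidence interval for the mean of y and prediction interval for an
# individual y at x0, using the car plant fit.
x = [4.51, 3.58, 4.31, 5.06, 5.64, 4.99, 5.29, 5.83, 4.70, 5.61, 4.90, 4.20]
y = [2.48, 2.26, 2.47, 2.77, 2.99, 3.05, 3.18, 3.46, 3.03, 3.26, 2.67, 2.53]
n = len(x)

xbar, ybar = sum(x) / n, sum(y) / n
Sxx = sum((xi - xbar) ** 2 for xi in x)
b = sum((xi - xbar) * (yi - ybar) for xi, yi in zip(x, y)) / Sxx
a = ybar - b * xbar
s_e = math.sqrt(sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))

x0 = 5.0         # production level to predict at (illustrative)
y0 = a + b * x0  # point estimate
t_star = stats.t.ppf(0.975, n - 2)

# Half-widths: the prediction interval adds "+1" under the square root
# for the extra variability of a single observation around the mean.
hw_conf = t_star * s_e * math.sqrt(1 / n + (x0 - xbar) ** 2 / Sxx)
hw_pred = t_star * s_e * math.sqrt(1 + 1 / n + (x0 - xbar) ** 2 / Sxx)

print((y0 - hw_conf, y0 + hw_conf))  # 95% confidence interval for the mean
print((y0 - hw_pred, y0 + hw_pred))  # 95% prediction interval (always wider)
```
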
Normal Probability Plots
The third regression assumption states that the error terms are normally distributed. You can evaluate this assumption using a histogram or a normal probability plot of the residuals.

One common normal probability plot is called the quantile-quantile or Q-Q plot. It is a scatter plot of the standardized values from the data set being evaluated versus the values that would be expected if the data were perfectly normally distributed (with the same mean and standard deviation as in the data set). If the data are, in fact, normally distributed, the points in this plot tend to cluster around a 45° line. Any large deviation from a 45° line signals some type of non-normality.
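One way to see how the Q-Q plot coordinates are built is to compute them by hand. The sketch below uses a made-up sample and one common plotting-position convention, (i − 0.5)/n, assuming the scipy library for the normal quantile function:

```python
import math
from scipy import stats

# Building Q-Q plot coordinates: sorted standardized data versus the
# normal quantiles expected at the same plotting positions.
data = [2.3, 1.9, 2.7, 2.1, 2.5, 1.6, 2.9, 2.2]  # made-up sample
n = len(data)

mean = sum(data) / n
sd = math.sqrt(sum((d - mean) ** 2 for d in data) / (n - 1))
standardized = sorted((d - mean) / sd for d in data)

# Expected normal quantiles at plotting positions (i - 0.5)/n
z_expected = [stats.norm.ppf((i - 0.5) / n) for i in range(1, n + 1)]

pairs = list(zip(z_expected, standardized))  # (x, y) points of the Q-Q plot
for zx, zy in pairs:
    print(round(zx, 3), round(zy, 3))
```

Points near the 45° line through the origin suggest the sample is roughly normal.
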
The following figure illustrates the typical shape of normal probability plots. If the data are left-skewed, the curve will rise more rapidly at first and then level off. If the data are normally distributed, the points will plot along an approximately straight line. If the data are right-skewed, the data will rise more slowly at first and then rise at a faster rate for higher values of the variable being plotted.
Figure 3 Normal probability plots for (a) left-skewed,(b) normal, and (c) right-skewed distributions.
Regression analysis procedures are very robust to modest departures from normality. Unless the distribution of the residuals is severely non-normal, the inferences made from the regression output are still approximately valid. In addition, some forms of non-normality can often be remedied by transformations of the response variable.
Although the Regression tool in Excel offers a normal probability plot, this is not the appropriate probability plot for the residuals, so do not use this option. With StatTools, select Q-Q Normal Plot from the Normality Tests dropdown list and check both options at the bottom of the dialog box.
Example 2
Sunflowers Apparel
The sales for Sunflowers Apparel, a chain of upscale clothing stores for women, have increased during the past 12 years as the chain has expanded the number of stores. Until now, Sunflowers managers selected sites based on subjective factors, such as the availability of a good lease or the perception that a location seemed ideal for an apparel store. The new director of planning wants to develop a systematic approach that will lead to making better decisions during the site selection process. He believes that the size of the store significantly contributes to store sales, and wants to use this relationship in the decision-making process.

To examine the relationship between the store size and its annual sales, data were collected from a sample of 14 stores. The data are stored in Sunflowers_Apparel.xlsx.

(a) Use Excel's Regression tool or StatTools to run a linear regression. Does a straight line provide a useful mathematical model for this relationship?
Regression Statistics
Multiple R 0.950883
R Square 0.904179
Adjusted R Square 0.896194
Standard Error 0.966380
Observations 14
ANOVA
df SS MS F Significance F
Regression 1 105.747610 105.747610 113.233513 0.00000018
Residual 12 11.206676 0.933890
Total 13 116.954286
Coefficients Standard Error t Stat P-value Lower 95% Upper 95%
Intercept 0.964474 0.526193 1.832927 0.091727 -0.182003 2.11095
Square Feet 1.669862 0.156925 10.641124 0.000000 1.327951 2.01177
Scatter plot "Square Feet Line Fit Plot" (x-axis: Square Feet (thousands); y-axis: Annual Sales ($ million)) with fitted trendline:

y = 1.6699x + 0.9645,    R² = 0.9042
As the scatter plot and the high r = 0.95 show, as the size of the store increases, annual sales increase approximately as a straight line. Thus, we can assume that a straight line provides a useful mathematical model for this relationship.
(b) Can you safely predict the annual sales for a store whose size is 7 thousand square feet?

The square footage varies from 1.1 to 5.8 thousand square feet. Therefore, annual sales should be predicted only for stores whose size is between 1.1 and 5.8 thousand square feet. It would be improper to use the prediction line to forecast the sales for a new store containing 7,000 square feet because the relationship between sales and store size may have a point of diminishing returns. If that is true, as square footage increases beyond 5,800 square feet, the effect on sales becomes smaller and smaller.
(c) What are the values of SST, SSR, and SSE? Please verify that SST = SSR + SSE.
SST = 116.954286
SSR = 105.747610
SSE = 11.206676

116.954286 = 105.747610 + 11.206676
(d) Use the above sums to calculate the coefficient of determination and interpret it.
r² = SSR / SST = 105.747610 / 116.954286 = 0.904179
Therefore, 90.42% of the variation in annual sales is explained by the variability in the size of the store as measured by the square footage. This large r² indicates a strong linear relationship between these two variables because the use of a regression model has reduced the variability in predicting annual sales by 90.42%. Only 9.58% of the sample variability in annual sales is due to factors other than what is accounted for by the linear regression model that uses square footage.
(e) Interpret the standard error of the estimate.
s_e = 0.966380

Recall that the standard error of the estimate represents a measure of the variation around the prediction line. It is measured in the same units as the dependent variable y. Here, the typical difference between actual annual sales at a store and the predicted annual sales using the regression equation is approximately 0.966380 million dollars, or $966,380.
(f) What are the expected annual sales and the residual value for the last data pair (x = 3, y = 4.1)? Interpret both of these values in business terms.

From the regression output, the expected (or predicted) annual sales equal 5.974061, indicating that we expect annual sales to be $5,974,061, on average, for a store with size of 3,000 square feet. The residual equals −1.874061, indicating that for the store corresponding to the last pair in the data set the actual annual sales were $1,874,061 lower than expected.
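Both values follow from plugging x = 3 into the fitted equation. A quick sketch with the rounded coefficients from the output (the Excel output itself uses unrounded coefficients, hence its last digit of 5.974061):

```python
# Expected sales and residual for the last data pair (x = 3, y = 4.1),
# using the fitted Sunflowers equation y-hat = 1.669862x + 0.964474.
b, a = 1.669862, 0.964474
x_last, y_last = 3, 4.1

y_hat = a + b * x_last     # predicted annual sales ($ million)
residual = y_last - y_hat  # actual minus predicted

print(round(y_hat, 6), round(residual, 6))  # 5.97406 -1.87406
```
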
(g) Construct a residual plot.
(h) Use the residual plot to evaluate the regression model assumptions about linearity (mean of zero), independence, and equal spread of the residuals.

Linearity
Although there is widespread scatter in the residual plot, there is no clear pattern or relationship between the residuals and the x_i's. The residuals appear to be evenly spread above and below 0 for different values of x.

Independence
You can evaluate the assumption of independence of the errors by plotting the residuals in the order or sequence in which the data were collected. If the values of y are part of a time series, one residual may sometimes be related to the previous residual. If this relationship exists between consecutive residuals (which violates the assumption of independence), the plot of the residuals versus the time in which the data were collected will often show a cyclical pattern. Because the Sunflowers Apparel data were collected during the same time period, you can assume that the independence assumption is satisfied for these data.

Equal spread
There do not appear to be major differences in the variability of the residuals for different x_i values. Thus, you can conclude that there is no apparent violation in the assumption of equal spread at each level of x.
Residual plot "Square Feet Residual Plot" (x-axis: Square Feet; y-axis: Residuals).
(i) Use StatTools to construct a normal probability plot and evaluate the regression model assumption about normality of the residuals.

From the Q-Q plot of the residuals, the data do not appear to depart substantially from a normal distribution. The robustness of regression analysis with modest departures from normality enables you to conclude that you should not be overly concerned about departures from this normality assumption in the Sunflowers Apparel data.

Note: Excel does not readily provide a normal probability plot of residuals, but could be used to get it in the following way. Run a regression with y = Residuals, x = any numbers (for example x = 1, 2, 3, ...) and check the Normal Probability Plots box. Here is the result for our example.
(j) Find the standard error of the slope coefficient. What does this number indicate?

The standard error of the slope coefficient indicates the uncertainty in the estimated slope. It measures about how far the estimated slope (the regression coefficient computed from the sample) differs from the (idealized) true population slope β, due to the randomness of sampling. Here, the estimated slope b = 1.669862, or $1,669,862, typically differs from the population slope by about SE_b = 0.156925, or $156,925.
Q-Q Normal Plot of Residual / Data Set #2 (x-axis: Z-Value; y-axis: Standardized Q-Value).

Normal Probability Plot (x-axis: Sample Percentile; y-axis: Residuals).
(k) Find and interpret the 95% confidence interval for the slope coefficient. (Note: In economics language for the slope, this question sounds like the following. Find the 95% confidence interval for the expected marginal value of an additional 1,000 square feet to Sunflowers Apparel.)

The 95% confidence interval extends from 1.327951 to 2.011773, or $1,327,951 to $2,011,773. We are 95% confident that the additional annual sales, for an additional 1,000 square feet in store size, are between $1,327,951 and $2,011,773, on average. (In economics language: We are 95% sure that the expected marginal value of an additional 1,000 square feet is between $1,327,951 and $2,011,773.)
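This interval follows the template b ± t*·SE_b from the coefficients table. A sketch, assuming the scipy library for the t critical value with n − 2 = 12 degrees of freedom:

```python
from scipy import stats

# 95% confidence interval for the slope: b +/- t* x SE_b
b = 1.669862
SE_b = 0.156925
df = 12  # n - 2 for the Sunflowers sample of 14 stores

t_star = stats.t.ppf(0.975, df)  # two-sided 95% critical value
lower, upper = b - t_star * SE_b, b + t_star * SE_b

print(round(lower, 4), round(upper, 4))  # about 1.328 and 2.0118
```
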
(l) Use the F statistic from the regression output to determine whether the slope is statistically significant.

The ANOVA table shows that the computed test statistic is F = 113.2335 and the P-value is approximately 0. Therefore, you reject the null hypothesis H₀: β = 0 and conclude that the size of the store is significantly related to annual sales.
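As with the car plant example, the "Significance F" entry is the upper-tail F probability; a sketch assuming scipy:

```python
from scipy import stats

# P-value for the overall F test in the Sunflowers output.
F = 113.233513
df1, df2 = 1, 12

p_value = stats.f.sf(F, df1, df2)  # upper-tail probability
print(p_value)  # a very small number, about 1.8e-07
```
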
(m) Use the t statistic from the regression output to determine whether the slope is statistically significant.

The slope is significantly different from 0. This can be seen either from the P-value of the test statistic (for t = 10.6411, P-value ≈ 0) or from the confidence interval for the slope (1.33 to 2.01), which does not include 0. This says that the size of the store significantly contributes to store annual sales. Stores with larger size make larger annual sales, on average.
(n) Use the ConfPredInt.xlsx to construct a 95% confidence interval of the mean annual sales for the entire population of stores that contain 4,000 square feet (x = 4).

The confidence interval is 6.971119 to 8.316727. Therefore, the mean annual sales are between $6,971,119 and $8,316,727 for the population of stores with 4,000 square feet.

(o) Use the ConfPredInt.xlsx to construct a 95% prediction interval of the annual sales for an individual store that contains 4,000 square feet (x = 4).

The prediction interval is 5.433482 to 9.854364. Therefore, with 95% confidence, you predict that the annual sales for an individual store with 4,000 square feet are between $5,433,482 and $9,854,364.
(p) Compare the intervals constructed in (n) and (o).

If you compare the results of the confidence interval estimate and the prediction interval estimate, you see that the prediction interval for an individual store is much wider than the confidence interval estimate for the mean.