MULTIPLE REGRESSION – PART 2

Topics Outline
• Running Multiple Regression and Interpreting the Results
• Validation of the Fit

Running Multiple Regression and Interpreting the Results

Example 1: Overhead Costs at Bendrix

The Bendrix Company manufactures various types of parts for automobiles. The manager of the factory wants to get a better understanding of overhead costs and has tracked total overhead costs for the past 36 months. To help explain these, he has also collected data on two variables that are related to the amount of work done at the factory (see Overhead_Costs.xlsx):

– MachHrs: the number of machine hours used during the month
– ProdRuns: the number of separate production runs during the month

Our earlier analysis of the two candidates for explanatory variables (see Overhead_Costs_Finished.xlsx) indicated that both variables are related to Overhead. Therefore, it makes sense to try including both in the regression equation. With any luck, the linear fit should improve.

(a) Use StatTools and Overhead_Costs.xlsx to run a regression for Overhead costs as a linear function of MachHrs (machine hours) and ProdRuns (production runs). (Use Overhead_Costs_MultipleRegression_Finished.xlsx as a reference.) To obtain the regression output, select Regression from the StatTools Regression and Classification dropdown list and fill out the resulting dialog box as shown below.


The main regression output is:

Summary      Multiple R   R-Square   Adjusted R-Square   StErr of Estimate
             0.9308       0.8664     0.8583              4108.993

ANOVA Table    Degrees of Freedom   Sum of Squares   Mean of Squares   F-Ratio    p-Value
Explained      2                    3614020661       1807010330        107.0261   < 0.0001
Unexplained    33                   557166199.1      16883824.22

Regression Table   Coefficient   Standard Error   t-Value   p-Value    95% CI Lower   95% CI Upper
Constant           3996.678      6603.651         0.6052    0.5492     -9438.551      17431.907
MachHrs            43.536        3.589            12.1289   < 0.0001   36.234         50.839
ProdRuns           883.618       82.251           10.7429   < 0.0001   716.276        1050.960

(b) What is the equation of the regression model?

y = α + β₁x₁ + β₂x₂ + ε

(c) What is the equation of the true regression surface?

μ_y = α + β₁x₁ + β₂x₂

Geometrically it is an equation of a plane in three-dimensional space. We refer to this plane as the plane of means.

(d) What is the equation of the fitted surface?

ŷ = a + b₁x₁ + b₂x₂

From the regression output, the equation of the fitted surface is

Predicted Overhead = 3997 + 43.54MachHrs + 883.62ProdRuns

Geometrically it is an equation of a plane in three-dimensional space. We refer to this plane as the least squares plane.

(e) Interpret the regression coefficients.

The Bendrix manager can interpret the intercept, a = $3,997, as the fixed component of overhead; that is, the overhead cost when MachHrs = 0 and ProdRuns = 0. The slope terms involving MachHrs and ProdRuns are the variable components of overhead.

If the number of production runs is held constant, the overhead cost is expected to increase by b₁ = $43.54 for each extra machine hour.

If the number of machine hours is held constant, the overhead cost is expected to increase by b₂ = $883.62 for each extra production run.
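As a quick numerical check (a Python sketch, not part of the StatTools workflow), the fitted equation can be written as a small function, and the two slope interpretations above can be verified by changing one input at a time:

```python
def predict_overhead(mach_hrs, prod_runs):
    """Point prediction from the least squares plane:
    Predicted Overhead = 3997 + 43.54*MachHrs + 883.62*ProdRuns."""
    return 3997 + 43.54 * mach_hrs + 883.62 * prod_runs

# One extra machine hour, ProdRuns held constant: adds about $43.54.
print(predict_overhead(1431, 35) - predict_overhead(1430, 35))

# One extra production run, MachHrs held constant: adds about $883.62.
print(predict_overhead(1430, 36) - predict_overhead(1430, 35))
```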


(f) From our previous analysis, the regression equations with single explanatory variables are

Predicted Overhead = 48621 + 34.7MachHrs and

Predicted Overhead = 75606 + 655.1ProdRuns

Compare these equations with the multiple regression equation.

The coefficient of MachHrs has increased from 34.7 to 43.5 and the coefficient of ProdRuns has increased from 655.1 to 883.6. Also, the intercept is now lower than either intercept in the single-variable equations.

The reasoning is that when MachHrs is the only variable in the equation, ProdRuns is not being held constant – it is being ignored – so in effect the coefficient 34.7 of MachHrs indicates the effect of MachHrs and the omitted ProdRuns on Overhead. But when both variables are included, the coefficient 43.5 of MachHrs indicates the effect of MachHrs only, holding ProdRuns constant. Because the coefficients of MachHrs in the two equations have different meanings, it is not surprising that they result in different numerical estimates.
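This effect can be illustrated with a small simulation. The data below are purely hypothetical (not the Bendrix sample): y is built from two negatively correlated explanatory variables with known partial slopes, and the one-variable slope of x1 comes out very different from its slope in the two-variable fit, because it absorbs the omitted variable's effect:

```python
import random

random.seed(0)
n = 200

# Hypothetical data: x2 is negatively correlated with x1 (as ProdRuns is
# with MachHrs), and y depends on both with known partial slopes 40 and 800.
x1 = [random.gauss(50, 10) for _ in range(n)]
x2 = [-0.3 * v + random.gauss(0, 5) for v in x1]
y = [40 * a + 800 * b for a, b in zip(x1, x2)]  # exact, for clarity

def center(v):
    m = sum(v) / len(v)
    return [u - m for u in v]

c1, c2, cy = center(x1), center(x2), center(y)
S11 = sum(a * a for a in c1)
S22 = sum(b * b for b in c2)
S12 = sum(a * b for a, b in zip(c1, c2))
S1y = sum(a * c for a, c in zip(c1, cy))
S2y = sum(b * c for b, c in zip(c2, cy))

# Two-variable least squares slope of x1 (normal equations, Cramer's rule):
det = S11 * S22 - S12 ** 2
b1_multi = (S1y * S22 - S2y * S12) / det

# One-variable slope of y on x1 alone:
b1_simple = S1y / S11

print(b1_multi)   # recovers 40, the effect of x1 with x2 held fixed
print(b1_simple)  # roughly 40 + 800*(slope of x2 on x1): far from 40
```

Because x2 is negatively related to x1 while contributing positively to y, ignoring x2 pulls the one-variable slope well away from the true partial effect, just as the MachHrs coefficient shifted between the two Bendrix equations.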

Note: In general, it is difficult to guess the changes that will occur when more explanatory variables are included in the equation, but it is likely that changes will occur. The estimated coefficient of any explanatory variable typically depends on which other explanatory variables are included in the equation.

(g) Interpret the standard error of estimate se.

Recall that se is essentially the standard deviation of residuals and estimates the true unknown standard deviation σ of the error term ε in the model. The standard error of estimate se is a measure of the typical prediction error when the multiple regression equation is used to predict the response variable.

In this example, se = $4,109. Assuming that the errors are approximately normally distributed and using the 68%–95%–99.7% empirical rule, we can conclude that about two-thirds of the predictions should be within one standard error, or $4,109, of the actual overhead cost.
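The "about two-thirds" figure is just the normal probability of falling within one standard deviation of the mean, which can be confirmed with the standard error function (a quick check, not part of the regression output):

```python
from math import erf, sqrt

# P(|Z| <= 1) for a standard normal Z, via the error function:
p_within_one_se = erf(1 / sqrt(2))
print(round(p_within_one_se, 4))  # about 0.6827, i.e. roughly two-thirds
```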

(h) Compare the standard error of estimate with the standard errors from the single-variable equations for Overhead.

The comparison of the standard error se = $4,109 with the standard errors from the single-variable equations for Overhead, $8,585 and $9,457, shows that the multiple regression equation provides predictions that are more than twice as accurate as the single-variable equations – a big improvement.

(i) Interpret the coefficient of determination r².

The r² value is the percentage of variation of the response variable explained by the combined set of explanatory variables. MachHrs and ProdRuns combine to explain 86.6% of the variation in Overhead.


(j) Compare the r² coefficients for the multiple and single regression outputs.

The r² of 86.6% for the multiple regression is a big improvement over the single-variable equations, which were able to explain only 39.9% and 27.1% of the variation in Overhead. Remarkably, the combination of the two explanatory variables explains a larger percentage than the sum of their individual effects.

Note: This is an admittedly unusual case where the total is greater than the sum of its parts. This is not common, but this example shows that it is possible. Evidently, these two explanatory variables "line up" just right to predict overhead quite well.

(k) What is the correlation r?

The square root of r² (the multiple R in the Excel output) is the correlation between the fitted values ŷ and the observed values y of the response variable. For the Bendrix data the correlation between them is r = √0.866 = 0.931, quite high.

(l) Show a graph illustrating the correlation r.

A graphical indication of the high correlation can be seen in the plot of fitted y values versus observed y values of Overhead:
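The numerical relationship between multiple R and R-square is easy to verify from the rounded values in the output:

```python
from math import sqrt

r_square = 0.8664             # R-Square from the regression output
multiple_r = sqrt(r_square)   # correlation between fitted and observed y
print(round(multiple_r, 3))   # 0.931

# Squaring the reported multiple R recovers R-Square:
print(round(0.9308 ** 2, 4))  # 0.8664
```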

(m) Is collinearity a concern for this model?

Correlation Table   MachHrs   ProdRuns   Overhead
MachHrs             1.000
ProdRuns            -0.229    1.000
Overhead            0.632     0.521      1.000

[Figure: Scatterplot of Fit vs Overhead]

[Figure: Scatterplot of MachHrs vs ProdRuns]


The scatterplot of MachHrs and ProdRuns does not suggest a high correlation between these two variables. In fact, the correlation −0.229 is quite low: its absolute value, 0.229, is below 0.7, and also below 0.632, the largest of the correlations between Overhead and either MachHrs or ProdRuns. The variance inflation factor

VIF = 1 / (1 − (−0.229)²) = 1.055

is small as well (smaller than 10 and also smaller than the more conservative 5). Hence, collinearity is not present in this model.
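The VIF calculation above takes just two lines to reproduce (with only two explanatory variables, r is simply their correlation):

```python
# VIF = 1 / (1 - r^2), where r = -0.229 is the MachHrs-ProdRuns correlation
r = -0.229
vif = 1 / (1 - r ** 2)
print(round(vif, 3))  # 1.055, well below the cutoffs of 5 and 10
```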

(n) What inferences can be made about the regression coefficients b₁ and b₂?

Recall that each coefficient b represents a point estimate of the true, but unobservable, population parameter β, based on this particular sample. The corresponding SE indicates the accuracy of this point estimate.

For example, the point estimate of β₁, the effect on Overhead of a one-unit increase in MachHrs (when ProdRuns is held constant), is 43.536 and the standard error is 3.589.

From the regression output, you can be 95% confident that the true β₁ lies in the interval from 36.234 to 50.839. The 95% confidence interval for β₂ is from 716.276 to 1050.960.

(o) Perform tests for the significance of the regression coefficients.

Recall that the value of the test statistic for an individual regression coefficient is the ratio of the estimated coefficient to its standard error:

t = b / SE

Therefore, it indicates how many standard errors the regression coefficient is from zero. For example, the t-value for MachHrs is about 12.13, so the regression coefficient of MachHrs, 43.536, is more than 12 standard errors to the right of zero. Similarly, the coefficient of ProdRuns is more than 10 standard errors to the right of zero. To decide whether a particular explanatory variable belongs in the regression equation, the following test is performed:

H₀: β = 0
Hₐ: β ≠ 0

For MachHrs, the value of the test statistic is 12.13, and the associated P-value is less than 0.0001. This means that there is virtually no probability beyond the observed t-value. In words, you are still not exactly sure of the true slope coefficient β₁ of MachHrs, but you are virtually sure it is not zero. The same can be said for the true slope β₂ of ProdRuns.
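These t-values can be recomputed directly from the reported coefficients and standard errors (small rounding differences aside):

```python
# t = b / SE(b), the number of standard errors each coefficient is from zero
t_machhrs = 43.536 / 3.589
t_prodruns = 883.618 / 82.251
print(round(t_machhrs, 2), round(t_prodruns, 2))  # 12.13 10.74
```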


(p) How good is the overall fit?

From the regression output, the value of the F statistic is 107.0261 and the corresponding P-value is practically zero. This means that the regression equation provides a good fit.
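The F-ratio itself is just the ratio of the two mean squares in the ANOVA table, which can be checked directly:

```python
# F = (mean square explained) / (mean square unexplained)
ms_explained = 3614020661 / 2       # explained SS / its degrees of freedom
ms_unexplained = 557166199.1 / 33   # unexplained SS / its degrees of freedom
f_ratio = ms_explained / ms_unexplained
print(round(f_ratio, 4))  # 107.0261, matching the regression output
```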

Note: Even if the F test gives an extremely significant result, there is no guarantee that the regression equation provides a good enough fit for practical uses. For example, the Bendrix manager is trying to understand the causes of variation in overhead costs. This manager already knows that machine hours and production runs are related positively to overhead costs. What he really wants is a set of explanatory variables that yields a high r² and a low se. The low P-value in the ANOVA table does not guarantee these. All it guarantees is that MachHrs and ProdRuns are of some help in explaining variations in Overhead.

(q) Are the regression assumptions satisfied?

The plots of the residuals versus ŷ, x₁, and x₂ show random scatters without any patterns, clumping, or excessive increase/decrease in their variation around the horizontal zero line. Therefore, the linearity, independence, and equal-spread regression assumptions are satisfied.

Note: These data were collected over time. Therefore, to make a stringent decision about the independence of the residuals we should also inspect the time series plot of residuals:

[Figure: Predicted Overhead Residual Plot (residuals vs predicted Overhead)]

[Figure: MachHrs Residual Plot (residuals vs MachHrs)]

[Figure: ProdRuns Residual Plot (residuals vs ProdRuns)]


This plot shows signs of so-called lag 1 autocorrelation. To assess the severity of this condition, we need to perform a test which will be studied later when we explore topics in time series analysis. For now, consider that the independence assumption is not violated if the plots of residuals versus ŷ, x₁, and x₂ look random and do not show systematic patterns.

The histogram of the residuals is single-peaked with no apparent outliers. There is a left skew (skewness = −0.64) which is mild enough to be overcome by the least squares procedure. This is confirmed also by inspection of the normal probability plot. Except for the mild left skewness (indicated by the slight upward curve that then levels off), the points lie fairly close to a 45° line. Thus, the normality assumption is not seriously violated.

(r) Suppose Bendrix expects the values of MachHrs and ProdRuns for the next three months to be 1430, 1560, 1520, and 35, 45, 40, respectively. What are the point predictions and 95% prediction intervals for Overhead for these three months? First set up a second data set with the following column headings:

[Figure: Time Series of Residuals (residuals vs Month)]

[Figure: Histogram of Residuals]

[Figure: Q-Q Normal Plot of Residuals (standardized Q-values vs Z-values)]


Enter the values for Month, MachHrs, and ProdRuns. The last three columns can be blank or have values, but when regression is run with the prediction options, they will be filled in or overwritten. Define the entire region A1:F4 as a new StatTools data set named Data for Prediction. Then use StatTools as shown below:

The Overhead values in column D are the point predictions for the next three months, and the LowerLimit95 and UpperLimit95 values in columns E and F indicate the 95% prediction intervals. You can see from the wide prediction intervals how much uncertainty remains. The reason is the relatively large standard error of estimate, se = 4108.993. Contrary to what you might expect, this is not a sample size problem. That is, a larger sample size would probably not produce a smaller value of se. The whole problem is that MachHrs and ProdRuns are not perfectly correlated with Overhead. The only way to decrease se and get more accurate predictions is to find other explanatory variables that are more closely related to Overhead.
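A rough version of these intervals can be sketched by hand. The sketch below ignores the extra term for coefficient-estimation uncertainty, so it is slightly narrower than the exact StatTools intervals; the critical value 2.035 is the approximate 97.5th percentile of a t distribution with 33 degrees of freedom:

```python
se = 4108.993   # standard error of estimate from the regression output
t_crit = 2.035  # approx. t(0.025, 33 df), a tabled value

# Month 1: MachHrs = 1430, ProdRuns = 35
pred = 3997 + 43.54 * 1430 + 883.62 * 35
lower = pred - t_crit * se   # rough 95% prediction interval endpoints
upper = pred + t_crit * se
print(round(pred, 1), round(lower, 1), round(upper, 1))
```

The half-width of roughly 2·se either side of the point prediction is what makes the intervals so wide: the uncertainty is driven almost entirely by se, not by sample size.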

Note: StatTools provides prediction intervals for individual values, but it does not provide confidence intervals for the mean of y, given a set of x’s.


Validation of the Fit

Now suppose that this data set is from one of Bendrix's two plants. The company would like to predict overhead costs for the second plant by using data on machine hours and production runs at the first plant. (See Overhead_Costs_Validation.xlsx; the regression output for the first plant is at the Regression tab.) How well does the regression from the first plant fit data from the other plant?

Use the following steps to perform this validation:

1. Copy the results from the original regression to the ranges B5:D5 and B9:B10 in the Validation Data spreadsheet.

2. Calculate the fitted values.

The fitted values are now the predicted values of overhead for the second plant, based on the original regression equation. Find these by substituting the new values of MachHrs and ProdRuns into the original equation. Specifically, enter the formula

=$B$5+SUMPRODUCT($C$5:$D$5,B13:C13)

in cell E13 and copy it down. (You can also use the simpler formula =$B$5+$C$5*B13+$D$5*C13.)

3. Calculate the residuals (prediction errors for the second plant) by entering the formula

=D13-E13

in cell F13 and copying it down.

4. Calculate the coefficient of determination by entering the formula

=CORREL(E13:E48,D13:D48)^2

in cell C9.

5. Calculate the standard error of estimate.

The se value is essentially the square root of the average squared residual, except that it uses the denominator n – 3 (when there are two explanatory variables) rather than n – 1. Therefore, enter the formula

=SQRT(SUMSQ(F13:F48)/33)

in cell C10.
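The same arithmetic can be sketched outside the spreadsheet. The second-plant data below are hypothetical stand-ins (the actual values are in Overhead_Costs_Validation.xlsx); the computations mirror the five steps: fit with the first plant's coefficients, form residuals, take r² as the squared correlation of fitted and actual values, and compute se with denominator n − 3:

```python
from math import sqrt

# First-plant regression coefficients (from the output above)
a, b1, b2 = 3996.678, 43.536, 883.618

# Hypothetical second-plant months (stand-ins, not the workbook values)
mach_hrs  = [1200, 1300, 1400, 1500, 1350]
prod_runs = [30, 45, 35, 50, 40]
overhead  = [85000, 99000, 96000, 113000, 98000]

n = len(overhead)
fitted = [a + b1 * m + b2 * p for m, p in zip(mach_hrs, prod_runs)]
residuals = [obs - fit for obs, fit in zip(overhead, fitted)]

def corr(u, v):
    """Pearson correlation, the CORREL(...) step."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    suv = sum((x - mu) * (y - mv) for x, y in zip(u, v))
    suu = sum((x - mu) ** 2 for x in u)
    svv = sum((y - mv) ** 2 for y in v)
    return suv / sqrt(suu * svv)

r_square = corr(fitted, overhead) ** 2                # =CORREL(fit, actual)^2
se = sqrt(sum(r * r for r in residuals) / (n - 3))    # =SQRT(SUMSQ(resid)/(n-3))
print(round(r_square, 3), round(se, 1))
```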

The results are typical: validation results are usually not as good as the original results. The value of r² has decreased from 86.6% to 77.3%, and the value of se has increased from $4,109 to $5,257. Nevertheless, Bendrix might conclude that the original regression equation is adequate for making future predictions at either plant.