
MULTIPLE REGRESSION –  PART 1

Topics Outline

Multiple Regression Model

Inferences about Regression Coefficients

F Test for the Overall Fit

Residual Analysis

Collinearity

Multiple Regression Model

Multiple regression models use two or more explanatory (independent) variables to predict the value of a response (dependent) variable. With k explanatory variables, the multiple regression model is expressed as follows:

y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε

Here β₀, β₁, β₂, …, βₖ are the parameters and the error term ε is a random variable which accounts for the variability in y that cannot be explained by the linear effect of the k explanatory variables. The assumptions about the error term in the multiple regression model parallel those for the simple regression model.

Regression Assumptions

1. Linearity

The error term ε is a random variable with a mean of 0.

Implication: For given values of x₁, x₂, …, xₖ, the expected, or average, value of y is given by

E(y) = μ_y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

(The relationship is "linear" because each term on the right-hand side of the equation is additive, and the regression parameters do not enter the equation in a nonlinear manner, such as βᵢ²xᵢ. The graph of the relationship is no longer a line, however, because there are more than two variables involved.)

2. Independence

The values of ε are statistically independent.

Implication: The value of y for a particular set of values for the explanatory variables is not related to the value of y for any other set of values.

3. Normality

The error term ε is a normally distributed random variable (with mean 0 and standard deviation σ).

Implication: Because β₀, β₁, β₂, …, βₖ are constants for the given values of x₁, x₂, …, xₖ, the response variable y is also a normally distributed random variable (with mean μ_y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ and standard deviation σ).

4. Equal spread

The standard deviation of ε is the same for all values of the explanatory variables x₁, x₂, …, xₖ.

Implication: The standard deviation of y about the regression line equals σ and is the same for all values of x₁, x₂, …, xₖ.


Assumption 1 implies that the true population surface ("plane", "line") is

E(y) = μ_y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ

Sometimes we refer to this surface as the surface (plane, line) of means.

In the simple linear regression, the slope represents the change in the mean of y per unit change in x and does not take into account any other variables.

In the multiple regression model, the slope β₁ represents the change in the mean of y per unit change in x₁, taking into account the effect of x₂, x₃, …, xₖ.

The estimation process for multiple regression is shown in Figure 1. As in the case of simple linear regression, you use a simple random sample and the least squares method – that is, minimizing the sum of squared residuals, min Σᵢ₌₁ⁿ (yᵢ − ŷᵢ)² – to compute sample regression coefficients a, b₁, …, bₖ as estimates of the population parameters β₀, β₁, …, βₖ.

(In multiple regression, the presentation of the formulas for the regression coefficients involves the use of matrix algebra and is beyond the scope of this course.)

Figure 1  The estimation process for multiple regression

Multiple regression model: y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ + ε  (σ – st. dev. of ε)
True population surface: μ_y = β₀ + β₁x₁ + β₂x₂ + … + βₖxₖ
Regression parameters: β₀, β₁, β₂, …, βₖ, σ
↓
Collect sample data on x₁, x₂, …, xₖ and y
↓
Compute the sample statistics a, b₁, b₂, …, bₖ, sₑ and the estimated regression equation ŷ = a + b₁x₁ + b₂x₂ + … + bₖxₖ
↓
The values of a, b₁, b₂, …, bₖ, sₑ provide the estimates of β₀, β₁, β₂, …, βₖ, σ
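Although those formulas are beyond the scope of the course, the idea is easy to sketch in software: build a design matrix X with a column of ones (for the intercept) and one column per explanatory variable, then solve the least squares problem numerically. The Python sketch below is only an illustration with made-up numbers, not the OmniPower data.

import numpy as np

# Hypothetical data for illustration only (k = 2 explanatory variables).
x1 = np.array([59, 59, 79, 79, 99, 99], dtype=float)        # e.g., price in cents
x2 = np.array([200, 400, 200, 400, 400, 600], dtype=float)  # e.g., promotion in dollars
y  = np.array([4100, 4400, 3300, 3600, 2800, 3100], dtype=float)

# Design matrix: a column of ones for the intercept a, then x1 and x2.
X = np.column_stack([np.ones_like(x1), x1, x2])

# Solve min sum (y_i - yhat_i)^2; this is the matrix-algebra step mentioned above.
coeffs, *_ = np.linalg.lstsq(X, y, rcond=None)
a, b1, b2 = coeffs
print(f"a = {a:.4f}, b1 = {b1:.4f}, b2 = {b2:.4f}")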


The sample statistics a, b₁, …, bₖ provide the following estimated multiple regression equation

ŷ = a + b₁x₁ + b₂x₂ + … + bₖxₖ

where a is again the y-intercept, and b₁ through bₖ are the "slopes". This is the equation of the fitted surface, also known as the least squares surface (plane, line).

Graphically, you are no longer fitting a line to a set of points. If there are exactly two explanatory variables, you are fitting a plane to the data in three-dimensional space. There is one dimension for the response variable and one for each of the two explanatory variables.

If there are more than two explanatory variables, then you can only imagine the regression surface; drawing in four or more dimensions is impossible.

Interpretation of Regression Coefficients

The intercept a is the predicted value of y when all of the x's equal zero. (Of course, this makes sense only if it is practical for all of the x's to equal zero, which is seldom the case.)

Each slope coefficient is the predicted change in y per unit change in a particular x, holding constant the effect of the other x variables. For example, b₁ is the predicted change in y when x₁ increases by one unit and the other x's in the equation, x₂ through xₖ, remain constant.


Example 1

OmniFoods

OmniFoods is a large food products company. The company is planning a nationwide introduction of OmniPower, a new high-energy bar. Originally marketed to runners, mountain climbers, and other athletes, high-energy bars are now popular with the general public. OmniFoods is anxious to capture a share of this thriving market. The business objective facing the marketing manager at OmniFoods is to develop a model to predict monthly sales volume per store of OmniPower bars and to determine what variables influence sales. Two explanatory variables are considered here:

x₁ – the price of an OmniPower bar, measured in cents, and
x₂ – the monthly budget for in-store promotional expenditures, measured in dollars.

In-store promotional expenditures typically include signs and displays, in-store coupons, and free samples. The response variable y is the number of OmniPower bars sold in a month.

Data are collected (and stored in OmniPower.xlsx) from a sample of 34 stores in a supermarket chain selected for a test-market study of OmniPower.

Store   Number of Bars   Price (cents)   Promotion ($)
  1          4141              59              200
  2          3842              59              200
  …            …               …               …
 33          3354              99              600
 34          2927              99              600

Here is the regression output.

 Regression Statistics

Multiple R 0.8705

R Square 0.7577

Adjusted R Square 0.7421

Standard Error 638.0653

Observations 34

ANOVA

df SS MS F Significance F

Regression 2 39472730.77 19736365.39 48.48 0.0000

Residual 31 12620946.67 407127.31

Total 33 52093677.44

Coefficients Standard Error t Stat P-value Lower 95% Upper 95%

Intercept 5837.5208 628.1502 9.2932 0.0000 4556.3999 7118.641

Price -53.2173 6.8522 -7.7664 0.0000 -67.1925 -39.242

Promotion 3.6131 0.6852 5.2728 0.0000 2.2155 5.010


The computed values of the regression coefficients are a = 5,837.5208, b₁ = −53.2173, b₂ = 3.6131. Therefore, the multiple regression equation (representing the fitted regression plane) is

ŷ = 5,837.5208 − 53.2173x₁ + 3.6131x₂

or

Predicted Bars = 5,837.5208 − 53.2173 Price + 3.6131 Promotion
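This output can also be reproduced outside Excel. As a rough sketch (not part of the original handout), the Python code below fits the same model with the statsmodels library; the column names Bars, Price, and Promotion are assumed to match the layout of OmniPower.xlsx.

import pandas as pd
import statsmodels.api as sm

# Read the test-market sample; column names are assumed, not confirmed by the workbook.
data = pd.read_excel("OmniPower.xlsx")

X = sm.add_constant(data[["Price", "Promotion"]])   # adds the intercept column
y = data["Bars"]

model = sm.OLS(y, X).fit()
print(model.summary())   # coefficients, standard errors, t statistics, overall F test
print(model.params)      # should be close to a = 5837.52, b1 = -53.22, b2 = 3.61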

Interpretation of intercept

The sample y-intercept (a = 5,837.5208 ≈ 6,000) estimates the number of OmniPower bars sold in a month if the price is $0.00 and the total amount spent on promotional expenditures is also $0.00. Because these values of price and promotion are outside the range of price and promotion used in the test-market study, and because they make no sense in the context of the problem, the value of a has little or no practical interpretation.

Interpretation of slope coefficients

The slope of price with OmniPower sales (b₁ = −53.2173) indicates that, for a given amount of monthly promotional expenditures, the predicted sales of OmniPower are estimated to decrease by 53.2173 ≈ 53 bars per month for each 1-cent increase in the price.

The slope of monthly promotional expenditures with OmniPower sales (b₂ = 3.6131) indicates that, for a given price, the estimated sales of OmniPower are predicted to increase by 3.6131 ≈ 4 bars for each additional $1 spent on promotions.

These estimates allow you to better understand the likely effect that price and promotion decisions will have in the marketplace. For example, a 10-cent decrease in price is predicted to increase sales by 532.173 ≈ 532 bars, with a fixed amount of monthly promotional expenditures. A $100 increase in promotional expenditures is predicted to increase sales by 361.31 ≈ 361 bars, for a given price.

Predicting the Response Variable

What are the predicted sales for a store charging 79 cents per bar during a month in which promotional expenditures are $400?

Using the multiple regression equation with x₁ = 79 and x₂ = 400,

ŷ = 5,837.5208 − 53.2173(79) + 3.6131(400) = 3,078.57

Thus, stores charging 79 cents per bar and spending $400 in promotional expenditures will sell 3,078.57 ≈ 3,079 OmniPower bars per month.
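The same prediction can be checked directly from the fitted coefficients, for example:

# Predicted sales at Price = 79 cents and Promotion = $400, using the coefficients above.
a, b1, b2 = 5837.5208, -53.2173, 3.6131
y_hat = a + b1 * 79 + b2 * 400
print(round(y_hat, 2))   # about 3078.57, i.e. roughly 3,079 bars per month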


Interpretation of sₑ, r², and r

The interpretation of these quantities is almost exactly the same as in simple regression.

The standard error of estimate sₑ is essentially the standard deviation of the residuals, but it is now given by the following equation

sₑ = √( Σ eᵢ² / (n − k − 1) )

where n is the number of observations and k is the number of explanatory variables in the equation.

Fortunately, you can interpret sₑ exactly as before. It is a measure of the typical prediction error when the multiple regression equation is used to predict the response variable.

The coefficient of determination r² is again the proportion of variation in the response variable y explained by the combined set of explanatory variables x₁, x₂, …, xₖ. In fact, it even has the same formula as before:

r² = SSR / SST = (Regression Sum of Squares) / (Total Sum of Squares)

In the OmniPower example (see Excel output), SSR = 39,472,730.77 and SST = 52,093,677.44. Thus,

r² = SSR / SST = 39,472,730.77 / 52,093,677.44 = 0.7577

The coefficient of determination indicates that 75.77% ≈ 76% of the variation in sales is explained by the variation in the price and in the promotional expenditures.
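Both sₑ and r² follow directly from the sums of squares in the ANOVA table. A short Python check using the values reported above:

import math

# Sums of squares from the ANOVA portion of the regression output.
SSR, SSE, SST = 39_472_730.77, 12_620_946.67, 52_093_677.44
n, k = 34, 2

r_squared = SSR / SST               # about 0.7577
s_e = math.sqrt(SSE / (n - k - 1))  # about 638.07, the "Standard Error" in the output
print(r_squared, s_e)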

The square root of r² is the correlation r between the fitted values ŷ and the observed values y of the response variable – in both simple and multiple regression.

A graphical indication of the correlation can be seen in the plot of fitted (predicted) ŷ values versus observed y values. If the regression equation gave perfect predictions, all of the points in this plot would lie on a 45° line – each fitted value would equal the corresponding observed value. Although a perfect fit virtually never occurs, the closer the points are to a 45° line, the better the fit is.

The correlation in the OmniPower example is r = √0.7577 ≈ 0.87, indicating a strong relationship between the two explanatory variables and the response variable. This is confirmed by the scatterplot of ŷ values versus y values:


Inferences about Regression Coefficients

t Tests for Significance

In a simple linear regression model, to test a hypothesis H₀: β = 0 concerning the population slope β, we used the test statistic t = b / SE_b with df = n − 2 degrees of freedom.

Similarly, in multiple regression, to test a hypothesis concerning the population slope βⱼ for variable xⱼ (holding constant the effects of all other explanatory variables),

H₀: βⱼ = 0
Hₐ: βⱼ ≠ 0

we use the test statistic t = bⱼ / SE_bⱼ with df = n − k − 1, where k is the number of explanatory variables in the regression equation.

In our example, to determine whether variable x₂ (amount of promotional expenditures) has a significant effect on sales, taking into account the price of OmniPower bars, the null and alternative hypotheses are

H₀: β₂ = 0
Hₐ: β₂ ≠ 0

The test statistic is

t = b₂ / SE_b₂ = 3.6131 / 0.6852 = 5.2728   with df = n − k − 1 = 34 − 2 − 1 = 31

The P-value is extremely small. Therefore, we reject the null hypothesis that there is no significant relationship between x₂ (promotional expenditures) and y (sales) and conclude that there is a strong significant relationship between promotional expenditures and sales, taking into account the price x₁.

For the slope of sales with price, the respective test statistic and P-value are: t = −7.7664, P-value ≈ 0. Thus, there is a significant relationship between price x₁ and sales, taking into account the promotional expenditures x₂.
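For reference, the t statistic and its two-sided P-value can be verified with a few lines of Python (scipy), using the coefficient and standard error from the regression output above:

from scipy import stats

n, k = 34, 2
df = n - k - 1                 # 31 degrees of freedom

b2, se_b2 = 3.6131, 0.6852     # promotion coefficient and its standard error
t_stat = b2 / se_b2            # about 5.27
p_value = 2 * stats.t.sf(abs(t_stat), df)   # two-sided P-value, essentially 0
print(t_stat, p_value)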

[Scatterplot: Predicted Bars versus Observed Bars]


If we fail to reject the null hypothesis for a multiple regression coefficient, it does not mean that the corresponding explanatory variable has no linear relationship to y. It means that the corresponding explanatory variable contributes nothing to modeling y after allowing for all the other explanatory variables.

The parameter βⱼ in a multiple regression model can be quite different from zero even when there is no simple linear relationship between xⱼ and y. The coefficient of xⱼ in a multiple regression depends as much on the other explanatory variables as it does on xⱼ. It is even possible that the multiple regression slope changes sign when a new variable enters the regression model.

Confidence Intervals

To estimate the value of a population slope βⱼ in multiple regression, we can use the following confidence interval

bⱼ ± t* · SE_bⱼ

where t* is the critical value for a t distribution with df = n − k − 1 degrees of freedom.

To construct a 95% confidence interval estimate of the population slope β₁ (the effect of price x₁ on sales y, holding constant the effect of promotional expenditures x₂), the critical value of t at the 95% confidence level with 31 degrees of freedom is t* = 2.0395.
(Note: For df = 30, the t-Table gives t* = 2.042.)

Then, using the information from the Excel output,

b₁ ± t* · SE_b₁ = −53.2173 ± 2.0395(6.8522) = −53.2173 ± 13.9752 = −67.1925 to −39.2421

Taking into account the effect of promotional expenditures, the estimated effect of a 1-cent increase in price is to reduce mean sales by approximately 39.2 to 67.2 bars. You have 95% confidence that this interval correctly estimates the relationship between these variables.

From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you conclude that the regression coefficient β₁ has a significant effect.

The 95% confidence interval for the slope of sales with promotional expenditures is

b₂ ± t* · SE_b₂ = 3.6131 ± 2.0395(0.6852) = 3.6131 ± 1.3975 = 2.2156 to 5.0106

Thus, taking into account the effect of price, the estimated effect of each additional dollar of promotional expenditures is to increase mean sales by approximately 2.22 to 5.01 bars. You have 95% confidence that this interval correctly estimates the relationship between these variables.

From a hypothesis-testing viewpoint, because this confidence interval does not include 0, you can conclude that the regression coefficient β₂ has a significant effect.
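A quick Python check of both intervals, using the t critical value for 31 degrees of freedom and the standard errors reported in the output:

from scipy import stats

df = 34 - 2 - 1                       # n - k - 1 = 31
t_crit = stats.t.ppf(0.975, df)       # about 2.0395 for a 95% interval

b1, se_b1 = -53.2173, 6.8522          # price coefficient and standard error
b2, se_b2 = 3.6131, 0.6852            # promotion coefficient and standard error

ci_price = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)       # about (-67.19, -39.24)
ci_promotion = (b2 - t_crit * se_b2, b2 + t_crit * se_b2)   # about (2.22, 5.01)
print(ci_price, ci_promotion)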


F Test for the Overall Fit

In simple linear regression, the t test and the F test provide the same conclusion; that is, if the null hypothesis is rejected, we conclude that β ≠ 0. In multiple regression, the t test and the F test have different purposes. The t test of significance for a specific regression coefficient in multiple regression is a test for the significance of adding that variable into a regression model, given that the other variables are included. In other words, the t test for the regression coefficient is actually a test for the contribution of each explanatory variable. The overall F test is used to determine whether there is a significant relationship between the response variable and the entire set of explanatory variables. We also say that it determines the explanatory power of the model.

The null and alternative hypotheses for the F test are:

H₀: β₁ = β₂ = … = βₖ = 0   (There is no significant relationship between the response variable and the explanatory variables.)

Hₐ: At least one βⱼ ≠ 0   (There is a significant relationship between the response variable and at least one of the explanatory variables.)

Failing to reject the null hypothesis implies that the explanatory variables are of little or no use in explaining the variation in the response variable; that is, the regression model predicts no better than just using the mean. Rejection of the null hypothesis implies that at least one of the explanatory variables helps explain the variation in y and, therefore, the regression model is useful.

The ANOVA table for multiple regression has the following form.

Source of Variation   Degrees of Freedom   Sum of Squares   Mean Squares (Variance)    F statistic     P-value
Regression            k                    SSR              MSR = SSR / k              F = MSR / MSE   Prob > F
Error                 n − k − 1            SSE              MSE = SSE / (n − k − 1)
Total                 n − 1                SST

The F test statistic follows an F-distribution with k and (n − k − 1) degrees of freedom.

For our example, the hypotheses are:

H₀: β₁ = β₂ = 0
Hₐ: β₁ and/or β₂ is not equal to zero

The corresponding F distribution has df1 = 2 and df2 = n − 2 − 1 = 34 − 3 = 31 degrees of freedom. The test statistic is F = 48.4771 and the corresponding P-value is

P-value = FDIST(48.4771, 2, 31) = 0.00000000029 ≈ 0

We reject H₀ and conclude that at least one of the explanatory variables (price and/or promotional expenditures) is related to sales.
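The F statistic and its P-value can also be recomputed from the sums of squares. A short Python check using scipy (the numbers come from the ANOVA table above):

from scipy import stats

SSR, SSE = 39_472_730.77, 12_620_946.67
n, k = 34, 2

MSR = SSR / k                          # regression mean square
MSE = SSE / (n - k - 1)                # error mean square
F = MSR / MSE                          # about 48.48
p_value = stats.f.sf(F, k, n - k - 1)  # upper-tail probability, essentially 0
print(F, p_value)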


Residual Analysis

Three types of residual plots are appropriate for multiple regression.

1. Residuals versus ŷ (the predicted values of y)

This plot should look patternless. If the residuals show a pattern (e.g. a trend, bend, clumping), there is evidence of a possible curvilinear effect in at least one explanatory variable, a possible violation of the assumption of equal variance, and/or the need to transform the y variable.

2. Residuals versus each x

Patterns in the plot of the residuals versus an explanatory variable may indicate the existence of a curvilinear effect and, therefore, the need to add a curvilinear explanatory variable to the multiple regression model.

3. Residuals versus time

This plot is used to investigate patterns in the residuals in order to validate the independence assumption when one of the x-variables is related to time or is itself time.

Below are the residual plots for the OmniPower sales example. There is very little or no pattern in the relationship between the residuals and the predicted value of y, the value of x₁ (price), or the value of x₂ (promotional expenditures). Thus, you can conclude that the multiple regression model is appropriate for predicting sales.

There is no need to plot the residuals versus time because the data were not collected in time order.
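If you want to reproduce the first two types of plots yourself, the sketch below uses matplotlib together with a statsmodels fit; the OmniPower.xlsx column names (Bars, Price, Promotion) are assumed, as in the earlier sketches.

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt

# Fit the model (column names assumed) and plot residuals against the
# predicted values and against each explanatory variable.
data = pd.read_excel("OmniPower.xlsx")
model = sm.OLS(data["Bars"], sm.add_constant(data[["Price", "Promotion"]])).fit()
resid = model.resid
fitted = model.fittedvalues

fig, axes = plt.subplots(1, 3, figsize=(12, 3.5))
for ax, x, label in zip(axes,
                        [fitted, data["Price"], data["Promotion"]],
                        ["Predicted Bars", "Price", "Promotion"]):
    ax.scatter(x, resid)
    ax.axhline(0, linewidth=1)   # reference line at residual = 0
    ax.set_xlabel(label)
    ax.set_ylabel("Residuals")
plt.tight_layout()
plt.show()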

[Residual plot: Residuals versus Predicted Bars]


The third regression assumption states that the errors are normally distributed. We can check it the same way as we did in simple regression – by forming a histogram or a normal probability (Q-Q) plot of the residuals. If the third assumption holds, the histogram should be approximately symmetric and bell-shaped, and the points in the normal probability plot should be close to a 45° line. But if there is an obvious skewness, too many residuals more than, say, two standard deviations from the mean, or some other nonnormal property, this indicates a violation of the third assumption.

Neither the histogram nor the normal probability plot for the OmniPower example shows any severe signs of departure from normality.
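A sketch for producing the histogram and the normal probability (Q-Q) plot of the residuals in Python, again assuming the OmniPower.xlsx column layout used above:

import pandas as pd
import statsmodels.api as sm
import matplotlib.pyplot as plt
from scipy import stats

# Fit the model (column names assumed) and examine the residuals for normality.
data = pd.read_excel("OmniPower.xlsx")
model = sm.OLS(data["Bars"], sm.add_constant(data[["Price", "Promotion"]])).fit()
resid = model.resid

fig, axes = plt.subplots(1, 2, figsize=(9, 3.5))
axes[0].hist(resid, bins=7)                         # should look roughly bell-shaped
axes[0].set_title("Histogram of Residuals")
stats.probplot(resid, dist="norm", plot=axes[1])    # points should track the 45-degree line
axes[1].set_title("Normal Q-Q Plot of Residuals")
plt.tight_layout()
plt.show()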

[Residual plots: Price Residual Plot (Residuals versus Price) and Promotion Residual Plot (Residuals versus Promotion)]

[Histogram of Residuals and Q-Q Normal Plot of Residuals]


Collinearity

Most explanatory variables in a multiple regression problem are correlated to some degree with one another. For example, in the OmniPower case the correlation matrix is

                  Price (x₁)   Promotion (x₂)   Bars (y)
Price (x₁)          1.0000
Promotion (x₂)     −0.0968        1.0000
Bars (y)           −0.7351        0.5351         1.0000

The correlation between price and promotion is −0.0968. Thus, we find some degree of linear association between the two explanatory variables.
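The correlation matrix can be reproduced with pandas (column names assumed as before):

import pandas as pd

# Pairwise correlations among the explanatory variables and the response.
data = pd.read_excel("OmniPower.xlsx")
print(data[["Price", "Promotion", "Bars"]].corr().round(4))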

Low correlations among the explanatory variables generally do not result in serious deterioration of the quality of the least squares estimates. However, when the explanatory variables are highly correlated, it becomes difficult to determine the separate effect of any particular explanatory variable on the response variable. We interpret the regression coefficients as measuring the change in the response variable when the corresponding explanatory variable increases by 1 unit while all the other explanatory variables are held constant. The interpretation may be impossible when the explanatory variables are highly correlated, because when the explanatory variable changes by 1 unit, some or all of the other explanatory variables will change.

Collinearity (also called multicollinearity or intercorrelation) is a condition that exists when two or more of the explanatory variables are highly correlated with each other. When highly correlated explanatory variables are included in the regression model, they can adversely affect the regression results. Two of the most serious problems that can arise are:

1. The estimated regression coefficients may be far from the population parameters, including the possibility that the statistic and the parameter being estimated may have opposite signs. For example, the true slope β₂ might actually be +10 and b₂, its estimate, might turn out to be −3.

2. You might find a regression that is very highly significant based on the F test but for which not even one of the t tests of the individual x variables is significant. Thus, variables that are really related to the response variable can look like they aren't related, based on their P-values. In other words, the regression result is telling you that the x variables taken as a group explain a lot about y, but it is impossible to single out any particular x variables as being responsible.

Statisticians have developed several routines for determining whether collinearity is high enough to cause problems. Here are the three most widely used techniques:

1. Pairwise correlations between x's

The rule of thumb suggests that collinearity is a potential problem if the absolute value of the correlation between any two explanatory variables exceeds 0.7.
(Note: Some statisticians suggest a cutoff of 0.5 instead of 0.7.)

2. Pairwise correlations between y and x's

The rule of thumb suggests that collinearity may be a serious problem if any of the pairwise correlations among the x variables is larger than the largest of the correlations between the y variable and the x variables.


3. Variance inflation factors

The statistic that measures the degree of collinearity of the j-th explanatory variable with the other explanatory variables is called the variance inflation factor (VIF) and is found as:

VIFⱼ = 1 / (1 − rⱼ²)

where rⱼ² is the coefficient of determination for a regression model using variable xⱼ as the response variable and all other x variables as explanatory variables.

The VIF tells how much the variance of the regression coefficient has been inflated due to collinearity. The higher the VIF, the higher the standard error of its coefficient and the less it can contribute to the regression model. More specifically, rⱼ² shows how well the j-th explanatory variable can be predicted by the other explanatory variables. The 1 − rⱼ² term measures what that explanatory variable has left to bring to the model. If rⱼ² is high, then not only is that variable superfluous, but it can damage the regression model.

Since rⱼ² cannot be less than zero, the minimum value of the VIF is 1. If a set of explanatory variables is uncorrelated, then each rⱼ² = 0.0 and each VIFⱼ is equal to 1. As rⱼ² increases, VIFⱼ increases also. For example, if rⱼ² = 0.9, then VIFⱼ = 1/(1 − 0.9) = 10; if rⱼ² = 0.99, then VIFⱼ = 1/(1 − 0.99) = 100.

How large the VIFs must be to suggest a serious problem with collinearity is not completely clear. In general, any individual VIFⱼ larger than 10 is considered an indication of a potential collinearity problem. (Note: Some statisticians suggest using the cutoff of 5 instead of 10.)

In the OmniPower sales data, the correlation between the two explanatory variables, price and promotional expenditures, is −0.0968. Because there are only two explanatory variables in the model,

VIF₁ = VIF₂ = 1 / (1 − (−0.0968)²) = 1.009

Since all VIFs (two in this example) are less than 10 (or less than the more conservative value of 5), you can conclude that there is no problem with collinearity for the OmniPower sales data.
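The VIFs can also be computed directly, for example with the variance_inflation_factor function in statsmodels (column names assumed as in the earlier sketches):

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Build the same design matrix used for the regression (constant plus Price and Promotion).
data = pd.read_excel("OmniPower.xlsx")
X = sm.add_constant(data[["Price", "Promotion"]])

# Skip the constant column; report the VIF for each explanatory variable.
vifs = {col: variance_inflation_factor(X.values, i)
        for i, col in enumerate(X.columns) if col != "const"}
print(vifs)   # both should be about 1.009, matching 1 / (1 - (-0.0968)^2)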

One solution to the collinearity problem is to delete the variable with the largest VIF value. The reduced model is often free of collinearity problems.

Another solution is to redefine some of the variables so that each x variable has a clear, unique role in explaining y. For example, if x₁ and x₂ are collinear, you might try using x₁ and the ratio x₂/x₁ instead.

If possible, every attempt should be made to avoid including explanatory variables that are highly correlated. In practice, however, strict adherence to this policy is rarely achievable.