
Page 1: Chapter 14

Copyright © 2010 by The McGraw-Hill Companies, Inc. All rights reserved.

McGraw-Hill/Irwin

Multiple Regression and Model Building

Chapter 14

Page 2: Chapter 14

14-2

Multiple Regression and Model Building

14.1 The Multiple Regression Model and the Least Squares Point Estimate

14.2 Model Assumptions and the Standard Error

14.3 R² and Adjusted R²

14.4 The Overall F Test

14.5 Testing the Significance of an Independent Variable

Page 3: Chapter 14

14-3

Multiple Regression and Model Building Continued

14.6 Confidence and Prediction Intervals

14.7 Using Dummy Variables to Model Qualitative Independent Variables

14.8 Model Building and the Effects of Multicollinearity

14.9 Residual Analysis in Multiple Regression

Page 4: Chapter 14

14-4

14.1 The Multiple Regression Model and the Least Squares Point Estimate

Simple linear regression used one independent variable to explain the dependent variable

Multiple regression uses two or more independent variables to describe the dependent variable

This allows multiple regression models to handle more complex situations

There is no limit to the number of independent variables a model can use

The model still has only one dependent variable

Page 5: Chapter 14

14-5

The Multiple Regression Model

The linear regression model relating y to x1, x2,…, xk is

y = β0 + β1x1 + β2x2 + … + βkxk + ε

µy = β0 + β1x1 + β2x2 + … + βkxk is the mean value of the dependent variable y

β0, β1, β2,…, βk are the unknown regression parameters relating the mean value of y to x1, x2,…, xk

ε is an error term that describes the effects on y of all factors other than the independent variables x1, x2,…, xk

Page 6: Chapter 14

14-6

The Least Squares Estimates and Point Estimation and Prediction

Estimation/prediction equation:

ŷ = b0 + b1x01 + b2x02 + … + bkx0k

ŷ is the point estimate of the mean value of the dependent variable when the values of the independent variables are x01, x02,…, x0k

It is also the point prediction of an individual value of the dependent variable when the values of the independent variables are x01, x02,…, x0k

b0, b1, b2,…, bk are the least squares point estimates of the parameters β0, β1, β2,…, βk

x01, x02,…, x0k are specified values of the independent predictor variables x1, x2,…, xk
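As a concrete illustration, here is a minimal Python sketch of computing the least squares point estimates b0, b1,…, bk and a point prediction. The data values and the prediction point (x01 = 40, x02 = 10) are hypothetical, invented for illustration only (not the textbook's fuel consumption data).

```python
import numpy as np

# Hypothetical data: n = 8 observations on k = 2 independent variables
x1 = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
x2 = np.array([18.0, 14.0, 24.0, 22.0, 8.0, 16.0, 1.0, 0.0])
y  = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])

# Design matrix with a leading column of 1s for the intercept b0
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least squares point estimates b0, b1, b2
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2 =", b)

# Point estimate/prediction at the specified values x01 = 40, x02 = 10
x0 = np.array([1.0, 40.0, 10.0])
print("y_hat =", x0 @ b)
```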

Page 7: Chapter 14

14-7

Fuel Consumption Case MINITAB Output

Figure 14.4 (a)

Page 8: Chapter 14

14-8

14.2 Model Assumptions and the Standard Error

The model is

y = β0 + β1x1 + β2x2 + … + βkxk + ε

Assumptions for multiple regression are stated about the model error terms, the ε's

Page 9: Chapter 14

14-9

The Regression Model Assumptions

1. Mean of Zero Assumption
2. Constant Variance Assumption
3. Normality Assumption
4. Independence Assumption

Page 10: Chapter 14

14-10

Sum of Squares

Sum of Squared Errors: SSE = Σei² = Σ(yi - ŷi)²

Mean Square Error (point estimate of σ²): s² = MSE = SSE / (n - (k + 1))

Standard Error (point estimate of σ): s = √MSE = √[SSE / (n - (k + 1))]
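A short sketch of these quantities in Python, using hypothetical observed and fitted values (both invented for illustration):

```python
import numpy as np

# Hypothetical observed values and fitted values from a k = 2 model
y     = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
y_hat = np.array([12.1, 11.9, 12.2, 11.0, 9.6, 9.3, 8.1, 7.6])
n, k = len(y), 2

SSE = np.sum((y - y_hat) ** 2)   # sum of squared errors
MSE = SSE / (n - (k + 1))        # mean square error, s^2
s   = np.sqrt(MSE)               # standard error
print(f"SSE = {SSE:.4f}, s^2 = {MSE:.4f}, s = {s:.4f}")
```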

Page 11: Chapter 14

14-11

14.3 R2 and Adjusted R2

1. Total variation is given by the formula Σ(yi - ȳ)²

2. Explained variation is given by the formula Σ(ŷi - ȳ)²

3. Unexplained variation is given by the formula Σ(yi - ŷi)²

4. Total variation is the sum of explained and unexplained variation
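The decomposition in item 4 can be verified numerically. A minimal sketch with hypothetical data fitted by least squares (the identity holds exactly only for least squares fits that include an intercept):

```python
import numpy as np

# Hypothetical data, fit by least squares so the decomposition holds
x1 = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
x2 = np.array([18.0, 14.0, 24.0, 22.0, 8.0, 16.0, 1.0, 0.0])
y  = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
X  = np.column_stack([np.ones_like(x1), x1, x2])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b

total       = np.sum((y - y.mean()) ** 2)      # Σ(yi - ȳ)²
explained   = np.sum((y_hat - y.mean()) ** 2)  # Σ(ŷi - ȳ)²
unexplained = np.sum((y - y_hat) ** 2)         # Σ(yi - ŷi)², the SSE

# Total variation equals explained plus unexplained (up to rounding)
print(total, explained + unexplained)
```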

Page 12: Chapter 14

14-12

R2 and Adjusted R2 Continued

5. The multiple coefficient of determination is the ratio of explained variation to total variation

6. R2 is the proportion of the total variation that is explained by the overall regression model

7. Multiple correlation coefficient R is the square root of R2

Page 13: Chapter 14

14-13

Multiple Correlation Coefficient R

The multiple correlation coefficient R is just the square root of R2

With simple linear regression, r would take on the sign of b1

There are multiple bi’s with multiple regression

For this reason, R is always positive

To interpret the direction of the relationship between the x's and y, you must look to the sign of the appropriate bi coefficient

Page 14: Chapter 14

14-14

The Adjusted R2

Adding an independent variable to multiple regression will raise R2

R2 will rise slightly even if the new variable has no relationship to y

The adjusted R2 corrects this tendency in R2

As a result, it gives a better estimate of the importance of the independent variables

Adjusted R² = (R² - k/(n - 1)) × ((n - 1)/(n - (k + 1)))
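A quick numeric check of the adjusted R² formula above, using made-up values for the explained and total variation:

```python
# Hypothetical quantities from a fitted model with n = 8, k = 2
n, k = 8, 2
explained, total = 24.22, 25.55   # made-up variations
R2 = explained / total

# Adjusted R-squared, using the formula above
R2_adj = (R2 - k / (n - 1)) * ((n - 1) / (n - (k + 1)))
print(f"R2 = {R2:.4f}, adjusted R2 = {R2_adj:.4f}")
```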

Page 15: Chapter 14

14-15

14.4 The Overall F Test

H0: β1 = β2 = … = βk = 0 versus Ha: At least one of β1, β2,…, βk ≠ 0

The test statistic is

F(model) = (Explained variation / k) / [Unexplained variation / (n - (k + 1))]

Reject H0 in favor of Ha if F(model) > F* or p-value < α

F* is based on k numerator and n - (k + 1) denominator degrees of freedom
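A minimal sketch of computing F(model) and its p-value, assuming made-up explained and unexplained variations; scipy's F distribution supplies the right-tail area:

```python
from scipy import stats

# Hypothetical quantities: n = 8 observations, k = 2 independent variables
n, k = 8, 2
explained, unexplained = 24.22, 1.33   # made-up variations

F_model = (explained / k) / (unexplained / (n - (k + 1)))

# p-value: area to the right of F(model) under an F distribution with
# k numerator and n - (k + 1) denominator degrees of freedom
p_value = stats.f.sf(F_model, k, n - (k + 1))
print(f"F(model) = {F_model:.2f}, p-value = {p_value:.4f}")
```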

Page 16: Chapter 14

14-16

14.5 Testing the Significance of an Independent Variable

A variable in a multiple regression model is not likely to be useful unless there is a significant relationship between it and y

To test significance, we use the null hypothesis H0: βj = 0 versus the alternative hypothesis Ha: βj ≠ 0

Page 17: Chapter 14

14-17

Testing Significance of an Independent Variable #2

Alternative       Reject H0 if     p-Value

Ha: βj > 0        t > tα           Area under t distribution right of t
Ha: βj < 0        t < -tα          Area under t distribution left of t
Ha: βj ≠ 0        |t| > tα/2 *     Twice area under t distribution right of |t|

* That is, t > tα/2 or t < -tα/2

Page 18: Chapter 14

14-18

Testing Significance of an Independent Variable #3

Test statistic:

t = bj / s_bj

where s_bj is the standard error of the estimate bj

100(1 - α)% confidence interval for βj: [bj ± tα/2 s_bj]

t, tα/2, and p-values are based on n - (k + 1) degrees of freedom
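A sketch of the two-sided test for one coefficient, assuming a made-up point estimate bj and standard error s_bj:

```python
from scipy import stats

# Hypothetical output for one coefficient: n = 8, k = 2
n, k = 8, 2
b_j, s_bj = -0.0900, 0.0141   # made-up estimate and its standard error

t = b_j / s_bj
# Two-sided p-value: twice the area to the right of |t| under a
# t distribution with n - (k + 1) degrees of freedom
p_value = 2 * stats.t.sf(abs(t), n - (k + 1))
print(f"t = {t:.3f}, p-value = {p_value:.4f}")
```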

Page 19: Chapter 14

14-19

Testing Significance of an Independent Variable #4

It is customary to test the significance of every independent variable

If we can reject H0: βj = 0 at the 0.05 level of significance, we have strong evidence that the independent variable xj is significantly related to y

At the 0.01 level of significance, we have very strong evidence

The smaller the significance level at which H0 can be rejected, the stronger the evidence that xj is significantly related to y

Page 20: Chapter 14

14-20

A Confidence Interval for the Regression Parameter βj

If the regression assumptions hold, a 100(1 - α)% confidence interval for βj is [bj ± tα/2 s_bj]

tα/2 is based on n - (k + 1) degrees of freedom
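A sketch of the interval computation, again with made-up bj and s_bj values:

```python
from scipy import stats

# Hypothetical values: n = 8, k = 2, 95% confidence
n, k, alpha = 8, 2, 0.05
b_j, s_bj = -0.0900, 0.0141   # made-up estimate and standard error

t_half = stats.t.ppf(1 - alpha / 2, n - (k + 1))
lower, upper = b_j - t_half * s_bj, b_j + t_half * s_bj
print(f"95% CI for beta_j: [{lower:.4f}, {upper:.4f}]")
```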

Page 21: Chapter 14

14-21

14.6 Confidence and Prediction Intervals

The point estimate corresponding to particular values x01, x02,…, x0k of the independent variables is

ŷ = b0 + b1x01 + b2x02 + … + bkx0k

It is unlikely that this value will exactly equal the mean value of y for these x values

We need bounds on how far the predicted value might be from the actual value

We can do this by calculating a confidence interval for the mean value of y and a prediction interval for an individual value of y

Page 22: Chapter 14

14-22

A Confidence Interval and a Prediction Interval

Confidence interval for the mean value of y:

[ŷ ± tα/2 s √(Distance value)]

Prediction interval for an individual value of y:

[ŷ ± tα/2 s √(1 + Distance value)]
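In practice a regression package produces both intervals. A minimal sketch using statsmodels with hypothetical data; the Distance value is handled internally:

```python
import numpy as np
import statsmodels.api as sm

# Hypothetical data: n = 8, k = 2
x1 = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
x2 = np.array([18.0, 14.0, 24.0, 22.0, 8.0, 16.0, 1.0, 0.0])
y  = np.array([12.4, 11.7, 12.4, 10.8, 9.4, 9.5, 8.0, 7.5])
X  = sm.add_constant(np.column_stack([x1, x2]))

fit = sm.OLS(y, X).fit()

# Intervals at x01 = 40, x02 = 10
x0 = np.array([[1.0, 40.0, 10.0]])
frame = fit.get_prediction(x0).summary_frame(alpha=0.05)
print(frame[["mean_ci_lower", "mean_ci_upper"]])  # CI for the mean value of y
print(frame[["obs_ci_lower", "obs_ci_upper"]])    # PI for an individual value of y
```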

Page 23: Chapter 14

14-23

14.7 Using Dummy Variables to Model Qualitative Independent Variables

So far, we have only looked at including quantitative data in a regression model

However, we may wish to include descriptive qualitative data as well

For example, we might want to include the gender of respondents

We can model the effects of different levels of a qualitative variable by using what are called dummy variables

Dummy variables are also known as indicator variables

Page 24: Chapter 14

14-24

How to Construct Dummy Variables

A dummy variable always has a value of either 0 or 1

For example, to model sales at two locations, we would code the first location as a 0 and the second as a 1

Operationally, it does not matter which location is coded 0 and which is coded 1

Page 25: Chapter 14

14-25

What If We Have More Than Two Categories?

Consider having three categories, say A, B, and C

We cannot code this using one dummy variable

A = 0, B = 1, and C = 2 would be invalid because it assumes the difference between A and B is the same as the difference between B and C

We must use multiple dummy variables

Specifically, k categories requires k - 1 dummy variables

Page 26: Chapter 14

14-26

What If We Have More Than Two Categories? Continued

For A, B, and C, we would need two dummy variables

x1 is 1 for A, zero otherwise

x2 is 1 for B, zero otherwise

If x1 and x2 are both zero, the category must be C

This is why the third dummy variable is not needed
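A sketch of constructing the k - 1 dummy variables with pandas, for a hypothetical three-category variable; drop_first=True leaves one category as the baseline (which category serves as the baseline is an arbitrary choice):

```python
import pandas as pd

# Hypothetical qualitative variable with three categories A, B, C
df = pd.DataFrame({"category": ["A", "B", "C", "A", "C", "B"]})

# k = 3 categories -> k - 1 = 2 dummy variables; the dropped category
# (A here) is the baseline, identified by both dummies being zero
dummies = pd.get_dummies(df["category"], prefix="cat", drop_first=True)
print(dummies)
```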

Page 27: Chapter 14

14-27

Interaction Models

So far, we have only considered dummy variables as stand-alone variables

The model so far is y = β0 + β1x + β2D + ε, where D is the dummy variable

However, we can also look at the interaction between a dummy variable and other variables

That model would take the form y = β0 + β1x + β2D + β3xD + ε

With an interaction term, both the intercept and slope are shifted
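A minimal sketch of fitting the interaction model, with hypothetical x, D, and y values:

```python
import numpy as np

# Hypothetical data: quantitative x, dummy D, response y
x = np.array([1.0, 2.0, 3.0, 4.0, 1.5, 2.5, 3.5, 4.5])
D = np.array([0, 0, 0, 0, 1, 1, 1, 1])
y = np.array([2.1, 3.0, 3.8, 5.2, 4.0, 6.1, 8.2, 9.9])

# Design matrix for y = b0 + b1*x + b2*D + b3*x*D: the x*D column lets
# the slope, not just the intercept, differ between the two groups
X = np.column_stack([np.ones_like(x), x, D, x * D])
b, *_ = np.linalg.lstsq(X, y, rcond=None)
print("b0, b1, b2, b3 =", b)
```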

Page 28: Chapter 14

14-28

14.8 Model Building and the Effects of Multicollinearity

Multicollinearity, which occurs when the independent variables are correlated with one another, causes problems evaluating the p-values of the model

Therefore, we need to evaluate more than the additional importance of each independent variable

We also need to evaluate how the variables work together

One way to do this is to determine if the overall model gives a high R² and adjusted R², a small s, and short prediction intervals

Page 29: Chapter 14

14-29

Effect of Adding Independent Variable

Adding any independent variable will increase R²

Even adding an unimportant independent variable

Thus, R² cannot tell us that adding an independent variable is undesirable

Page 30: Chapter 14

14-30

A Better Criterion

A better criterion is the size of the standard error s

If s increases when an independent variable is added, we should not add that variable

However, a decrease in s alone is not enough

An independent variable should only be included if it reduces s enough to offset the higher tα/2 value (the degrees of freedom fall when a variable is added) and thereby reduces the length of the desired prediction interval for y

Page 31: Chapter 14

14-31

C Statistic

Another quantity for comparing regression models is called the C (a.k.a. Cp) statistic

First, calculate the mean square error for the model containing all p potential independent variables (call it s²p)

Next, calculate SSE for a reduced model with k independent variables

Then the C statistic for the reduced model is

C = SSE / s²p - [n - 2(k + 1)]

Page 32: Chapter 14

14-32

C Statistic Continued

We want the value of C to be small

Adding unimportant independent variables will raise the value of C

While we want C to be small, we also wish to find a model for which C roughly equals k + 1

A model with C substantially greater than k + 1 has substantial bias and is undesirable

If a model has a small value of C and C for this model is less than k + 1, then it is not biased and the model should be considered desirable
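A quick numeric illustration of the C statistic, using made-up values for s²p and the reduced model's SSE:

```python
# Hypothetical quantities: n = 30 observations, p = 4 potential variables
n, p = 30, 4
s2_p = 1.25          # made-up mean square error of the full model
SSE_reduced = 33.8   # made-up SSE of a reduced model with k variables
k = 2

# C statistic for the reduced model, per the formula above
C = SSE_reduced / s2_p - (n - 2 * (k + 1))
print(f"C = {C:.2f}  (compare with k + 1 = {k + 1})")
```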

Page 33: Chapter 14

14-33

14.9 Residual Analysis in Multiple Regression

For an observed value of yi, the residual is

ei = yi - ŷi = yi - (b0 + b1xi1 + … + bkxik)

If the regression assumptions hold, the residuals should look like a random sample from a normal distribution with mean 0 and variance σ2

Page 34: Chapter 14

14-34

Residual Plots

Residuals versus each independent variable

Residuals versus the predicted y's

Residuals in time order (if the response is a time series)
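A minimal sketch of the first two plots with matplotlib, using hypothetical residuals and fitted values:

```python
import numpy as np
import matplotlib.pyplot as plt

# Hypothetical values from a fitted model
x1    = np.array([28.0, 28.0, 32.5, 39.0, 45.9, 57.8, 58.1, 62.5])
y_hat = np.array([12.1, 11.9, 12.2, 11.0, 9.6, 9.3, 8.1, 7.6])
e     = np.array([0.3, -0.2, 0.2, -0.2, -0.2, 0.2, -0.1, -0.1])

fig, axes = plt.subplots(1, 2, figsize=(8, 3))
for ax, x, label in [(axes[0], x1, "x1"), (axes[1], y_hat, "predicted y")]:
    ax.scatter(x, e)
    ax.axhline(0, linestyle="--")  # residuals should scatter evenly around 0
    ax.set_xlabel(label)
    ax.set_ylabel("residual")
plt.tight_layout()
plt.show()
```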