part v analyzing relationships 14 - pearson ed



Page 1: PART V ANALYZING RELATIONSHIPS 14 - Pearson Ed

CHAPTER 14

Analyzing Linear Relationships, Two or More Variables

INTRODUCTION

In the previous chapter, we introduced Kate Cameron, the owner of Woodbon, a company that produces high-quality wooden furniture. Kate wanted to understand why sales have grown steadily over recent years, with an eye to planning for the future. After carefully checking required conditions for the analysis, Kate created a mathematical model of the relationship between Woodbon’s sales and advertising. While there did seem to be a significant relationship between the two variables, the variability in the predictions meant that the model was not that useful for predicting sales.

The discussion in the previous chapter was a useful introduction to analyzing relationships between two variables. However, a more realistic process would begin with Kate analyzing the relationship between Woodbon’s sales and a number of possible explanatory variables. Some that have already been mentioned are housing starts and mortgage rates. Kate would likely examine a number of possible explanatory variables, with the aim of developing a model that is economical (that is, has reasonable data requirements) and works well (that is, makes useful predictions).

LEARNING OBJECTIVES

After mastering the material in this chapter, you will be able to:

1. Estimate the linear relationship between a quantitative response variable and one or more explanatory variables.

2. Check the conditions required for use of the regression model in hypothesis testing and prediction.

3. Assess the regression relationship, using appropriate hypothesis tests and a coefficient of determination.

4. Make predictions using the regression relationship.

5. Understand the considerations involved in choosing the “best” regression model, and the challenges presented by multicollinearity.

6. Use indicator variables to model qualitative explanatory variables.

PART V ANALYZING RELATIONSHIPS

14Ch14_Skuce.QXD 1/16/10 5:47 AM Page 525


Section 14.1 builds on the discussion in Chapter 13, to extend the mathematical model to include more than one explanatory variable. A reasonable way to start is with some careful thinking about what other factors could most reasonably be expected to affect Woodbon’s sales. For more complex models, it is crucial to have computer software to do the calculations, and you will see how to use Excel to build the mathematical model.

Section 14.2 extends the theoretical model from the last chapter to include more explanatory variables, revisiting the discussion about least-squares models. As before, we will use Excel to check the required conditions for the regression model.

Section 14.3 introduces hypothesis tests about the significance of the overall model, and the individual explanatory variables. We will also discuss a measure of the strength of the relationship between the explanatory variables and the response variable, the adjusted coefficient of determination (adjusted R²).

Section 14.4 describes an Excel add-in that you can use to make predictions of average and individual response variables, given specific values of the explanatory variables in the model.

In Section 14.5, we will discuss an approach to selecting the best explanatory variables for our regression model. Kate will want to develop a model of sales that makes good predictions, but the simplest model that does a good job will be preferred. Selecting the appropriate explanatory variables is an art as well as a science. An Excel add-in that produces a summary of all possible models will be introduced.

In Section 14.5, we will also look at ways to assess and deal with a new problem that may arise when there is more than one explanatory variable. This problem is usually referred to as “multicollinearity,” and it occurs when one of the explanatory variables is related to one or more of the other explanatory variables.

It is possible to include qualitative explanatory variables in regression models, and Section 14.6 illustrates the use of indicator variables to accomplish this.

Section 14.7 refers briefly to more advanced models, so that you can get a sense of the wide variety of mathematical modelling possibilities.


14.1 DETERMINING THE RELATIONSHIP: MULTIPLE LINEAR REGRESSION

In Chapter 13, Kate Cameron examined the relationship between advertising spending and sales. This simple linear regression model served as an introduction to the techniques of linear regression modelling.

It seems reasonable to think that there is a cause-and-effect relationship between advertising and sales. Kate is also wondering if sales are significantly affected by other explanatory variables. In particular, she is wondering about three others:

• Mortgage rates may affect a household’s ability to buy furniture. Kate expects the relationship to be negative, that is, when mortgage rates are higher, she would expect a household to have less income available to buy wooden furniture.

• Housing starts may also be related to sales. When more houses are being built, more households might buy Woodbon’s furniture. There is a fairly long lead time for a customer to take delivery of Woodbon’s furniture, so housing starts may be a useful explanatory variable.

• Kate has been exploring Statistics Canada data, and has discovered a series of “leading indicators.” In particular, she has identified a leading indicator for retail trade in furniture and appliances. Although the indicator is for Canada as a whole, Kate is wondering if it can give her some insight into Woodbon’s sales.

Creating Graphs to Examine the Relationships Between the Response Variable and the Explanatory Variables

Kate begins by collecting data for the three additional (potential) explanatory variables. She finds Statistics Canada data for housing starts in New Brunswick [1] (Woodbon is located in Saint John, and delivers throughout the province). The data are available on a quarterly basis. Kate decides to add up the quarterly numbers so she can relate annual housing starts to annual sales.

Statistics Canada provides data on a variety of mortgage interest rates, and Kate decides to work with mortgage rates for five-year conventional mortgages at chartered banks. Statistics Canada provides data about monthly mortgage rates [2] (based on the last Wednesday of the month). Kate could simply use the mortgage rate for one month of the year to represent mortgage rates for that year (e.g., the January or June mortgage rates). In general, it is simplest to use the available data in raw form to build models. In this case, Kate decides to compute a simple average of the monthly rates to create a data series of annual average mortgage rates.
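The two aggregation steps Kate uses (summing quarterly housing starts and averaging monthly mortgage rates) can be sketched in a few lines of Python. This is only an illustration: the numbers below are made up, and the variable names are assumptions, not part of the Woodbon data set.

```python
from statistics import mean

# Hypothetical figures for a single year (NOT the Woodbon or Statistics Canada data).
quarterly_starts = [520, 1110, 1240, 890]
monthly_rates = [6.5, 6.5, 6.6, 6.7, 6.8, 6.9, 7.0, 7.0, 6.9, 6.8, 6.7, 6.6]

# Annual housing starts: add up the four quarterly counts.
annual_starts = sum(quarterly_starts)

# Annual mortgage rate: simple (unweighted) average of the twelve monthly rates.
annual_rate = mean(monthly_rates)

print(annual_starts, round(annual_rate, 5))
```

Repeating this for every year gives one annual value per variable, which lines up with the annual sales and advertising figures.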

An excerpt of the data set, including data on sales and advertising, is shown on the next page in Exhibit 14.1. The complete data set is available in an Excel file called SEC14-2.

Initially, it can be useful to create scatter diagrams to explore the relationship between sales and each one of the potential explanatory variables. The four scatter diagrams are shown in Exhibit 14.2.

We have already established (in Chapter 13) that there is a positive association between Woodbon’s advertising expenditure and sales. From the scatter diagrams we can see that there is a negative relationship between Woodbon sales and mortgage interest rates, as expected. There appears to be a positive relationship between Woodbon sales and the Canada-wide leading indicator for retail trade in furniture and appliances, although it may not be linear. There is a somewhat curved appearance to the plot, which flattens out for higher levels of the leading indicator. [3] There does not appear to be much of a relationship between Woodbon sales and housing starts in New Brunswick. However, this does not necessarily mean that housing starts will not be useful in the regression model. This variable, in conjunction with others, could still potentially improve the model’s predictions.

CHAPTER 14 ANALYZING LINEAR RELATIONSHIPS, TWO OR MORE VARIABLES 527


[1] Statistics Canada, “CMHC, Housing Starts, under Construction and Completions, All Areas; New Brunswick; Housing Starts; Total Units; Unadjusted (units) [J15005],” CANSIM Table 027-0008, www.statcan.gc.ca, accessed October 13, 2008.
[2] Statistics Canada, “Financial Market Statistics, Last Wednesday Unless Otherwise Stated, Monthly (percent)(1), Bank of Canada – 7502 Rates (Percent) Chartered Bank – Conventional Mortgage: 5 year,” CANSIM Table 176-0043, www.statcan.gc.ca, accessed October 13, 2008.
[3] It is possible to model non-linear relationships, but this is a more advanced topic.


EXHIBIT 14.1 Woodbon Sales and Explanatory Variable Data

Woodbon Mortgage Rates, Housing Starts, Advertising Expenditure, Leading Indicator, and Sales, 1980–2007

Year   Mortgage    Housing Starts     Advertising    Leading Indicator (Retail Trade,      Sales
       Rates       (New Brunswick)    Expenditure    Furniture and Appliances, Canada)
1980   14.52083    2,646              $500           599                                   $26,345
1981   18.37500    2,188              $695           639                                   $31,987
1982   18.04167    1,680              $765           577                                   $21,334
...    ...         ...                ...            ...                                   ...
2004    6.23333    3,947              $2,500         2,014                                 $101,760
2005    5.99167    3,959              $2,700         2,195                                 $95,400
2006    6.66250    4,085              $3,500         2,515                                 $115,320
2007    7.07083    4,242              $3,200         2,648                                 $108,550

EXHIBIT 14.2 Scatter Diagrams for Woodbon Sales and Explanatory Variable Data

Determining the Relationship Between the Response Variable and the Explanatory Variables

We will begin our analysis by adding all of the new explanatory variables to create a new multiple regression model. The model is built using Excel’s Data Analysis Regression tool, as illustrated in Chapter 13. The only difference is that more than one explanatory variable will be selected for Input X Range. It is highly recommended that you include labels when selecting the data in Excel, because it will make the output much easier to read. Exhibit 14.3 illustrates.

EXHIBIT 14.3 Excel Regression Dialogue Box

(Callouts in the exhibit: Select labels as well as data for x and y values. Select all of the explanatory variable data, in adjacent columns, for multiple regression.)

The Woodbon output for the regression model is shown on the next page in Exhibit 14.4. At first glance, this model looks promising. The R² value is 0.955. The mathematical relationship from the regression model is as follows:

Woodbon annual sales = $89,159.92 - $3,814.57 (mortgage rate) - $6.40 (housing starts) + $14.71 (advertising expenditure) + $10.44 (leading indicator)

How do we interpret this mathematical model?

1. The intercept $89,159.92 is an estimate of average Woodbon sales when all of the explanatory variables have a value of zero. This number does not have a practical interpretation, because it is highly unlikely, for example, that mortgage rates would ever be zero. Additionally, a regression model should never be applied for values of the explanatory variables that are outside of the ranges in the data set used to build the model, and none of the variable values in the data set for Woodbon was even close to zero.

2. The mortgage rate coefficient can be interpreted as follows. If all values of the other variables are fixed at specific levels, there would be a decrease in Woodbon’s average sales by $3,814.57 for each 1% increase in the conventional five-year mortgage rates at the chartered banks. It seems reasonable that higher mortgage rates would leave less money available to households for spending on furniture, and so the negative relationship makes sense.

3. The housing starts coefficient can be interpreted as follows. If all values of the other explanatory variables are fixed at specific levels, there would be a decrease in Woodbon’s average sales by $6.40 for each additional housing start in New Brunswick. We would have expected that furniture spending would increase, not decrease, with additional housing starts. However, it is important to recognize that this coefficient applies only when all of the other explanatory variables are included in the model. There may be some interaction between housing starts and one or more of the other explanatory variables that results in a coefficient of the “wrong” sign. Remember, there did not appear to be a strong relationship between Woodbon’s sales and housing starts in the first place. The fact that the sign on the coefficient is “wrong” increases our suspicion that this “explanatory” variable may not actually explain very much about Woodbon’s sales.

4. The advertising expenditure coefficient indicates that each additional dollar in advertising spending results in an increase of $14.71 in Woodbon’s annual sales, when all other variables are held the same. Note that this coefficient is different from the $35.17 value in the regression model based on advertising expenditure alone (see Chapter 13 page 484).
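The "all other variables held fixed" interpretation can be checked numerically. The sketch below plugs the 2007 row of Exhibit 14.1 into the all-variables model and then raises the mortgage rate by one percentage point. The coefficient magnitudes come from the text; the signs follow the interpretation above, and the leading indicator coefficient is taken as positive, an assumption that gives a 2007 prediction close to the actual sales figure. The function and dictionary names are illustrative only.

```python
# Coefficients of the all-variables Woodbon model as reported in the text.
# Assumption: the leading indicator coefficient is positive.
INTERCEPT = 89159.92
COEF = {
    "mortgage_rate": -3814.57,
    "housing_starts": -6.40,
    "advertising": 14.71,
    "leading_indicator": 10.44,
}


def predict_sales(values):
    """Return predicted annual sales for a dict of explanatory-variable values."""
    return INTERCEPT + sum(COEF[name] * values[name] for name in COEF)


# 2007 row from Exhibit 14.1 (actual sales that year were $108,550).
row_2007 = {"mortgage_rate": 7.07083, "housing_starts": 4242,
            "advertising": 3200, "leading_indicator": 2648}
base = predict_sales(row_2007)

# Raise the mortgage rate by one percentage point, holding everything else fixed.
bumped = predict_sales({**row_2007, "mortgage_rate": row_2007["mortgage_rate"] + 1})

print(round(base, 2))            # predicted 2007 sales
print(round(bumped - base, 2))   # the change equals the mortgage rate coefficient
```

The difference between the two predictions is exactly the mortgage rate coefficient, which is what "for each 1% increase, with the other variables fixed" means in practice.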

EXHIBIT 14.4 Excel Regression Output for Woodbon Sales and Explanatory Variable Data (All Variables Included)

It appears that the “all-in” model has some difficulties. Normally, we might stop and rethink at this point. However, we will continue analyzing this model, because it will give us the opportunity to discuss relationships which both do and do not meet the required conditions of the theoretical linear regression model.

DEVELOP YOUR SKILLS 14.1

The Develop Your Skills exercises in this chapter frequently refer to a data set called “Salaries.”

1. For the Salaries data set, create scatter diagrams showing the relationship between each possible explanatory variable and salaries. Are there any obvious problems? Are there some variables that seem particularly strong as candidates for explanatory variables?

2. For the Salaries data set, create a multiple regression model that includes all the possible explanatory variables. Interpret this model. Are there any obvious difficulties with this model?

3. Create a scatter diagram showing the relationship between age and years of experience. Does it seem sensible to include both of these explanatory variables in the model?

4. Create a multiple regression model for the Salaries data set that includes years of postsecondary education and age as explanatory variables. Interpret the model.

5. Create a multiple regression model for the Salaries data set that includes years of postsecondary education and years of experience as explanatory variables. Interpret the model.

14.2 CHECKING THE REQUIRED CONDITIONS

The Theoretical Model

In Chapter 13, we described how the least-squares line was created, as a best fit between the explanatory and response variables. The theoretical relationship was

\[ y = \beta_0 + \beta_1 x + \varepsilon \]

This indicated that the y-value could be predicted from x. The ε term reminds us that we do not expect the prediction to be perfect. There may be some unexplained or random variation in the y-values that cannot be predicted from the x-values.

The corresponding notation for the regression relationship based on sample data is

\[ \hat{y} = b_0 + b_1 x \]

The coefficients b0 and b1 were arrived at by minimizing the sum of the squared residuals for the data set, that is

\[ SSE = \sum (y_i - \hat{y}_i)^2 \]

Now we extend the model so that it includes more explanatory variables:

\[ y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon \]

The corresponding notation for the relationship based on sample data is

\[ \hat{y} = b_0 + b_1 x_1 + b_2 x_2 + \cdots + b_k x_k \]

Again, the coefficients are estimated by minimizing the sum of squared residuals, for all of the data points in the data set. This requires the use of advanced algebra, but the idea is the same as in Chapter 13. Essentially, this creates a multiple regression model where the predicted values are simultaneously as close as possible to the observed values.
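To make the "advanced algebra" a little more concrete, here is a minimal Python sketch of one standard way to find the least-squares coefficients: building and solving the normal equations. It is not how Excel is documented to do the calculation, the data are hypothetical, and the function names are illustrative.

```python
def fit_least_squares(xs, ys):
    """Fit y-hat = b0 + b1*x1 + ... + bk*xk by solving the normal equations.

    xs is a list of rows, each a list of k explanatory values.
    Returns the coefficients [b0, b1, ..., bk].
    """
    rows = [[1.0] + [float(v) for v in r] for r in xs]  # leading 1 for the intercept
    p = len(rows[0])
    # Augmented matrix [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(p)]
         + [sum(r[i] * y for r, y in zip(rows, ys))] for i in range(p)]
    # Gauss-Jordan elimination with partial pivoting.
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(p):
            if r != col:
                factor = a[r][col] / a[col][col]
                a[r] = [v - factor * w for v, w in zip(a[r], a[col])]
    return [a[i][p] / a[i][i] for i in range(p)]


def sse(xs, ys, coefs):
    """Sum of squared residuals (observed minus predicted) for given coefficients."""
    total = 0.0
    for row, y in zip(xs, ys):
        y_hat = coefs[0] + sum(b * x for b, x in zip(coefs[1:], row))
        total += (y - y_hat) ** 2
    return total


# Hypothetical data: roughly y = 1 + 2*x1 + 3*x2, with small disturbances added.
xs = [[1, 2], [2, 1], [3, 4], [4, 3], [5, 6], [6, 5]]
ys = [9.1, 7.9, 19.2, 17.8, 29.1, 27.9]
coefs = fit_least_squares(xs, ys)
print([round(b, 3) for b in coefs], round(sse(xs, ys, coefs), 4))
```

Any other choice of coefficients gives a larger value of `sse` for the same data, which is the sense in which the fitted values are "as close as possible" to the observed values.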

When the model has one response variable and one explanatory variable, as in Chapter 13, we can think of the relationship as a line, because we are operating in two dimensions (x and y). When we have one response variable and k explanatory variables, we are operating in k + 1 dimensions. With two explanatory variables, we can imagine a plane as the regression surface. With more than two explanatory variables, there is no way to picture the regression relationship.

Examining the Residuals

In Chapter 13, we saw that analysis of the residuals \((y_i - \hat{y}_i)\) was required to check whether the sample data appear to conform to the requirements of the least-squares regression model. As before, we can legitimately make predictions with the model, or perform hypothesis tests about the relationship between the y- and x-variables, only if these requirements are met.

Requirements for Predictions or Hypothesis Tests About the Multiple Regression Relationship

1. For any specific combination of the x-values, there are many possible values of y and the residual (or “error term”) ε. The distribution of these ε-values must be normal for any specific combination of x-values. This means that the actual y-values will be normally distributed around the predicted y-values from the regression relationship, for every specific combination of x-values.

2. These normal distributions of ε-values must have a mean of zero. The actual y-values will have expected values, or means, that are equal to the predicted y-values from the regression relationship.

3. The standard deviation of the ε-values, which we refer to as σ_ε, is the same for every combination of x-values. The normal distributions of actual y-values around the predicted y-values from the regression relationship will have the same variability for every specific combination of x-values.

4. The ε-values for different combinations of the x-values are not related to each other. The value of the error term ε is statistically independent of any other value of ε.

As in Chapter 13, we create a number of residual plots to check these requirements. If they appear to be met in the sample data, we will assume they are met in the population. As usual, Excel is a great help in creating the required graphs. As before, in the Regression dialogue box, you should tick Residuals, Standardized Residuals, and Residual Plots. As in Chapter 13, you should create a histogram of the residuals, and you should plot the residuals against time if you have time-series data.

Variation in the Residuals Is Constant

A plot of the residuals against the predicted values from the model can give us an indication of whether the variability of the error term is constant. Such a plot can be created from the information created by Excel in the Residual Output, an excerpt of which is shown below in Exhibit 14.5 (note that some of the rows of data have been hidden in the worksheet).

EXHIBIT 14.5 Excerpt of Excel Regression Output for Woodbon, Showing Residuals

The plot of residuals against predicted values is shown below in Exhibit 14.6. In Excel 2007, it is quite easy to produce this scatter diagram. Simply highlight the two adjacent columns (Predicted Sales and Residuals, in this case), and Insert a Scatter diagram.

EXHIBIT 14.6 Plot of Residuals Against Predicted Sales, Woodbon

The requirement is that the variability of the error term is constant, so a residual plot with constant variability (a horizontal band, centred on zero vertically) is ideal. This residual plot does not show any particular pattern, and appears as a horizontal band. There is one residual that is unusually high (it is circled on the plot). This point corresponds to the data point for 1991.
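A rough numeric companion to the visual check is to compare the spread of the residuals for low and high predicted values. The sketch below is only an informal aid, not a formal test and not something Excel's Regression tool produces; the data are made up and the function name is illustrative.

```python
from statistics import pstdev


def spread_by_half(predicted, residuals):
    """Return the residual standard deviation for the lower and upper half of predictions."""
    pairs = sorted(zip(predicted, residuals))   # order observations by predicted value
    mid = len(pairs) // 2
    low = [res for _, res in pairs[:mid]]
    high = [res for _, res in pairs[mid:]]
    return pstdev(low), pstdev(high)


# Hypothetical predicted values and residuals (NOT the Woodbon output).
predicted = [20, 30, 40, 50, 60, 70, 80, 90]
residuals = [3, -4, 5, -3, 4, -5, 3, -4]

low_sd, high_sd = spread_by_half(predicted, residuals)
print(round(low_sd, 2), round(high_sd, 2), round(high_sd / low_sd, 2))
```

A ratio near 1 is consistent with a horizontal band; a ratio far from 1 suggests the band widens or narrows and the plot deserves a closer look.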

It can also be helpful to plot the residuals against each individual x-variable, particularly if there appears to be a problem with the plot of the residuals against the predicted values. The additional plots can indicate which explanatory variables might be the source of any problem.

When Residual Plots is ticked as an option in Excel’s Data Analysis tool for Regression, graphs are automatically created to show residuals against every x-variable in the model. The graphs for the Woodbon model are shown in Exhibit 14.7 below. Note that the graphs have been resized for visibility.

EXHIBIT 14.7 Residual Plots for Woodbon Multiple Regression Model

The mortgage rates residual plot has the desired horizontal band appearance, although there is one point (circled on the plot) where a residual seems unusually high. This is the data point for 1991, the same point that stood out in Exhibit 14.6.

The housing starts residual plot exhibits fairly constant variability in the residuals, although again there is one point (circled) that gives an unusually high residual. Again, this is the data point from 1991.

The advertising expenditure residual plot has something of a horizontal band appearance, although there does not seem to be as much variability in the residuals when the advertising expenditure is higher. Again, the data point for 1991 shows an unusually high residual.

The leading indicator residual plot gives the greatest cause for concern. This plot shows reduced variability in the error term for higher values of the leading indicator, and the pattern is more pronounced than for the advertising expenditure residual plot, although it is also affected by the 1991 data point. Remember that the scatter plot of sales against the leading indicator looked non-linear. The residual plot is consistent with the curved scatter diagram that we saw earlier. Since we are trying to build a linear multiple regression model, we may not be able to use this variable in its present form.

Independence of Error Terms

Plotting the residuals in the time order in which the data occurred allows us to check if the error terms are related over time. The Woodbon data set was arranged by year, so the residuals are also arranged by year.

A plot of the residuals over time is shown in Exhibit 14.8 below.

EXHIBIT 14.8 Plot of Residuals Over Time, Woodbon Model

There does not appear to be any particular pattern in the residuals over time, and so it is reasonable to conclude that the residuals are independent over time.

Normality of Residuals

A histogram is created for the residuals, to check for normality. The histogram for the initial Woodbon model is shown on the next page in Exhibit 14.9.

The histogram appears to be approximately normal, although it is somewhat skewed to the right. As well, the histogram appears to be centred approximately around zero, which is desirable.

Outliers and Influential Observations

Outliers, that is, observations that are far from other observations, should always be investigated. Such points may be the result of an error in observing or recording the data. If that is the case, and they are not corrected or removed, the result will be a model that is less correct than it could be. As before, a general rule is to investigate any observation with a standardized residual that is ≥ +2 or ≤ -2. If you examine the Residual Output for the Woodbon model (see Exhibit 14.5 on page 533), you will see that there is one point that would be identified as an outlier, that is, observation 12. It is no surprise that data point 12 is the observation for 1991, given how often it has shown up as an unusual point in the residual plots. This data point is accurately recorded. There is no obvious reason why it does not belong in the data set. Therefore, we will not discard it.
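The ±2 rule is easy to apply once the residuals are available. The sketch below standardizes each residual by dividing by the sample standard deviation of the residuals, which is an assumption about the scaling (Excel's Standardized Residuals column may be computed slightly differently), and flags any observation at or beyond the cutoff. The residuals are hypothetical, with observation 12 made deliberately large.

```python
from statistics import stdev


def flag_outliers(residuals, cutoff=2.0):
    """Return (observation number, standardized residual) pairs with |value| >= cutoff."""
    sd = stdev(residuals)                         # sample standard deviation of the residuals
    standardized = [r / sd for r in residuals]
    return [(i + 1, round(z, 2)) for i, z in enumerate(standardized) if abs(z) >= cutoff]


# Hypothetical residuals (NOT the Woodbon output); they sum to zero, as least-squares
# residuals do when the model has an intercept.
residuals = [-3.1, 2.4, -1.8, 0.9, -2.2, 1.5, -0.7, 2.0,
             -1.2, 0.4, -2.6, 14.0, -1.9, -2.5, -1.6, -3.6]

flags = flag_outliers(residuals)
print(flags)
```

A flagged observation is a prompt to check the data, not an automatic reason to delete the point.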

An influential observation is one that has an extreme effect on the regression model. In the simple linear regression model discussed in Chapter 13, we could use scatter plots to identify such values. Influential observations are more difficult to locate in the multiple regression model, because the influence might come from just one x-variable, or a combination of them. There are several techniques available to help identify influential observations, and they can be found in more advanced texts. If you suspect that an observation is having an undue influence on the regression model, one way to check is to recalculate the model without the suspect observation. If the regression coefficients change significantly (a judgment call), then the observation is influential.
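The "recalculate without the suspect observation" check can be sketched as follows. To keep the example short it uses a single explanatory variable and the closed-form least-squares line; the same idea applies to a multiple regression model refitted in Excel. The data and names are hypothetical.

```python
def fit_line(xs, ys):
    """Least-squares (intercept, slope) for a single explanatory variable."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical data: the last point is far from the others in x and off the trend.
xs = [1, 2, 3, 4, 5, 6, 15]
ys = [2.1, 3.9, 6.2, 8.1, 9.8, 12.1, 12.0]

suspect = 6  # zero-based position of the observation being checked
full = fit_line(xs, ys)
reduced = fit_line(xs[:suspect] + xs[suspect + 1:], ys[:suspect] + ys[suspect + 1:])

print("with point:   ", [round(v, 2) for v in full])
print("without point:", [round(v, 2) for v in reduced])
```

Here the slope changes substantially when the point is dropped, so the observation would be judged influential. How large a change counts as "significant" remains a judgment call.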

EXHIBIT 14.9 Histogram of Residuals for Woodbon Multiple Regression Model

What If the Required Conditions Are Not Met?

The hypothesis tests and confidence intervals that will be described in Sections 14.3 and 14.5 are valid only if the data appear to meet the requirements for the linear regression model. If they do not, further work must be done before hypothesis tests are done or confidence intervals are calculated.

There exist more advanced techniques that could be used to solve some of the problems that arise. For example, it may be possible to transform the data by applying some mathematical function (such as a logarithm or square root) to the original data, and work with the new measurements. It may also be possible to build a useful model without including the variables that are responsible for the conditions not being met. It may be necessary to start over, to try to find explanatory variables that meet the requirements. It may also be useful to proceed, if the violation of the required conditions is not too pronounced. If the resulting model provides useful predictions, it may be the best we can do.
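As a small illustration of the transformation idea, the sketch below builds a hypothetical response that flattens out as x grows, then compares how linear the relationship looks before and after taking the logarithm of x, using the correlation coefficient as a rough yardstick. The data are invented and constructed to be exactly linear in log(x), so real data will not behave this neatly.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    x_bar, y_bar = sum(xs) / n, sum(ys) / n
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


# Hypothetical explanatory values and a response that flattens out for larger x.
xs = [500, 700, 1000, 1400, 1900, 2500]
ys = [10 + 40 * math.log(x) for x in xs]   # exactly linear in log(x) by construction

r_raw = pearson_r(xs, ys)                          # curved relationship
r_log = pearson_r([math.log(x) for x in xs], ys)   # straightened by the transformation
print(round(r_raw, 3), round(r_log, 3))
```

If a transformation straightens the relationship, the transformed variable can be used in the regression in place of the original, and the residual checks are then repeated on the new model.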

Before we proceed with our regression analysis, we will remove the leading indicator as an explanatory variable, as it does not meet the requirements for linear regression in its present form. In particular, the variability in the residuals is not constant. Especially when we are just beginning our analysis of a model, we should not necessarily discard explanatory variables that do not meet the requirements of a linear regression model, especially when they seem to be reasonable choices. As mentioned above, we may be able to transform the leading indicator data so that the model does meet the requirements for linear regression. However, the leading indicator data also presents other difficulties (see Section 14.5), so we will drop it now to streamline the discussion. As we will see later, choosing the best explanatory variables for any model is an art.

Once we use Excel to create a new regression model, we will see that the model better meets the required conditions for linear regression. Example 14.2 illustrates.

EXAMPLE 14.2 Checking conditions for linear multiple regression

Use Excel to re-specify the multiple regression relationship between Woodbon sales and mortgage rates, housing starts, and advertising expenditure. Check to see that the new model meets the required conditions for hypothesis tests and confidence intervals.

The regression output for the new model is shown in Exhibit 14.10 below.

EXHIBIT 14.10 Regression Output for Woodbon Model, with Mortgage Rates, Housing Starts, and Advertising Expenditure as Explanatory Variables

From the output, we can see that the regression relationship has become (approximately):

Woodbon annual sales = $80,640.57 - $3,521.91 (mortgage rate) - $5.48 (housing starts) + $23.41 (advertising expenditure)
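The revised relationship can be applied to the rows visible in the Exhibit 14.1 excerpt to see what predicted values and residuals look like. This is a sketch only: the coefficients are the rounded values quoted in the text, so the results will differ slightly from Excel's Residual Output, and the function name is illustrative.

```python
def predict_sales(mortgage_rate, housing_starts, advertising):
    """Predicted annual sales from the three-variable Woodbon model (rounded coefficients)."""
    return 80640.57 - 3521.91 * mortgage_rate - 5.48 * housing_starts + 23.41 * advertising


# Rows shown in the Exhibit 14.1 excerpt:
# (year, mortgage rate, housing starts, advertising expenditure, actual sales)
rows = [
    (1980, 14.52083, 2646, 500, 26345),
    (1981, 18.37500, 2188, 695, 31987),
    (1982, 18.04167, 1680, 765, 21334),
    (2004, 6.23333, 3947, 2500, 101760),
    (2005, 5.99167, 3959, 2700, 95400),
    (2006, 6.66250, 4085, 3500, 115320),
    (2007, 7.07083, 4242, 3200, 108550),
]

for year, rate, starts, adv, actual in rows:
    predicted = predict_sales(rate, starts, adv)
    residual = actual - predicted   # residual = observed minus predicted
    print(year, round(predicted), round(residual))
```

These residuals are the raw material for the residual plots, the plot over time, and the histogram discussed in the rest of the example.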

The various residual plots for the revised model all appear to conform to the required conditions. Exhibit 14.11 shows the revised predicted sales residual plot, which appears to have the desirable horizontal band of points.

EXHIBIT 14.11 Predicted Sales Residual Plot for Woodbon Model, with Mortgage Rates, Housing Starts, and Advertising Expenditure as Explanatory Variables

As well, none of the residual plots for the three explanatory variables give an indication of violation of the required conditions. Exhibit 14.12 opposite shows these residual plots.

The plot of the residuals over time does not exhibit any particular pattern, so wecan conclude that the residuals are independent over time. Exhibit 14.13 illustrates.

Finally, the histogram of residuals for the new Woodbon model appears fairlynormal, although there is some right-skewness. Exhibit 14.14 on page 540 illustrates.

A check of the standardized residuals reveals one data point that could be classifiedas an outlier. This is the data point that corresponds to 1991 (notice that this point alsoattracted our attention when we examined the residual plots for the original model).Since the data are correct, we will leave the point in the data set. It appears that 1991 wasnot a typical year for Woodbon.


EXHIBIT 14.12 Residual Plots for Woodbon Multiple Regression Model (Mortgage Rates, Housing Starts, Advertising Expenditure as Explanatory Variables)

EXHIBIT 14.13 Plot of Residuals Over Time, Woodbon Model (Mortgage Rates, Housing Starts, Advertising Expenditure as Explanatory Variables)


EXHIBIT 14.14 Histogram of Residuals for Woodbon Model (Mortgage Rates, Housing Starts, Advertising Expenditure as Explanatory Variables)

The new Woodbon model appears to meet all of the required conditions for the linear regression model. However, it still does not meet the test of common sense, in that the coefficient for housing starts is negative when we would expect it to be positive. We will say more about this difficulty in Section 14.5, when we discuss criteria for selecting explanatory variables.

GUIDE TO TECHNIQUE

Checking Requirements for the Linear Multiple Regression Model

When:

• before performing hypothesis tests or using the regression relationship to create confidence or prediction intervals

• using sample data to assess whether the relationship conforms to requirements

Steps:

1. Produce scatter diagrams for the relationship between each explanatory variable and the response variable. Check to see that each relationship appears linear (no pronounced curvature).

2. Use Excel’s Regression tool to produce the Residuals, Standardized Residuals, and Residual Plots.


3. Create a plot of the residuals versus the predicted y-values. The residuals should be randomly distributed around zero, with the same variability throughout.

4. Examine the plots of residuals versus each explanatory variable. Again, the residuals should be randomly distributed around zero, with the same variability (a random horizontal band appearance is desirable).

5. Check time-series data by plotting the residuals in time order. There should be no discernible pattern to the plot.

6. Create a histogram of the residuals. This should be approximately normal, and centred on zero.

7. Check for outliers and influential observations. Carefully check any data point with a standardized residual ≥ +2 or ≤ −2.

Note: If these investigations indicate significant problems, you should not proceed with a hypothesis test of the significance of the model, and you should not create confidence intervals or prediction intervals with the model in its current form.
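The outlier check in step 7 can also be sketched outside Excel. The Python below is an illustration only: the residual values are made up (they are not the Woodbon data), and dividing each residual by the sample standard deviation of the residuals is a simple approximation that may differ slightly from the standardized residuals a particular package reports.

```python
import statistics

def flag_outliers(residuals, cutoff=2.0):
    """Return (index, standardized residual) pairs with |value| >= cutoff."""
    # Sample standard deviation of the residuals (n - 1 divisor).
    s = statistics.stdev(residuals)
    flagged = []
    for i, e in enumerate(residuals):
        z = e / s
        if abs(z) >= cutoff:
            flagged.append((i, round(z, 2)))
    return flagged

# Hypothetical residuals, chosen so that they sum to zero.
residuals = [1.2, -0.8, 0.5, -1.1, 0.3, 6.0, -0.9, 0.4, -1.0, -4.6]
print(flag_outliers(residuals))
```

Any point flagged this way should be checked against the original records before deciding whether to keep it, as was done for the 1991 Woodbon observation.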

DEVELOP YOUR SKILLS 14.2

The Develop Your Skills exercises in this chapter frequently refer to a data set called “Salaries.”

6. Examine the residual plots produced by Excel for the Salaries multiple regression model that you built for Develop Your Skills 14.1, Exercise 2, which included all possible explanatory variables. Are these residual plots consistent with the required conditions? Create a plot of the residuals versus predicted salaries. Does this plot meet the required conditions?

7. Examine the residual plots produced by Excel for the Salaries multiple regression model that you built for Develop Your Skills 14.1, Exercise 4, which included years of postsecondary education and age as explanatory variables. Are these residual plots consistent with the required conditions? Create a plot of the residuals versus predicted salaries. Does this plot meet the required conditions?

8. Examine the residual plots produced by Excel for the Salaries multiple regression model that you built for Develop Your Skills 14.1, Exercise 5, which included years of postsecondary education and years of experience as explanatory variables. Are these residual plots consistent with the required conditions? Create a plot of the residuals versus predicted salaries. Does this plot meet the required conditions?

9. Create histograms of the residuals for the models discussed in Exercises 6, 7, and 8 above. Do these histograms appear to be at least approximately normal?

10. Check the Excel output for the models created in Exercises 6, 7, and 8 above for outliers. If you had access to the original records for this data set, what would you do?

SALARIES

14.3 HOW GOOD IS THE REGRESSION?

Because the new Woodbon model appears to meet the required conditions, we can now conduct hypothesis tests about the overall model, and about the individual explanatory variables. We begin by testing whether the regression model is significant. Given this sample data set, is there evidence that there is a population regression relationship between sales and at least one of the explanatory variables?


Is the Regression Model Significant? The F-Test

In the discussion of simple linear regression (Chapter 13), we performed a test of hypothesis about the slope of the regression line.

In multiple regression, we test the model as a whole.

H0: β1 = β2 = ... = βk = 0

H1: At least one of the βi’s is not zero.

If the null hypothesis is true, then the y-variable is not related to any of the x-variables. If the alternative hypothesis is true, then the y-variable is related to at least one of the x-variables. As in Chapter 13, we hope to reject the null hypothesis, so that we can conclude there is a significant relationship between the response variable and at least one of the explanatory variables.

We conduct the hypothesis test by examining how much of the variation in the y-variable is explained by the regression relationship. Remember from Chapter 13 that the total variation in the y-values can be broken down into two parts: the variation that is explained by the regression relationship, and the variation that is left unexplained.

$$\sum (y - \bar{y})^2 = \sum (\hat{y} - \bar{y})^2 + \sum (y - \hat{y})^2$$

It is usual to describe this relationship as follows:

SST = SSR + SSE

The total sum of squares (SST) is equal to the sum of squares explained by the regression (SSR) plus the residual (or error) sum of squares (SSE).

When the response variable (y) is related to the explanatory variables, then SSR will be relatively large, and SSE will be relatively small. Before we can compare SSR and SSE, we must adjust them so they are directly comparable. This is accomplished by dividing each by its degrees of freedom to calculate the associated mean square value.⁴

The degrees of freedom for the error sum of squares are n − (k + 1), because we estimate k coefficients plus an intercept from n data points. The degrees of freedom for the total variation are (n − 1). This leaves k degrees of freedom for the regression sum of squares.

The test statistic is the ratio of the mean squares, and is an F statistic. The F distribution is described on pages 412–419 in Chapter 11.

$$F = \frac{SSR / k}{SSE / (n - (k + 1))} = \frac{MSR}{MSE}$$
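As a minimal sketch of the arithmetic behind the F statistic, the Python function below divides each sum of squares by its degrees of freedom and takes the ratio. The function name and the input values are hypothetical and chosen only to make the arithmetic easy to follow; they are not taken from the Woodbon output.

```python
def f_statistic(ssr, sse, n, k):
    """F = MSR / MSE for a regression with k explanatory variables and n data points."""
    msr = ssr / k                # mean square for regression, k degrees of freedom
    mse = sse / (n - (k + 1))    # mean square for error, n - (k + 1) degrees of freedom
    return msr / mse

# Hypothetical sums of squares (SST = SSR + SSE = 1000)
ssr, sse = 900.0, 100.0
n, k = 24, 3
print(f_statistic(ssr, sse, n, k))
```

With these made-up values, MSR is 900/3 = 300 and MSE is 100/20 = 5. Converting the F statistic into a p-value requires the F distribution with (k, n − (k + 1)) degrees of freedom, which Excel reports directly in its Regression output.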


4 If this seems familiar, it should. We made the same adjustment to compare mean squares in Chapter 11. See page 411.


When the null hypothesis is true, and the response variable is not related to any of the explanatory variables, the mean square for the regression (MSR) will not be significantly larger than the mean square for error (MSE), and the F statistic will be relatively small. However, when the response variable is related to at least one of the explanatory variables, the MSR will be significantly larger than the MSE, and the F statistic will be relatively large. The question is, how large does the F statistic have to be to provide evidence of a significant relationship?

Of course, the answer depends on sampling variability. As usual, we have to refer to the sampling distribution of the F statistic to decide whether any specific F statistic is unusual enough for us to reject the null hypothesis. The sampling distribution of the F statistic depends on the number of data points and the number of explanatory variables.


The Sampling Distribution of MSR/MSE in Linear Multiple Regression Models

The sampling distribution of MSR/MSE follows the F distribution, with (k, n − (k + 1)) degrees of freedom, where n is the number of observed data points and k is the number of explanatory variables in the model.

Fortunately, the Excel output not only calculates the F statistic for the hypothesis test of the regression model, it also calculates the associated p-value.

Hypothesis test of significance of regression model

Complete the hypothesis test for the significance of the Woodbon model based on mortgage rates, housing starts, and advertising expenditure. Use a 5% level of significance.

H0: β1 = β2 = ... = βk = 0

H1: At least one of the βi’s is not zero.

α = 0.05

The Excel output of the regression model is reproduced on the next page in Exhibit 14.15, for ease of reference.

From the output, we see that F = 153.9 and the p-value is 0.000000000000000836. Since the p-value is less than the level of significance, we reject H0. There is strong evidence to infer that there is a significant relationship between Woodbon sales and at least one of the explanatory variables. As always, we remember that a significant relationship is not necessarily a cause-and-effect relationship. Some other factor may be the cause of associated changes in sales and the explanatory variables.

EXAMPLE 14.3A


Are the Explanatory Variables Significant? The t-Test

If the hypothesis test of the overall regression model indicates a significant relationship between the response variable and at least one of the explanatory variables, the next step is to figure out which of the explanatory variables is significant.

We conducted a t-test about the slope of the regression line in Chapter 13. We test the individual coefficients in the multiple regression model in a similar fashion. The test of the coefficient of explanatory variable i is conducted as follows.

H0: βi = 0

H1: βi ≠ 0

The test statistic is

$$t = \frac{b_i}{s_{b_i}}$$

with n − (k + 1) degrees of freedom.

It is important to realize that this t-test for the significance of each explanatory variable assumes that all the other explanatory variables are included in the model. The p-values for the individual coefficients do give us some indication of how important each explanatory variable is. Those with small p-values are likely more strongly related to the response variable. However, we cannot just eliminate an explanatory variable with a p-value greater than the level of significance. If we decide to eliminate any explanatory variable, we must rerun the regression analysis and examine the new model and the new p-values for the coefficients.

EXHIBIT 14.15 Excel Regression Output for Woodbon Model, with Mortgage Rates, Housing Starts, and Advertising Expenditure as Explanatory Variables

Remember that the t-tests for the individual coefficients should only be conducted if the F-test of the overall model shows that it is significant. We can control the Type I error rate on a single t-test with the level of significance (α), but the error rate becomes larger with repeated tests based on the same data set.⁵ Therefore, the individual t-tests should only be performed when the overall Type I error rate is controlled, through the F-test.
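To see roughly how quickly the error rate can grow with repeated tests, the sketch below computes the chance of at least one Type I error across m tests, each at level α. It assumes the tests are independent, which is a simplifying assumption for illustration only: t-tests based on the same data set are generally not independent, so the figures are indicative rather than exact.

```python
def familywise_error_rate(alpha, m):
    """P(at least one Type I error) in m independent tests, each at level alpha."""
    return 1 - (1 - alpha) ** m

for m in (1, 3, 10):
    print(m, round(familywise_error_rate(0.05, m), 3))
```

Under this assumption, three tests at the 5% level already carry an overall error rate of about 14%, which is why the individual t-tests are only examined after the F-test has shown the overall model to be significant.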


Hypothesis tests of individual coefficients in regression model

Conduct hypothesis tests about the significance of the individual coefficients in the Woodbon model.

As in the simple linear regression case, the Excel output contains the p-values for the two-tailed tests of significance for the individual coefficients. An excerpt from the Excel output is shown below in Exhibit 14.16.

EXAMPLE 14.3B

EXHIBIT 14.16 Excerpt from the Excel Regression Output for Woodbon Model, with Mortgage Rates, Housing Starts, and Advertising Expenditure as Explanatory Variables

5 This problem was discussed in Chapters 10 and 11. Be careful when you are skating back and forth across the frozen lake!

For convenience, we will refer to the mortgage rates as explanatory variable 1, housing starts as explanatory variable 2, and advertising expenditure as explanatory variable 3.

Hypothesis test for mortgage rates:

H0: β1 = 0

H1: β1 ≠ 0

$$t = \frac{b_1}{s_{b_1}} = -5.18 \quad \text{(from Excel output)}$$

The p-value is 0.000026, so we reject H0. There is strong evidence that mortgage rates are a significant explanatory variable for Woodbon annual sales, when housing starts and advertising expenditure are included in the model.


Hypothesis test for housing starts:

H0: β2 = 0

H1: β2 ≠ 0

$$t = \frac{b_2}{s_{b_2}} = -3.08 \quad \text{(from Excel output)}$$

The p-value is 0.005, so we reject H0. There is strong evidence that housing starts are a significant explanatory variable for Woodbon annual sales, when mortgage rates and advertising expenditure are included in the model.

Hypothesis test for advertising expenditure:

H0: β3 = 0

H1: β3 ≠ 0

$$t = \frac{b_3}{s_{b_3}} = 7.42 \quad \text{(from Excel output)}$$

The p-value is 0.0000001, so we reject H0. There is strong evidence that advertising expenditure is a significant explanatory variable for Woodbon annual sales, when mortgage rates and housing starts are included in the model.

As always, while we can conclude that mortgage rates, housing starts, and advertising expenditure are significant explanatory variables for Woodbon annual sales, this does not mean that we can conclude that changes in these variables have caused the changes in sales.

Adjusted Multiple Coefficient of Determination

In Chapter 13, we used the coefficient of determination, or R², as an indication of how well the x-variable explained the variations in the y-variable. Remember that

$$R^2 = \frac{SSR}{SST}$$

Adding more explanatory variables to the regression model will never reduce the R² value, and generally will tend to increase it. In fact, if you have n data points, it is always possible to develop a model that will fit the data perfectly, with n − 1 explanatory variables. However, this model is not likely to yield good predictions, because it is only the result of a lot of arithmetic instead of good thinking about the relationships between the response and explanatory variables. Such a model is usually described as “overfitted.” It is possible to adjust the R² value to compensate for this tendency of R² to increase when another explanatory variable is added to the model.

It is easiest to see the relationship between the R² and the adjusted R² if we start with a restatement of R². We know SST = SSR + SSE, so SSR = SST − SSE. Substituting this into the formula for R² yields the following:

$$R^2 = \frac{SSR}{SST} = \frac{SST - SSE}{SST} = 1 - \frac{SSE}{SST}$$


The adjusted R² is calculated as follows:

$$\text{Adjusted } R^2 = 1 - \frac{SSE / (n - (k + 1))}{SST / (n - 1)}$$

The adjusted R² is calculated by Excel as part of the Regression output. The adjusted R² value will generally be smaller than the unadjusted R² value. As well, because the formula takes into account the number of explanatory variables being used (k), the adjusted R² will not necessarily increase when another variable is added to the model.

For the Woodbon model, the adjusted R² is 0.944, as shown below in Exhibit 14.17. Notice that the adjusted R² value, at 0.944, is less than the R² value of 0.951.
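The adjusted R² formula can be sketched in a few lines of Python. The function names and the sums of squares below are hypothetical, chosen so that the arithmetic is easy to check by hand; they are not the Woodbon values, which should be read from the Excel Regression output.

```python
def r_squared(sse, sst):
    """R^2 = 1 - SSE / SST."""
    return 1 - sse / sst

def adjusted_r_squared(sse, sst, n, k):
    """Adjusted R^2 = 1 - (SSE / (n - (k + 1))) / (SST / (n - 1))."""
    return 1 - (sse / (n - (k + 1))) / (sst / (n - 1))

# Hypothetical values: 24 data points, 3 explanatory variables
sse, sst, n, k = 100.0, 1000.0, 24, 3
print(round(r_squared(sse, sst), 3))
print(round(adjusted_r_squared(sse, sst, n, k), 3))
```

Because each sum of squares is divided by its degrees of freedom, adding an explanatory variable that barely reduces SSE can lower the adjusted value even though the unadjusted R² does not fall.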


EXHIBIT 14.17 Excel Regression Statistics for Woodbon Model, with Mortgage Rates, Housing Starts, and Advertising Expenditure as Explanatory Variables

DEVELOP YOUR SKILLS 14.3

The Develop Your Skills exercises in this chapter frequently refer to a data set called “Salaries.”

11. Apply the formula for the adjusted R² to verify the value shown in Exhibit 14.17. Note that the SSE and SST are shown in the Excel Regression output.

12. Conduct a test of the significance of the overall model for the salaries model which includes all explanatory variables. Test the significance of the individual explanatory variables. What does this tell you?

13. Conduct a test of the significance of the overall model for the salaries model that includes years of postsecondary education and age as explanatory variables. Test the significance of the individual explanatory variables. What does this tell you?

SALARIES


14. Conduct a test of the significance of the overall model for the salaries model that includes years of postsecondary education and years of experience as explanatory variables. Test the significance of the individual explanatory variables. What does this tell you?

15. Compare the adjusted R² values for the three models of salary from Exercises 12, 13, and 14 above. Based on the adjusted R², which model does not seem worth considering at this point?

14.4 MAKING PREDICTIONS

One of the reasons for building a multiple regression model for Woodbon annual sales was to allow Kate Cameron, Woodbon’s owner, to make sales predictions. Of course, the only explanatory variable in the present model that Kate can control is advertising expenditure. Kate will have to guess at the values of the other variables (mortgage rates and housing starts) if she wants to predict sales for the coming year. Suppose Kate plans to spend $3,000 on advertising next year, and she expects that five-year mortgage rates will be around 7% and that housing starts for the province will be 3,800.

By substituting these specific values into the regression equation, Kate arrives at a point estimate for Woodbon annual sales.

Woodbon annual sales = $80,640.56739 − $3,521.91472 (mortgage rate) − $5.47726 (housing starts) + $23.41351 (advertising expenditure)

= $80,640.56739 − $3,521.91472 (7) − $5.47726 (3,800) + $23.41351 (3,000)

= $105,414.11
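The point estimate can be reproduced with a short Python sketch. The coefficient values are those quoted in the regression equation; their signs follow the chapter's discussion (negative for mortgage rates and housing starts, positive for advertising expenditure). The dictionary keys and the function name are illustrative choices.

```python
# Coefficients from the Woodbon regression output quoted in the text
COEFFICIENTS = {
    "intercept": 80640.56739,
    "mortgage_rate": -3521.91472,
    "housing_starts": -5.47726,
    "advertising": 23.41351,
}

def predict_sales(mortgage_rate, housing_starts, advertising):
    """Point estimate of annual sales from the fitted regression equation."""
    return (COEFFICIENTS["intercept"]
            + COEFFICIENTS["mortgage_rate"] * mortgage_rate
            + COEFFICIENTS["housing_starts"] * housing_starts
            + COEFFICIENTS["advertising"] * advertising)

estimate = predict_sales(mortgage_rate=7, housing_starts=3800, advertising=3000)
print(f"${estimate:,.2f}")  # prints $105,414.11
```

This gives only the point estimate. The prediction and confidence intervals around it need the additional calculations described next.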

In Chapter 13, we also created prediction and confidence intervals from regression relationships. Remember, a regression prediction interval predicts a particular value of y (sales) for a set of specific values of the x-variables (in this case, mortgage rate, housing starts, and advertising expenditure). A regression confidence interval predicts the average y for a set of specific values of the x-variables.

While the formulas for prediction and confidence intervals were fairly simple to understand when there was only one explanatory variable (with a given value of x0), they become more complicated with two or more explanatory variables. Constructing these intervals for multiple regression requires the use of matrix algebra. An Excel add-in (Multiple Regression Tools) has been created to do these calculations. This add-in was first introduced in Chapter 13 (see page 514).

You should type the specific values of the explanatory variables that will be the basis of your intervals into adjacent columns in the spreadsheet containing the sample data. It is easiest to input the values in the correct order if they are typed at the bottom of the columns of explanatory variable data. Exhibit 14.18 illustrates (note that some of the rows of data have been hidden in the worksheet).

14.4


As before, the Confidence Interval and Prediction Intervals – Calculations tool in Multiple Regression Tools requires you to indicate the locations of the labels and values of the variables, the location of the specific values of the explanatory variables on which you want to base your intervals, a level of confidence (percentage form), and an output range. Example 14.4 below provides the results.


EXHIBIT 14.18 Typing in Specific Values of Explanatory Variables as Basis for Intervals

Calculating confidence and prediction intervals with Excel

Use Excel to create a prediction interval for Woodbon sales, when mortgage rates are 7%, housing starts are 3,800, and advertising expenditure is $3,000.

Once these specific values are typed into the spreadsheet, the Multiple Regression Tools add-in is used to create the following output (note that columns have been resized for visibility).

EXAMPLE 14.4

EXHIBIT 14.19 Confidence Interval and Prediction Intervals – Calculations Result for Woodbon Data


Remember, it is not legitimate to make predictions from the regression model for values of the explanatory variables that are outside the range of the values in the data set on which the model is based. So, for example, Kate should not rely on the model to make predictions for an advertising budget of $5,000, because Woodbon has never spent more than $3,500 on advertising in the past.
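A simple guard against this kind of extrapolation is to compare each proposed value with the range observed in the sample data. In the sketch below, only the $3,500 upper limit for advertising comes from the text. Every other bound is a hypothetical placeholder that would need to be replaced with the actual minimum and maximum from the Woodbon data set.

```python
def outside_observed_range(values, observed_ranges):
    """Return the names of variables whose proposed value is outside the sample range."""
    flagged = []
    for name, value in values.items():
        low, high = observed_ranges[name]
        if not (low <= value <= high):
            flagged.append(name)
    return flagged

# Hypothetical (min, max) ranges, except the advertising maximum of 3,500
observed_ranges = {
    "mortgage_rate": (5.0, 12.0),
    "housing_starts": (2000, 5000),
    "advertising": (500, 3500),
}

proposed = {"mortgage_rate": 7, "housing_starts": 3800, "advertising": 5000}
print(outside_observed_range(proposed, observed_ranges))
```

This checks each variable separately. A combination of values can still be unlike anything in the sample even when each value is individually within range, so the check is a minimum safeguard rather than a guarantee.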


DEVELOP YOUR SKILLS 14.4

The Develop Your Skills exercises in this chapter frequently refer to a data set called “Salaries.”

16. Use Excel to create a 95% confidence interval of average Woodbon sales, when mortgage rates are 6%, housing starts are 3,500, and advertising expenditure is $3,500. Interpret the interval.

17. Would it be appropriate to use the Woodbon model to make a prediction for mortgage rates of 6%, housing starts of 2,500, and advertising expenditure of $4,000? Explain why or why not.

18. Use the salaries model based on years of postsecondary education and age to make a 95% prediction interval estimate of the salary of an individual who is 35 years old and has five years of postsecondary education.

19. Use the salaries model based on years of postsecondary education and age to make a 95% confidence interval estimate of the average salary for individuals who are 35 years old and have five years of postsecondary education. Do you expect this confidence interval to be wider or narrower than the prediction interval estimate from Exercise 18? Why?

20. Use the salaries model based on years of postsecondary education and years of experience to make a 95% prediction interval estimate of the salary of an individual who has five years of postsecondary education and 10 years of experience.

SALARIES

A 95% prediction interval for Woodbon sales when mortgage rates are 7%, housing starts are 3,800, and advertising expenditure is $3,000, is ($91,016.42, $119,811.81). Notice that even though we have added explanatory variables to the model, and the fit is better than it was in the simple linear regression model, the prediction interval is still quite wide.

14.5 SELECTING THE APPROPRIATE EXPLANATORY VARIABLES

Let us recap the steps we have followed to build the Woodbon sales model.

1. We began by thinking carefully about what explanatory variables might reasonably be expected to have an impact on Woodbon’s sales, and we examined the relationship between each of these variables and sales.


2. We used Excel to estimate the multiple regression relationship between Woodbon sales and mortgage rates, housing starts, and a leading indicator for retail trade in furniture and appliances. We noted that the initial model had a negative coefficient for housing starts, the opposite of what we expected.

3. We used Excel to check whether the regression model met the required conditions. We examined plots of residuals against predicted sales, and residuals against each of the explanatory variables. We noted that the residual plot for the leading indicator did not exhibit the desired horizontal band shape. We also examined a plot of residuals over time, which did not exhibit any particular time-related pattern. We created a histogram of the residuals, and it appeared normal. We identified one outlier in the data set.

4. The model that included the leading indicator did not conform to the required conditions for linear regression (the variability in the residuals was not constant). For this and other reasons, we eliminated the leading indicator as an explanatory variable and recalculated the model. Again, we checked the required conditions, and this time did not identify any obvious violations of the required conditions. We identified one potential outlier, but because the associated data point was correct, we left it in the model.

5. We conducted an F-test of the significance of the overall model, and we were able to conclude there was a significant relationship between Woodbon sales and at least one of the remaining explanatory variables (mortgage rates, housing starts, and advertising expenditure).

6. We then conducted hypothesis tests about the coefficients for each explanatory variable. In each case, we concluded that there was evidence that each of the explanatory variables was significant, assuming the other explanatory variables were included in the model.

7. We examined the adjusted R² value, which was 0.94, indicating that the regression model explained a significant portion of the variability in Woodbon sales.

At this point, it might seem that we have done enough work and that we have the best possible model for Woodbon’s sales, based on these explanatory variables. However, we must reflect carefully before we settle for the new model. More complicated models are not necessarily better. We should realize that the adjusted R² value for the new model, at 0.94, is not that much higher than the adjusted R² of 0.88 for the model based on advertising expenditure alone. This might not be enough of an improvement to justify the extra data required.

As well, it is possible for the adjusted R² value to be high for a model that does not predict well. Ultimately, a model that does not provide useful predictions of future sales for Woodbon will not be worth maintaining.

Building a good regression model is a process that requires many steps, as we have seen. Fortunately, computers do the calculations easily and quickly, and so we can look at a number of possible models before choosing the “best” one. It is not always easy to determine which is the best regression model, but these are some goals to keep in mind.


One method of finding the “best” possible regression model is to create regressions for all possible combinations of the explanatory variables being considered. For the Woodbon data set, this would entail creating a regression model for each of the following:

1. sales and mortgage rates
2. sales and housing starts
3. sales and advertising expenditure
4. sales and housing starts and mortgage rates
5. sales and mortgage rates and advertising expenditure
6. sales and housing starts and advertising expenditure
7. sales and housing starts and mortgage rates and advertising expenditure.
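The list above is every non-empty subset of the three candidate variables, so there are 2³ − 1 = 7 models. A minimal Python sketch of that enumeration, using the standard library's `itertools.combinations`, is shown below. The function name is an illustrative choice, and the order of the printed models may differ slightly from the list above.

```python
from itertools import combinations

def all_subsets(variables):
    """Every non-empty combination of the candidate explanatory variables."""
    subsets = []
    for size in range(1, len(variables) + 1):
        subsets.extend(combinations(variables, size))
    return subsets

candidates = ["mortgage rates", "housing starts", "advertising expenditure"]
models = all_subsets(candidates)
for number, subset in enumerate(models, start=1):
    print(f"{number}. sales and " + " and ".join(subset))
```

The number of candidate models doubles with each additional explanatory variable, which is why this approach is practical only with software and a modest number of candidates.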

The resulting models can be assessed according to the criteria above. Some values that may be useful to compare models are as follows:

1. The adjusted R² values provide a measure of the strength of the relationship between the explanatory variables and the response variable.

2. The standard error (sε) gives some indication of how wide the confidence and prediction intervals would be. As discussed in Chapter 13, sε is the sample estimate of the standard deviation of the error terms (residuals) in the model. A model with less variability in the error terms will produce narrower and therefore more useful confidence and prediction intervals.

3. The number of explanatory variables gives some indication of the data requirements of the model. Adding an explanatory variable may reduce the width of confidence and prediction intervals, if it reduces sε. However, adding an explanatory variable will also decrease the degrees of freedom for the t-score (n − (k + 1)) used in the confidence and prediction intervals (each additional variable increases k by 1), and t-scores with smaller degrees of freedom are larger (look at the table of t-scores in Appendix 3 on page 581 if you are not sure of this). The larger t-score may at least partially offset any reduction in sε that results from adding an explanatory variable.

The Multiple Regression Tools add-in allows you to easily create all possible regression models from a data set. Exhibit 14.20 opposite shows the Multiple Regression Tools dialogue box, with the correct choice highlighted: All Possible Regressions Calculations.


Goals for Regression Models

1. The model should be easy to use. It should be reasonably easy to acquire data for the model’s explanatory variables.

2. The model should be reasonable. The coefficients should represent a reasonable cause-and-effect relationship between the response variable and the explanatory variables.

3. The model should make useful and reliable predictions. Prediction and confidence intervals should be reasonably narrow.

4. The model should be stable. It should not be significantly affected by small changes in explanatory variable data.


EXHIBIT 14.20 Multiple Regression Tools Add-In, All Possible Regressions Calculations

With this choice selected, click OK, and the next dialogue box will be as shown in Exhibit 14.21.

EXHIBIT 14.21 All Possible Regressions Dialogue Box

You are required to:

1. Input the locations of the response (y) and explanatory (x) variable labels (Sales for the response variable for the Woodbon data, and Mortgage Rates, Housing Starts, and Advertising Expenditure for the explanatory variables).

2. Input the locations of the response (y) and explanatory (x) variable values.

The output is illustrated in Example 14.5A on the next page.


Assess all possible regressions

Use Excel to create all possible regression models for Woodbon sales, using housing starts (which we will refer to as x1), mortgage rates (x2), and advertising expenditure (x3) as possible explanatory variables.

The results of the All Possible Regressions Calculations are shown below in Exhibits 14.22a and b. The output is quite long, and when you are working in Excel, you will have to scroll up and down to see everything.

EXAMPLE 14.5A

EXHIBIT 14.22 Results of All Possible Regressions for Woodbon Sales Model, with Housing Starts, Mortgage Rates, and Advertising Expenditure as Explanatory Variables.

a)


b)

So which model is best? While the results shown in Exhibit 14.22 (see footnote 6) can help us assess the various regression models, these values cannot tell the whole story. There is a trade-off when variables are added to the model. While the model may have more explanatory power, it will also be more complicated, and more difficult to maintain.

If we look at Exhibit 14.22, we see that the model with all three explanatory variables has the highest adjusted R² and the lowest standard error. However, the model has a negative coefficient for housing starts, which does not seem reasonable. As well, adding the housing starts variable improves the model only slightly over the one with just advertising expenditure and mortgage rates.

Whichever model we choose, it is important to check that it meets the required conditions described in Section 14.2. An F-test of the significance of the model should also be conducted. While it would be disappointing if the “best” model from all of the possible regressions did not meet the required conditions, it would not be legitimate to use the model for prediction or confidence intervals if it did not.

By now you can see that building a multiple regression model is an iterative process. It can take some time to build, assess, and ultimately decide on the preferred model. Fortunately, it is fairly easy to explore the possibilities with software such as Excel.
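For readers working outside Excel, the idea of fitting every combination of explanatory variables can be sketched in Python. This is only an illustration, not the Multiple Regression Tools add-in: the data are synthetic, the variable names x1, x2, x3 are placeholders, and the helper names `fit_summary` and `all_possible_regressions` are made up for this sketch. Each model is summarized by its adjusted R² and standard error, the two measures discussed above.

```python
from itertools import combinations

import numpy as np


def fit_summary(X, y):
    """Fit y on the columns of X (plus an intercept) by least squares.

    Returns the adjusted R-squared and the standard error of the model.
    """
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    sse = float(residuals @ residuals)
    sst = float(((y - y.mean()) ** 2).sum())
    df_error = n - (k + 1)
    adjusted_r2 = 1 - (sse / df_error) / (sst / (n - 1))
    standard_error = (sse / df_error) ** 0.5
    return adjusted_r2, standard_error


def all_possible_regressions(X, y, names):
    """Summarize every non-empty subset of the explanatory variables."""
    results = []
    for k in range(1, len(names) + 1):
        for cols in combinations(range(len(names)), k):
            adjusted_r2, standard_error = fit_summary(X[:, list(cols)], y)
            results.append(([names[c] for c in cols], k, adjusted_r2, standard_error))
    return results


# Synthetic data (an assumption for illustration, not the Woodbon data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 3))
y = 5 + 2 * X[:, 1] + 3 * X[:, 2] + rng.normal(scale=0.5, size=20)

results = all_possible_regressions(X, y, ["x1", "x2", "x3"])
for variables, k, adjusted_r2, standard_error in results:
    print(variables, k, round(adjusted_r2, 3), round(standard_error, 3))
```

With three candidate variables there are seven models to compare. The summary measures only narrow the choice; the preferred model still has to be checked against the required conditions.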

There is one more consideration that is important as we build multiple regression models. Whenever we use more than one explanatory variable, we have to consider that there may be interactions among these variables.

6 If your output for Model 5 does not match what is shown in this exhibit, see the note called “Excel’s Floating Point Problem” at the end of the chapter, after the Chapter Review Exercises.

A New Consideration: Multicollinearity

Collinearity occurs when two of the explanatory variables are related to each other. Multicollinearity occurs when more than two of the explanatory variables are related to each other. Generally, this potential problem with multiple regression is referred to as “multicollinearity.”

Multicollinearity may cause one or more of the following problems:

1. The adjusted R² is large and the F-test shows the overall model is significant, but one or more of the estimated regression coefficients are statistically insignificant.

2. The estimated regression coefficients are not stable. The values change significantly when explanatory variables are added to the regression relationship.

3. The estimated regression coefficients do not make sense. They are larger or smaller than would seem appropriate, or they have an unexpected sign.

Because of these problems, it is important to consider multicollinearity when building a multiple regression model. Some degree of multicollinearity is present in almost every multiple regression model.
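A small numerical sketch can suggest why coefficients become unstable when explanatory variables are nearly collinear. It relies on a standard least-squares result that is not stated in this chapter: the variance of each estimated coefficient is proportional to the matching diagonal entry of (XᵀX)⁻¹. The data are made up, and the helper name `coef_variance_factor` is hypothetical.

```python
import numpy as np

# Made-up data: x2 is almost a copy of x1.
x1 = np.arange(1.0, 11.0)
x2 = x1 + np.tile([0.1, -0.1], 5)

# Correlation between the two explanatory variables.
r = np.corrcoef(x1, x2)[0, 1]


def coef_variance_factor(*columns):
    """Diagonal entry of (X'X)^-1 for the first explanatory variable."""
    design = np.column_stack([np.ones(len(columns[0])), *columns])
    return np.linalg.inv(design.T @ design)[1, 1]


alone = coef_variance_factor(x1)          # x1 is the only explanatory variable
with_copy = coef_variance_factor(x1, x2)  # x1 together with its near-copy
inflation = with_copy / alone

print(round(r, 4), round(inflation, 1))
```

In this sketch the factor for the x1 coefficient grows by several hundred times once the near-copy is added, so small changes in the data can move the estimate a long way. This is consistent with problems 2 and 3 in the list above.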

The first method of guarding against multicollinearity is to choose the explanatory variables carefully. If mortgage rates, for instance, are being considered as an explanatory variable in the Woodbon model, it would not make sense to include prime rates or another mortgage rate in the model. Because such variables are likely highly correlated with each other, most of the explanatory power is gained when the first variable is introduced. The second variable will not likely tell us more about the response variable.

There are various methods aimed at identifying collinear variables if the relationship between them is not immediately obvious. One of these is to create a scatter diagram of the relationship of every explanatory variable with every other explanatory variable.

In the Woodbon model, if we are thinking of choosing the model with advertising expenditure and mortgage rates, we could create a scatter diagram of advertising expenditure and mortgage rates. Such a scatter diagram is illustrated in Exhibit 14.23 opposite. It appears there is a fairly strong negative correlation between the two variables.

Another method of assessing the correlation between the explanatory variables is to create a correlation matrix for the variables. This is easy to do with Excel, using the Correlation tool of Data Analysis. If you select the adjacent columns of data, Excel will produce a correlation matrix such as the one shown for the Woodbon problem in Exhibit 14.24. Note that it is helpful to select the labels along with the data when using this Excel tool.

The correlation coefficients tell us something about how the variables are related as pairs. Of course, it is also possible that one explanatory variable could be simultaneously related to two other explanatory variables, and neither the scatter diagrams nor the correlation matrix will reveal this. The scatter diagrams and the correlation matrix will help identify obvious pair-wise correlations between variables, but other harder-to-identify sources of multicollinearity may also be present.

Whenever the correlation coefficients are close to 1 or −1, there are potential problems with multicollinearity. For example, in Exhibit 14.24 opposite, we see a correlation coefficient of 0.946 between advertising expenditure and the leading indicator. If we had not already eliminated the leading indicator as a potential explanatory variable, we would probably have eliminated it because of its high correlation with advertising expenditure.
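The same pairwise check can be sketched in Python for readers who are not using Excel's Correlation tool. The data and variable names below are invented, and the cut-off of 0.8 is an assumption chosen for the sketch; the text says only that values close to 1 or −1 signal potential problems.

```python
import numpy as np


def flag_high_correlations(data, names, threshold=0.8):
    """Return the correlation matrix and the pairs with |r| >= threshold."""
    corr = np.corrcoef(data, rowvar=False)  # columns are variables
    flagged = []
    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if abs(corr[i, j]) >= threshold:
                flagged.append((names[i], names[j], round(float(corr[i, j]), 4)))
    return corr, flagged


# Invented data: var_b is an exact decreasing linear function of var_a.
data = np.array([
    [1.0, 12.0, 3.0],
    [2.0, 10.0, 1.0],
    [3.0, 8.0, 4.0],
    [4.0, 6.0, 1.0],
    [5.0, 4.0, 5.0],
    [6.0, 2.0, 9.0],
])
names = ["var_a", "var_b", "var_c"]

corr, flagged = flag_high_correlations(data, names)
print(np.round(corr, 3))
print(flagged)
```

As the text notes, a check like this finds only pairwise relationships. A variable that is related to two others at once can still slip through.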

Including the leading indicator variable in the model would have robbed advertising expenditure of its explanatory power.

The collinearity between advertising expenditure and the leading indicator is unexpected. There is no obvious reason for the two variables being connected. However, the mathematical connection is clear, and because of it we should not include both variables in the model.

The correlation coefficient between mortgage rates and advertising expenditure is −0.8424, confirming what we saw in the scatter diagram: there is a fairly strong negative relationship between the two variables. Does this mean we should reject this regression model?

The answer depends on how Kate Cameron intends to use the model. Because of the collinearity between mortgage rates and advertising expenditure, she should be careful interpreting the regression coefficients. In the Woodbon sales model that includes mortgage rates and advertising expenditure, y = $59,670.88 − $3,132.53x₂ + 22.74x₃, the x₃ coefficient is 22.74. Normally, we would interpret this coefficient as follows: for a given mortgage rate, each additional dollar spent on advertising increases Woodbon sales by $22.74. However, because of the collinearity, such an interpretation is not reliable. If Kate wants to know the true nature of the relationship between Woodbon sales and advertising, she should not rely on a model that includes both advertising expenditure and mortgage rates. However, if Kate’s only purpose is to predict Woodbon’s sales, this model is still probably the best, because it has a high adjusted R² and a fairly low standard error.

Example 14.5B below discusses a case in which there is strong collinearity between explanatory variables, and suggests a way to deal with the problem to improve the regression model.

EXHIBIT 14.23 Scatter Diagram of Mortgage Rates and Advertising Expenditure

EXHIBIT 14.24 Correlation Matrix for Woodbon Variables

EXAMPLE 14.5B Multiple regression, dealing with collinearity

Data file: EXA14-5b

A sales manager is trying to build a model to predict sales by region. He has data on population and total income in each region. The manager begins by creating a regression model including both total income and population as explanatory variables. An excerpt from the Excel output is shown below in Exhibit 14.25. Examine the output and explain the results.

EXHIBIT 14.25 Regression Output for Example 14.5B

The regression model has an adjusted R² value of 0.485. The model is significant, according to the F-test (the p-value is approximately zero). However, the p-values for each of the explanatory variables are quite large, indicating that none of them is significant. Since these are the only variables in the model, this does not make sense.

Total income for a region will of course be closely related to the population of the region. Regions with larger populations will have greater total incomes. In this data set, the correlation coefficient between total income and population is 0.936. If the sales manager wants to investigate the effect of income on sales, he should take out the population effect by working with per capita income. Adjusting the income data and exploring the new regression model are left to the reader as an exercise (see Develop Your Skills 14.5, Exercise 22).

DEVELOP YOUR SKILLS 14.5

The Develop Your Skills exercises in this chapter frequently refer to a data set called “Salaries.”

21. Does the Woodbon model that includes mortgage rates and advertising expenditure meet the required conditions for regression? If so, conduct an F-test on the significance of the model.

22. Adjust the total income data for Example 14.5B, and analyze the new regression model that includes both per capita income and population. Is the model significant? Are both explanatory variables significant? Continue your analysis of the data and decide on the best regression model for this data set.

23. Create a correlation matrix for the variables in the Salaries data set. Discuss which explanatory variables should not be used simultaneously, and which look most promising to explain salaries.

24. Create all other possible regression models for the Salaries data. The models will be based on the following explanatory variables:

a. years of postsecondary education alone
b. years of experience alone
c. age alone
d. age and years of experience

25. Compare all possible models for the Salaries data, and select the “best” regression model.

Data files: SALARIES, EXA14-5b, SEC14-2

14.6 USING INDICATOR VARIABLES IN MULTIPLE REGRESSION

The regression analysis discussed so far has investigated quantitative explanatory variables. Sometimes we are interested in the effect that a qualitative characteristic might have on a response variable (for example, male/female, urban/rural). It is possible to include such information in regression analysis with the use of indicator variables (sometimes called “dummy” variables). If the qualitative variable we are interested in is binary, that is, it has only two categories, then we can represent it with a single indicator variable (for example, “0” for male, “1” for female).

Indicator Variables for Qualitative Explanatory Variables with Only Two Categories

Once the qualitative variable is coded, it can be treated like any other variable in the regression analysis. For example, suppose we have data on the income and gender of the head of household for a random sample of credit card holders, as well as the monthly credit card bill.

Data file: SEC14-6

The Excel Regression output for the data set, with both explanatory variables, is shown on the next page in Exhibit 14.26.

EXHIBIT 14.26 Excel Regression Output for Credit Card Data Set

From the F statistic and the associated p-value, we can see that the model is significant. There appears to be a relationship between the monthly credit card bill and at least one of income and the gender of the head of household. As well, the p-value for the t-test of the significance of the gender variable is 0.023, indicating that it is significant in the model (for a significance level of 0.05, for instance).

The regression relationship is as follows (with some rounding of the coefficients):

Monthly credit card bill = $582 + $19 (income in $000) − $342 (gender variable)

Notice that this means the following:

• If the credit card holder is male (gender variable = 0), the regression relationship is:

Monthly credit card bill = $582 + $19 (income in $000)

• If the credit card holder is female (gender variable = 1), the regression relationship is:

Monthly credit card bill = $582 + $19 (income in $000) − $342

= ($582 − $342) + $19 (income in $000)

= $240 + $19 (income in $000)

A binary indicator variable is sometimes called a “shift” variable, because it shifts the y-intercept while leaving the slope unchanged. The two regression relationships are shown in Exhibit 14.27. The regression relationship predicting monthly credit card bills for male heads of household (the blue line in the graph) is $342 higher than the regression relationship predicting monthly credit card bills for female heads of household (the red line).

EXHIBIT 14.27 Effect of Gender Variable on Credit Card–Income Relationship

[Figure: “Monthly Credit Card Bills and Annual Income.” Horizontal axis: Annual Income ($000), 50 to 100. Vertical axis: Monthly Credit Card Bill, $0 to $3,000. Two lines: Predicted Monthly Credit Card Bill for Male Head of Household, and Predicted Monthly Credit Card Bill for Female Head of Household.]

Another way to think of the binary variable is as an on-off switch. When the switch is on (gender variable = 1), we are estimating the monthly credit card bills of female heads of household. When the switch is off (gender variable = 0), we are estimating the monthly credit card bills of male heads of household.
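The switch idea can be written as a short Python function. It uses the rounded coefficients quoted in the text ($582, $19, and $342); the function name and the income of $70,000 are chosen only for illustration.

```python
def predicted_monthly_bill(income_thousands, female):
    """Predicted monthly credit card bill, using the rounded coefficients.

    income_thousands: annual income in thousands of dollars
    female: the gender indicator (0 for male, 1 for female)
    """
    return 582 + 19 * income_thousands - 342 * female


# Same income, switch off (male) and switch on (female).
male_bill = predicted_monthly_bill(70, 0)    # 582 + 19 * 70 = 1912
female_bill = predicted_monthly_bill(70, 1)  # 1912 - 342 = 1570

print(male_bill, female_bill, male_bill - female_bill)
```

Whatever income is supplied, the two predictions differ by $342. This is the vertical shift between the two lines, and the slope of $19 per thousand dollars of income is the same for both.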

Adding gender to the regression model improves it over the model with income alone (in Develop Your Skills 14.6, Exercise 26, you will get a chance to explore this). However, we should not automatically conclude that gender is the cause of the difference in monthly credit card bills. Other factors such as wealth and the number of people in the household may be the cause of the differences we have observed in the credit card bills of male and female heads of household.

Since the end result of this regression analysis is two regression lines, one for each gender, you might wonder if it would be reasonable to skip the indicator variable, and instead build two regression models, one for male heads of household and one for female heads of household. The advantage of using the indicator variable approach is straightforward. Using the indicator variable in the model allows us to test whether gender has a significant effect on the credit card bills. If we simply build two models, we cannot conduct this statistical test. As well, pooling all of the data together allows us to make better estimates of the coefficients.

Another question you might have is this: is it possible to do regression analysis with only indicator variables, and no quantitative explanatory variables? The answer is yes, and the equivalent procedure is covered in Chapter 11. There, we tested to see if a quantitative response variable (such as battery life) varied according to levels of a qualitative explanatory factor (such as battery brand). The equivalence between the ANOVA procedures in Chapter 11 and regression with indicator variables is illustrated on the next page in Example 14.6. Since our ANOVA analysis considered factors with more than two levels, we will first consider regression using qualitative explanatory variables with more than two categories.

Indicator Variables for Qualitative Explanatory Variables with More Than Two Categories

Sometimes the qualitative variable we are interested in has more than two possible (mutually exclusive) results. For example, we might be interested in whether three different brands of battery are associated with different battery life in minutes.

In such a case, we can use a series of indicator variables that tell us about the presence (value = 1) or absence (value = 0) of the brand characteristic. In the case of three battery brands, we will use two indicator variables in combination. For example, we could set the Onever variable = 1 if the brand is Onever, and 0 otherwise. We would set the Durible variable = 1 if the brand is Durible, 0 otherwise. This is sufficient, because if both indicator variables are equal to zero, this necessarily means that the third brand category (PlusEnergy) must apply. Exhibit 14.28 below illustrates. Note that we could have used any of the possible categories as the “missing” one.

It is important to use one fewer indicator variables than categories, to avoid problems with the regression analysis. We know that the value of the PlusEnergy indicator variable would be equal to (1 − sum of the values of the indicator variables for the other battery brands). Including a third indicator variable in the model would violate the requirement for independence of the explanatory variables. In addition, it would cause an error in Excel.
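The reason for using one fewer indicator variable than categories can be checked numerically. The sketch below uses a made-up list of six batteries and compares the rank of the design matrix (an intercept column plus the indicator columns) with two and with three indicators. The brand names come from the text; everything else is illustrative.

```python
import numpy as np

# Made-up sample of six batteries.
brands = ["Onever", "Durible", "PlusEnergy", "Onever", "PlusEnergy", "Durible"]

onever = np.array([1.0 if b == "Onever" else 0.0 for b in brands])
durible = np.array([1.0 if b == "Durible" else 0.0 for b in brands])
plusenergy = np.array([1.0 if b == "PlusEnergy" else 0.0 for b in brands])
intercept = np.ones(len(brands))

# Two indicators: three columns that carry three independent pieces of information.
two_indicators = np.column_stack([intercept, onever, durible])

# Three indicators: four columns, but the last is 1 - onever - durible.
three_indicators = np.column_stack([intercept, onever, durible, plusenergy])

rank_two = np.linalg.matrix_rank(two_indicators)
rank_three = np.linalg.matrix_rank(three_indicators)

print(rank_two, rank_three)
```

Adding the third indicator adds a column but no new information, so the rank stays at three. In that situation the least-squares coefficients are no longer uniquely determined.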

Recognize that some common sense is required when introducing qualitative variables into regression analysis. If a qualitative variable has many possible categories, requiring the use of several indicator variables, and if more than one qualitative variable is being considered, the number of explanatory variables can quickly get very high, perhaps too high to be reasonable for small data sets. As always, think carefully before you introduce a qualitative variable (or any variable) into the analysis.

Example 14.6 below illustrates the use of indicator variables for a qualitative explanatory variable with more than two possible categories.

EXHIBIT 14.28 Two Indicator Variables for Three Battery Brands

Two Indicator Variables (in Combination) to Convey Three Battery Brands

Brand         Onever variable   Durible variable
Onever        1                 0
Durible       0                 1
PlusEnergy    0                 0

EXAMPLE 14.6 Regression with a qualitative explanatory variable

Data file: EXA14-6

The owner of a winery is wondering whether the average purchase of visitors to her winery differs according to age. She asks the cashiers to keep track of a random sample of purchases by customers in three age groups: under 30, 30–50, over 50. Because there is no good reason to ask a customer his or her age, the cashiers guess which age group a customer belongs to (and if they do not guess accurately, the research may not be helpful). Eventually, data from about 50 purchases made by customers in each of the age groups is collected. Does it appear that there is a relationship between customer age and the value of the winery purchase?

First, the winery data must be arranged with all purchases in one column, and the indicator variables set up to indicate age group. Exhibit 14.29 below provides an excerpt from an Excel spreadsheet where the data are set up as required.

EXHIBIT 14.29 Excerpt of Winery Purchase Data Set with Indicator Variables for Age Group

Next, run Excel’s Regression tool on the data set. The output is as shown in Exhibit 14.30 on the next page.

How do we interpret the regression equation?

• When the “under 30” indicator variable = 1, then:

Winery purchase for customers under 30 = $132.47 − $54.90 (1) − $12.80 (0) = $77.57

• When the “30–50” indicator variable = 1, then:

Winery purchase for customers aged 30–50 = $132.47 − $54.90 (0) − $12.80 (1) = $119.67

• When the “under 30” indicator variable = 0, and the “30–50” indicator variable = 0, then the age group is “over 50.”

Winery purchase for customers over 50 = $132.47 − $54.90 (0) − $12.80 (0) = $132.47

Notice that a hypothesis test about the overall regression would report an F statistic of 67.49, with a very low significance level. Thus we can conclude that there is evidence of a relationship between winery purchase and age group. As well, each of the indicator variables included in the regression is significant (at the 5% level of significance).

For comparison, the Excel output for Anova: Single Factor is shown in Exhibit 14.31 on the next page.
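The comparison between regression with indicator variables and single-factor ANOVA can also be tried in Python. The sketch below uses twelve made-up purchase values (not the winery data set) with “over 50” as the category without its own indicator, and computes the F statistic both ways so the two results can be checked side by side.

```python
import numpy as np

# Made-up purchases for three age groups (four customers each).
groups = {
    "under 30": np.array([70.0, 80.0, 75.0, 85.0]),
    "30-50": np.array([110.0, 120.0, 125.0, 115.0]),
    "over 50": np.array([130.0, 140.0, 135.0, 125.0]),
}

y = np.concatenate(list(groups.values()))
n = len(y)

# Regression with two indicator variables; "over 50" is the baseline.
under30 = np.repeat([1.0, 0.0, 0.0], 4)
mid = np.repeat([0.0, 1.0, 0.0], 4)
design = np.column_stack([np.ones(n), under30, mid])
coef, *_ = np.linalg.lstsq(design, y, rcond=None)
fitted = design @ coef
ssr = ((fitted - y.mean()) ** 2).sum()  # regression sum of squares
sse = ((y - fitted) ** 2).sum()         # residual sum of squares
f_regression = (ssr / 2) / (sse / (n - 3))

# Single-factor ANOVA on the same data.
between = sum(len(g) * (g.mean() - y.mean()) ** 2 for g in groups.values())
within = sum(((g - g.mean()) ** 2).sum() for g in groups.values())
f_anova = (between / (3 - 1)) / (within / (n - 3))

print(np.round(coef, 2), round(f_regression, 2), round(f_anova, 2))
```

In this sketch the intercept is the mean purchase of the baseline group, and each indicator coefficient is the difference between that group's mean and the baseline mean. This mirrors the way the winery regression equation was interpreted above.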

The F-statistic for the hypothesis test about equality of population means is also 67.49, with the same significance level as in the Regression output. Notice that both procedures analyze the data in a similar way. In the regression analysis, the total variation in the y-variable (winery purchases) is partitioned into the portion that can be explained by the x-variable (the age groups), and the portion that is unexplained (the residuals, or errors). In the analysis of variance, the F statistic compares the variation associated with the age groups (the “between groups” variation) with the unexplained variation (“within groups”). Of course, the sums of squares and the mean squares are exactly the same for the two procedures, because they are accomplishing the same tasks. This explains why some of the regression output has the “ANOVA” heading. The Regression sum of squares (SS) in the Regression output is the same as the Between Groups SS in the ANOVA output. The Residual SS in the Regression output is the same as the Within Groups SS in the ANOVA output.

EXHIBIT 14.30 Regression Output for Winery Purchase Data Set

EXHIBIT 14.31 Anova: Single Factor Excel Output for Winery Purchase Data Set

DEVELOP YOUR SKILLS 14.6

26. Examine the data set for credit card bills. Create a regression model for credit card bills based on income. Compare this with the regression model for credit card bills based on income and the gender of the head of household. Does adding the gender variable improve the model significantly?

27. Build a regression model, with indicator variables for battery brand, to assess whether there is a significant relationship between battery life in minutes and battery brand. This revisits the battery example in Chapter 11.

28. A sales manager is trying to build a sales forecasting model based on number of sales contacts and region. Is region a significant explanatory variable in this model?

29. A production manager has collected data on the number of units produced and the number of employees at work, for the day shift and the night shift. Is shift a significant explanatory variable for the number of units produced?

30. Statistics Canada collects census data about Canadians every five years. The department provides data files that contain a representative sample of anonymous individual responses to census surveys. A subset of these data is provided in the file DYS14-30. There is information on Canadians from Alberta and Ontario. Age and wages and salaries are shown for individuals who had non-zero wages and salaries in the data set (see footnote 7). Use an indicator variable for province, and build a regression model for wages and salaries, based on age and province. Is province a significant explanatory variable in this model?

Data files: SEC14-6, DYS14-27, DYS14-28, DYS14-29, DYS14-30

7 Data for this exercise are a subset of the data available in the StatsCan microfile. Only age, wages, and salaries for those with non-zero wages and salaries data are used, for only Alberta and Ontario.

14.7 MORE ADVANCED MODELLING

This chapter has been an introduction to building mathematical models of linear relationships between quantitative response variables and two or more explanatory variables. Within the chapter we have seen many modelling possibilities.

You should be aware that more advanced mathematical modelling techniques exist, which are beyond the scope of this text. It is possible to build models that are polynomial, to account for curvature in the relationships. There are special techniques for time-series trend analysis. With the appropriate training and good computer software, it is possible to build complex and sophisticated models of relationships. However, complex models are not necessarily the “best” models. The simplest model that provides useful predictions is preferred.

Finally, always remember that mathematical models generally cannot prove cause and effect, and we should always be careful in interpreting the results of model building. Even if a model appears to work very well, the true cause-and-effect relationship may not have been revealed.

Chapter Summary

Determining the Relationship

Begin by thinking carefully about the explanatory variables that might reasonably be expected to affect the response variable. Create scatter plots to examine the relationship between the response variable and each explanatory variable. Use Excel’s Regression tool to estimate the coefficients of the multiple regression relationship.

Checking the Required Conditions

Theoretically, there is a normal distribution of possible y-values for every combination of x-values. The population relationship we are trying to model is as follows:

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

We cannot reliably make predictions with the regression equation, or conduct a hypothesis test about the significance of the regression relationship, unless certain conditions are met (these are summarized in the box on page 532). The Guide to Technique: Checking Requirements for the Linear Multiple Regression Model on page 540 outlines a process for checking the required conditions for the regression model.

How Good Is the Regression?

If the required conditions are met, conduct an F-test of the significance of the relationship. This will be of the form:

$$H_0: \beta_1 = \beta_2 = \cdots = \beta_k = 0$$

$$H_1: \text{At least one of the } \beta_i\text{'s is not zero.}$$

The F statistic is

$$F = \frac{SSR / k}{SSE / (n - (k + 1))} = \frac{MSR}{MSE}$$

with (k, n − (k + 1)) degrees of freedom, where n is the number of observed data points and k is the number of explanatory variables in the model. When the response variable is related to at least one of the explanatory variables, MSR will be significantly larger than MSE.

The output of Excel’s Regression tool provides the F statistic and the p-value for this test. See page 544 for instructions on how to read the output.

If the results of the F-test show that the model is significant, t-tests of the significance of the individual explanatory variables can be conducted. The test of the coefficient of explanatory variable i is conducted as follows.

$$H_0: \beta_i = 0$$

$$H_1: \beta_i \neq 0$$

The test statistic is

$$t = \frac{b_i}{s_{b_i}}$$

with (n − (k + 1)) degrees of freedom.

The output of Excel’s Regression tool provides the p-values for the two-tailed tests of significance for the individual coefficients. See page 545 for instructions on how to read the output.

The adjusted R² value is a measure of the strength of the relationship between the explanatory variables and the response variable.

$$\text{Adjusted } R^2 = 1 - \frac{SSE / (n - (k + 1))}{SST / (n - 1)}$$

The adjusted R² is calculated by Excel as part of the Regression output.
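The summary statistics above can be computed directly from the sums of squares. The sketch below is an illustration, not Excel's Regression tool: the eight data points are made up, the function name `regression_summary` is hypothetical, and the coefficient standard errors use the standard least-squares formula (the square root of MSE times the diagonal of (XᵀX)⁻¹), which this chapter leaves to Excel.

```python
import numpy as np


def regression_summary(X, y):
    """F statistic, t statistics, R-squared and adjusted R-squared for y on X."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    fitted = design @ coef
    sse = float(((y - fitted) ** 2).sum())
    sst = float(((y - y.mean()) ** 2).sum())
    ssr = sst - sse
    df_error = n - (k + 1)
    mse = sse / df_error
    f_stat = (ssr / k) / mse
    # Standard errors of the coefficients (standard least-squares formula).
    std_errors = np.sqrt(mse * np.diag(np.linalg.inv(design.T @ design)))
    t_stats = coef / std_errors
    adjusted_r2 = 1 - (sse / df_error) / (sst / (n - 1))
    return {
        "coef": coef,
        "f_stat": f_stat,
        "df": (k, df_error),
        "t_stats": t_stats,
        "r2": ssr / sst,
        "adjusted_r2": adjusted_r2,
    }


# Made-up data: eight observations, two explanatory variables.
X = np.array([[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2], [7, 6], [8, 4]], dtype=float)
y = np.array([10, 9, 19, 9, 21, 13, 22, 20], dtype=float)

summary = regression_summary(X, y)
print(summary["df"], round(summary["f_stat"], 2), round(summary["adjusted_r2"], 3))
```

The p-values that Excel reports would still have to come from the F and t distributions with the degrees of freedom shown. The sketch stops at the test statistics.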

Making Predictions

Two types of estimation intervals can be created if the requirements are met. A regression prediction interval predicts a particular value of y, given a specific set of x-values. A regression confidence interval predicts the average y, given a specific set of x-values. The Multiple Regression Tools Excel add-in (Prediction and Confidence Intervals Calculations) calculates these intervals (see page 548). Always remember that it is not legitimate to make predictions outside the range of the sample data.
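The difference between the two intervals can be sketched in Python. The formulas below are the standard least-squares ones and are an assumption here; they are not taken from this chapter or from the Excel add-in. The data are made up, the function name is hypothetical, and the critical value 2.571 is the t-table value for 5 degrees of freedom at 95% confidence, supplied by hand.

```python
import numpy as np


def interval_half_widths(X, y, x_new, t_crit):
    """Point prediction plus half-widths of the confidence and prediction intervals."""
    n, k = X.shape
    design = np.column_stack([np.ones(n), X])
    coef, *_ = np.linalg.lstsq(design, y, rcond=None)
    residuals = y - design @ coef
    s = np.sqrt(residuals @ residuals / (n - (k + 1)))  # standard error of the model
    x0 = np.concatenate([[1.0], x_new])
    leverage = x0 @ np.linalg.inv(design.T @ design) @ x0
    point = float(x0 @ coef)
    ci_half = float(t_crit * s * np.sqrt(leverage))      # interval for the average y
    pi_half = float(t_crit * s * np.sqrt(1 + leverage))  # interval for a particular y
    return point, ci_half, pi_half


# Made-up data: eight observations, two explanatory variables.
X = np.array([[1, 5], [2, 3], [3, 8], [4, 1], [5, 7], [6, 2], [7, 6], [8, 4]], dtype=float)
y = np.array([10, 9, 19, 9, 21, 13, 22, 20], dtype=float)
x_new = np.array([4.5, 4.5])

# Only predict inside the range of the sample data.
inside_range = bool(np.all(X.min(axis=0) <= x_new) and np.all(x_new <= X.max(axis=0)))

point, ci_half, pi_half = interval_half_widths(X, y, x_new, t_crit=2.571)
print(inside_range, round(point, 2), round(ci_half, 2), round(pi_half, 2))
```

The prediction interval is always the wider of the two, because a particular value of y varies around the average y for the same x-values.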

Selecting the Appropriate Explanatory Variables

The goals of a good regression model include the following:

1. The model should be easy to use. It should be reasonably easy to acquire data for the model’s explanatory variables.

2. The model should be reasonable. The coefficients should represent a reasonable cause-and-effect relationship between the response variable and the explanatory variables.

3. The model should make useful and reliable predictions. Prediction and confidence intervals should be reasonably narrow.

4. The model should be stable. It should not be significantly affected by small changes in explanatory variable data.

Use Excel to create all possible regression models for all combinations of possible explanatory variables. The Multiple Regression Tools Excel add-in (All Possible Regressions Calculations) makes this easy to do. The add-in produces a summary report which shows each regression model, the adjusted R² value, the standard error (sε), and the number of variables (k) for each model. Use these measures to select a “best” model. Be sure to check the model chosen to see that it meets the required conditions. Review the appropriate p-values to ensure that the overall model is significant, and that the individual explanatory variables are significant.

Multicollinearity occurs when one of the explanatory variables is highly correlated with one or more of the other explanatory variables. This can result in unstable or inaccurate regression coefficients. To guard against this problem, choose explanatory variables carefully. Create scatter diagrams of explanatory variable pairs, and create a correlation matrix for all the variables in the model. If there is a pronounced pattern visible in the scatter diagram or a high correlation for a pair of variables, consider including only one of them in the final model.

Using Indicator Variables in Multiple Regression

The effect of a qualitative characteristic on a response variable (for example, male/female, urban/rural) can be modelled with indicator variables (sometimes called “dummy” variables). If the qualitative variable we are interested in is binary, we can represent it with a single indicator variable (for example, “0” for male, “1” for female). If the qualitative variable has more than two possible (mutually exclusive) results, we can use a series of indicator variables that tell us about the presence (value = 1) or absence (value = 0) of the qualitative characteristic. It is important to use one fewer indicator variables than categories, to avoid problems with the regression analysis.

Go to MyStatLab at www.mathxl.com. You can practise the exercises indicated with red in the Develop Your Skills and Chapter Review Exercises as often as you want, and guided solutions will help you find answers step by step. You’ll find a personalized study plan available to you too!

CHAPTER REVIEW EXERCISES

The Chapter Review Exercises allow you to explore two major data sets. One, called “Credit Card,” contains data about credit card balances and possible explanatory variables such as income and number of people in the household. Another data set, called “Marks,” contains data about the final exam mark and marks on evaluations done during the semester (assignments, tests, and quizzes).

Data files: CREDIT CARD, MARKS

WARM-UP EXERCISES

1. The multiple regression model for monthly credit card balances and the age of the head of household, income (in thousands of dollars), and the value of the home (in thousands of dollars) is described in the Excel output shown below in Exhibit 14.32. Interpret the model.

EXHIBIT 14.32 Excel Output for Monthly Credit Card Balances, Income in Thousands, and the Number of People in the Household

2. Refer to the Excel output shown in Exhibit 14.32 above. Is the overall model significant? You will have to estimate the p-value from the tables at the back of the text. Use a 5% level of significance.

3. Refer to the Excel output shown in Exhibit 14.32 above. Test the individual coefficients for significance. Use a 5% level of significance.

4. Given your answers to Exercises 2 and 3, what concerns do you have about the model? A correlation matrix for the values in the model is shown below in Exhibit 14.33.

EXHIBIT 14.33 Correlation Matrix for Credit Card Data Set

THINK AND DECIDE

5. Consider the Marks data set, where the goal is to predict the final exam mark. Possible explanatory variables are the marks on Assignments 1 and 2, Tests 1 and 2, and Quiz marks. The tests and the final exam are written in a classroom, with all computations done manually with a calculator. The quizzes are done with online testing software, and the calculations can be done manually with a calculator or with Excel. Students can attempt the quizzes as many times as they wish before the due date (the quizzes are similar but not the same). The assignments are Excel-based and include a written report on the Excel analysis. Based on this information, which of the evaluations do you think would be the best predictor of the final exam mark, and why?

6. Exhibit 14.34 below shows the correlation matrix for the Marks data set. Is there any concern about multicollinearity? Which explanatory variables seem the most promising?

EXHIBIT 14.34 Correlation Matrix for Marks Data Set

7. Exhibit 14.35 on the next page shows the output of the All Possible Models Calculations tool in Multiple Regression Tools for all of the Marks models with one explanatory variable. Given these results, is there one model that you would choose as better than the rest? If so, explain why.


THINK AND DECIDE USING EXCEL

8. Use the model for credit card balances illustrated in Exhibit 14.32 to create a 95% prediction interval for the monthly credit card balance of a credit card holder where the age of the head of household is 45, income is $65,000, and the value of the home is $175,000. Do you think this regression model is useful?

9. Use Excel to create all possible Marks models, and then consider those that have two explanatory variables. Note that there are 10 such models. Which of these models is best, and why? Is this model a real improvement on the best single-variable model? Explain.


EXHIBIT 14.35 All Possible Models Calculations Output for Marks Models with One Explanatory Variable


10. Check that the best model you selected in Exercise 9 meets the required conditions.

11. For the Marks data set, create and examine all models with three explanatory variables that include the mark on Test 2. Note that there will be six of these models. Does any of these represent a real improvement on the best two-variable model you selected in Exercise 9? Explain.

12. Create a regression model for the Marks data using all of the explanatory variables. In light of the work you did in Exercises 9, 10, and 11, is this the best model? Explain.

13. Use the model you decided was best for the Marks data to predict the final exam mark of a student who received a mark of 55 on Assignment 1, 60 on Test 1, 65 on Assignment 2, 70 on Test 2, and 95 on the quizzes.

14. A researcher has collected a random sample of data about Honda Accords for sale in Ontario. The data indicate year of the car, number of kilometres, and list price. Create and analyze all possible regression models for these data. Be sure to check that the required conditions are met. Is your model useful in terms of predicting the list price of used Honda Accords?

15. Think about your analysis in Exercise 14. Is the year of the car a quantitative variable? Create an indicator variable for the year of the car, and rebuild the model. Describe the models, and choose the best one.

16. A chain of retail outlets famous for their delicious (if unhealthy) doughnuts is looking for a new location. The company is trying to use data on local median income, population in the local area, and traffic flows by a proposed location to decide where to open a new store. The company has collected data for a number of existing stores. Investigate these data, and make a recommendation to the company about how to proceed.

17. A researcher has discovered some extra data for the doughnut store location decision described in Exercise 16 above. Information was collected about whether each location was within a five-minute drive of a major highway (1 = within a five-minute drive of a major highway, 0 = otherwise). Re-analyze the data, including this extra information.

18. An MBA (Master of Business Administration) student decides to see if he can predict the Standard and Poor’s Toronto Stock Exchange Composite Index from the share prices of one or more Canadian companies.

a. The student collects monthly historical price data (November 2002 to November 2008) for stocks from some important Canadian sectors⁸:
• Rona Incorporated, the largest Canadian distributor and retailer of hardware, home renovation, and gardening products
• Royal Bank of Canada, a major Canadian bank
• Petro Canada, a Canadian oil and gas company with international interests
• Potash Corporation of Saskatchewan, an integrated producer of fertilizer, industrial, and animal feed products
Do any of these stocks (or a combination of these stocks) provide a good predictor of the TSX Composite Index?

b. During the fall of 2008, the world economy experienced an unprecedented crisis and stock markets around the world gyrated wildly. Is there evidence of this in the data you examined in part a of this question? Would it be wise to try to build a model to predict the TSX Composite Index using data from this period? Explain.

8 Yahoo! Finance Canada, “Potash Corp Sask Com NPV (POT.TO), Royal Bk of Canada Com NPV (RY.TO), S&P/TSX Composite index (Interi ^GSPTSE), Rona Inc Com NPV (RON.TO), Petro Canada Com NPV (PCA.TO), Historical Prices, November 20, 2008,” http://ca.finance.yahoo.com, accessed November 20, 2008.

19. Statistics is a course with a bad reputation. Students tend to expect that they will have difficulty with the course, even when they do not know exactly what the course is about. A student decides that he wants to place Statistics in a proper context, and he collects data on a random sample of students studying in their third semester (the beginning of second year). The student attempts to predict the Statistics mark from the marks in other courses. Help him by deciding which of the possible models is best.

20. Check the required conditions of the model you chose in Exercise 19. If the required conditions are not met, do some further analysis and develop a model that will predict the Statistics mark and meets the required conditions.

21. Use the best model you created in Exercise 20 to predict the Statistics mark of a student who received 65 in all of the other courses.
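The model counts quoted in Exercises 9 and 11 (10 two-variable models, and six three-variable models that include Test 2) follow from simple combinations of the five explanatory variables in the Marks data set. A minimal Python sketch can confirm them; the variable labels below are shorthand chosen for illustration and are not the column names in the data file.

```python
from itertools import combinations

# Explanatory variables named in the Marks exercises (labels are illustrative).
predictors = ["Assignment 1", "Assignment 2", "Test 1", "Test 2", "Quizzes"]

# Exercise 9: every model with exactly two explanatory variables.
two_variable_models = list(combinations(predictors, 2))

# Exercise 11: three-variable models that must include Test 2,
# that is, Test 2 plus any two of the remaining four variables.
others = [p for p in predictors if p != "Test 2"]
three_with_test2 = [("Test 2",) + pair for pair in combinations(others, 2)]

print(len(two_variable_models))  # 10
print(len(three_with_test2))     # 6
```

Listing the combinations this way is also a convenient checklist when building the models one at a time.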

TEST YOUR KNOWLEDGE

22. Marcharpex is a company selling specialized software products to a select number of manufacturing companies. The company relies on senior salespeople to make contacts, sell the product, and provide a company contact for after-sales support. The company is wondering if its sales model is effective, and it has collected some data on
• the years of experience of the salesperson
• the monthly travel and entertainment budget of the salesperson
• the local advertising budget (monthly) for the salesperson’s area
• the sales in the area

Analyze the data, and select the best model to predict sales. Be sure to check the required conditions. Once you select the best model, create a 95% prediction interval (approximate) of the sales for a salesperson who has 15 years of experience, a monthly travel and entertainment budget of $2,000, and a local advertising budget of $4,000.

A NOTE ABOUT EXCEL’S FLOATING POINT PROBLEM

Depending on your computer, you may have seen a different result in your output for the Woodbon model with mortgage rates and advertising expenditure. The model might have been y = $10.31 − $0.87x₂ + 0.01x₃, with an adjusted R² of 0.948, and a standard error of only $2! The first time the author ran the regression in Excel (on an older computer) this was the result. However, if we examine this model, we see that it does not make any sense. If advertising expenditure were $3,000 and mortgage rates were 7%, this model predicts that Woodbon’s sales would be $10.31 − $0.87(7) + 0.01($3,000) = $34.22. This prediction is clearly unreasonable. This is a good lesson in using some common sense, and not relying too much on measures such as R² to choose a regression model. But what went wrong?
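The sanity check described above is easy to repeat in code. The short Python sketch below plugs the same inputs into the implausible model, reading it as y = 10.31 − 0.87x₂ + 0.01x₃ (the sign pattern that reproduces the quoted $34.22). The variable names are chosen here for illustration only.

```python
# Coefficients of the implausible model quoted in the text.
intercept = 10.31
b_mortgage = -0.87      # per percentage point of mortgage rate (x2)
b_advertising = 0.01    # per dollar of advertising expenditure (x3)

mortgage_rate = 7       # percent
advertising = 3000      # dollars

predicted_sales = intercept + b_mortgage * mortgage_rate + b_advertising * advertising
print(round(predicted_sales, 2))  # 34.22
```

A prediction of about $34 for sales is a strong hint that the fitted coefficients, not the data, are at fault.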

Excel has a “floating point” problem that sometimes produces inaccurate results when the data used to calculate the model are on significantly different scales. Advertising expenditures range from $500 to $3,500 and sales range from $21,334 to $115,320. In contrast, mortgage rates range from 5.99 to 18.38. Because mortgage rates vary only by units (as these rates are expressed), and the other variables vary by hundreds or thousands, the scales are not the same general order of magnitude. The alternate model shown above does not make sense, because Excel and an older computer did not successfully handle the situation. While such a problem does not arise often, it can be a good idea to scale your input data to the same order of magnitude. As well, if Excel produces nonsensical results, you should check the scale of your input data and re-scale if necessary.
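One general way to see why scale can matter numerically is the condition number of the matrix that a least-squares routine has to work with: the larger it is, the more digits can be lost in finite-precision arithmetic. The sketch below is an illustration under stated assumptions, not a reproduction of Excel's internal method. It uses synthetic predictors drawn on the ranges quoted above (not the Woodbon data), and it compares the raw predictors with standardized ones, which is a different and more aggressive rescaling than the one the text suggests.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 60

# Synthetic predictors on the scales described in the text (not the Woodbon data).
mortgage_rate = rng.uniform(5.99, 18.38, n)   # varies by units
advertising = rng.uniform(500, 3500, n)       # varies by hundreds or thousands

def design(*columns):
    """Design matrix with a leading column of ones for the intercept."""
    return np.column_stack([np.ones(n), *columns])

def standardize(x):
    """Centre on zero and scale to unit standard deviation."""
    return (x - x.mean()) / x.std()

raw = design(mortgage_rate, advertising)
standardized = design(standardize(mortgage_rate), standardize(advertising))

# Condition number of X'X, the matrix inverted when the normal equations are solved.
cond_raw = np.linalg.cond(raw.T @ raw)
cond_standardized = np.linalg.cond(standardized.T @ standardized)

print("raw:", cond_raw)
print("standardized:", cond_standardized)
```

In this setup the raw design has a far larger condition number than the standardized one, which is consistent with the advice to check the scale of the inputs when output looks nonsensical.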

Fortunately, the fix is easy. If this happened to you, simply change the scale of the mortgage rates by multiplying by 100, for example. A mortgage rate data point of 14.52083 becomes 1452.083, which is effectively 100x₂. Then the mortgage rate data points will vary by hundreds, and will be on a similar scale with the other explanatory variable in the model. The multiple regression model that includes these adjusted mortgage rate data points and advertising expenditure is:

y = $59,670.88 − $31.3253(100x₂) + 22.74x₃.

This can of course be rewritten as

y = $59,670.88 − $3,132.53x₂ + 22.74x₃.

This model matches the output in Exhibit 14.22.
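The rewriting step above relies on a general property of least squares: multiplying an explanatory variable by 100 divides its coefficient by 100 and leaves the intercept and the other coefficients unchanged. The following Python sketch checks this on synthetic data generated on roughly the scales described in the note. The data, the generating coefficients, and the helper `fit` are all assumptions made for illustration, not the Woodbon data set or Excel's procedure.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 60

# Synthetic data (not the Woodbon data): sales fall with mortgage rates
# and rise with advertising, plus random noise.
mortgage_rate = rng.uniform(5.99, 18.38, n)
advertising = rng.uniform(500, 3500, n)
sales = 60000 - 3000 * mortgage_rate + 23 * advertising + rng.normal(0, 5000, n)

def fit(x2, x3, y):
    """Least-squares coefficients (intercept, b2, b3)."""
    X = np.column_stack([np.ones(len(y)), x2, x3])
    coef, *_ = np.linalg.lstsq(X, y, rcond=None)
    return coef

b_raw = fit(mortgage_rate, advertising, sales)
b_scaled = fit(mortgage_rate * 100, advertising, sales)

# Multiplying the rescaled coefficient by 100 recovers the per-unit coefficient.
b_recovered = np.array([b_scaled[0], b_scaled[1] * 100, b_scaled[2]])

print(b_raw)
print(b_recovered)
```

The two printed coefficient vectors agree to within rounding, which mirrors how $31.3253(100x₂) becomes $3,132.53x₂ in the text.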
