Chapter B
Multiple Linear Regression

Introduction

Case Study

Chapter Review


INTRODUCTION

In Chapter 14 we studied formal inference in the setting of a linear relationship between a response variable y and a single explanatory variable x. Now we turn to the more complex setting in which several explanatory variables work together to explain the response. This is the idea of multiple linear regression. The descriptive tools we learned in Chapter 3 (scatterplots, least-squares regression, and correlation) are essential preliminaries to inference about the multiple linear regression model. However, the introduction of several explanatory variables leads to many additional considerations. In this short chapter we cannot explore all of these issues. Rather, we will outline some basic facts about inference in the multiple regression setting and illustrate multiple regression analysis with a case study.

Here are some examples of questions we can answer from data using the methods in this chapter:

• We want to predict the college grade point average of newly admitted students. We have data on their high school grades in several subjects and their scores on the two parts of the SAT. How well can we predict college grades from this information? Do high school grades or SAT scores predict college grades more accurately?

• How do banks determine the interest rates that they will charge for a loan? To what extent do the rates depend on characteristics of the loan and characteristics of the borrower?

Population multiple regression equation

The simple linear regression model assumes that the mean of the response variable y depends on the explanatory variable x according to a linear equation

μy = β0 + β1x

For any fixed value of x, the response y varies normally around this mean and has a standard deviation σ that is the same for all values of x.

In the multiple regression setting, the response variable y depends on not one but several explanatory variables. We will denote these explanatory variables by x1, x2, . . . , xp. The mean response is a linear function of the explanatory variables:

μy = β0 + β1x1 + β2x2 + . . . + βpxp

This expression is the population regression equation. We can think of subpopulations of responses, each corresponding to a particular set of values for all of the explanatory variables x1, x2, . . . , xp. In each subpopulation, y varies normally with a mean given by the population regression equation. The regression model assumes that the standard deviation σ of the responses is the same in all subpopulations. We do not observe the mean response, μy, because the observed values of y vary about their subpopulation means.


EXAMPLE 16.1 PREDICTING COLLEGE SUCCESS

Our case study uses data collected at a large university on all first-year computer science majors in a particular year.1 The purpose of the study was to attempt to predict success in the early university years. One measure of success was the cumulative grade point average (GPA) after three semesters. Among the explanatory variables recorded at the time the students enrolled in the university were average high school grades in mathematics (HSM), science (HSS), and English (HSE).

We will use high school grades to predict the response variable GPA. There are three explanatory variables: x1 = HSM, x2 = HSS, and x3 = HSE. The high school grades are coded on a scale from 1 to 10, with 10 corresponding to A, 9 to A–, 8 to B+, and so on. These grades define the subpopulations. For example, the straight-C students are the subpopulation defined by HSM = 4, HSS = 4, and HSE = 4.

One possible multiple regression model for the subpopulation mean GPAs is

μGPA = β0 + β1HSM + β2HSS + β3HSE

For the straight-C subpopulation of students, the model gives the subpopulation mean as

μGPA = β0 + 4β1 + 4β2 + 4β3


Data for multiple regression

The data for a simple linear regression problem consist of observations (xi, yi) on the two variables. Because there are several explanatory variables in multiple regression, the notation needed to describe the data is more elaborate. In our case study, each observation or case consists of a value for the response variable and for each of the explanatory variables for one student. Call xij the value of the jth explanatory variable for the ith student. The data are then

Student 1: (x11, x12, . . . , x1p, y1)

Student 2: (x21, x22, . . . , x2p, y2)

. . .

Student n: (xn1, xn2, . . . , xnp, yn)

Here, n is the number of cases and p is the number of explanatory variables. Data are often entered into computer regression programs in this format. Each row is a case and each column corresponds to a different variable. The data for Example 16.1, with several additional explanatory variables, appear in this format in the CSDATA file in the Datasets folder on this CD.
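To make this layout concrete, here is a minimal sketch in Python with pandas (our illustration, not part of the original text). The column names follow the case study, but the rows are invented for display; the real cases are in the CSDATA file.

```python
import pandas as pd

# Each row is one case (student); each column is one variable.
# The values below are hypothetical, for illustration only.
data = pd.DataFrame({
    "HSM":  [10,   6,   8,   9,   7,   5,   9,  10],
    "HSS":  [10,   8,   9,  10,   6,   7,   8,   9],
    "HSE":  [10,   5,   7,   9,   8,   6,   8,   9],
    "SATM": [670, 610, 510, 640, 520, 480, 600, 700],
    "SATV": [600, 390, 580, 530, 470, 440, 500, 610],
    "GPA":  [3.32, 2.26, 2.35, 3.05, 2.45, 1.87, 2.78, 3.51],
})
print(data)   # one row per case, one column per variable
```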


CASE STUDY

In this section we prepare for multiple regression by analyzing the data from the study described in Example 16.1. The response variable is the cumulative GPA after three semesters for a group of computer science majors at a large university. The explanatory variables previously mentioned are average high school grades, represented by HSM, HSS, and HSE. We also examine the SAT Mathematics (SATM) and SAT Verbal (SATV) scores as explanatory variables. We have data for n = 224 students in the study. We use SAS to illustrate the outputs that are given by most software. The first step in the analysis is to carefully examine each of the variables.

Because the variables GPA, SATM, and SATV have many possible values, we could use stemplots or histograms to examine the shapes of their distributions. Figure 16.1 shows histograms of these three variables. The distribution of SATM scores for our sample of 224 computer science majors (Figure 16.1(a)) is fairly symmetric. There are a couple of unusually low scores apparent in the SATM histogram. The histogram of SATV scores for these students (Figure 16.1(b)) is slightly right-skewed. A quick comparison of these two plots reveals that the center of the SATM distribution is considerably higher than that of the SATV distribution. In other words, these computer science majors tended to earn much better scores on the Math section of the SAT than on the Verbal section. Figure 16.1(c) shows that after three semesters in college, these 224 students' cumulative GPAs (measured on a 4-point scale) are strongly left-skewed.

It is important to note that the multiple regression model does not require any of these distributions to be normal. Only the deviations of the responses y from their means are assumed to be normal. The purpose of examining these plots is to understand something about each variable alone before attempting to use it in a complicated model. Extreme values of any variable should be noted and checked for accuracy. If found to be correct, the cases with these values should be carefully examined to see if they are truly exceptional and perhaps do not belong in the same analysis with the other cases. When these data are examined in this way, no obvious problems are evident.

The high school grade variables HSM, HSS, and HSE have relatively few values and are best summarized by giving the relative frequencies for each possible value. The output in Figure 16.2 (page x) provides these summaries. The distributions are all skewed, with a large proportion of high grades (10 = A and 9 = A–). Again we emphasize that these distributions need not be normal.

Next we compute numerical summaries. Means, standard deviations, and minimum and maximum values appear in Figure 16.3 (page x).

The minimum value for the SATM variable appears to be rather extreme; it is (595 − 300)/86 = 3.43 standard deviations below the mean. We do not discard this case at this time but will take care in our subsequent analyses to see if it has an excessive influence on our results. The mean for the SATM score is higher than the mean for the SATV score, as we might expect for a group of computer science majors. The two standard deviations are about the same.
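These summaries, and the standardized distance just computed, take one line each in most packages. A sketch (ours), assuming the `data` DataFrame from the earlier sketch; with the full n = 224 cases it would reproduce Figure 16.3:

```python
# Numerical summaries in the style of Figure 16.3.
cols = ["GPA", "SATM", "SATV", "HSM", "HSS", "HSE"]
print(data[cols].agg(["count", "mean", "std", "min", "max"]))

# How extreme is the minimum SATM score?  With the full data this is
# the (595 - 300)/86 = 3.43 standard deviations reported in the text.
z = (data["SATM"].mean() - data["SATM"].min()) / data["SATM"].std()
print(f"minimum SATM is {z:.2f} standard deviations below the mean")
```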


The means of the three high school grade variables are similar, with the mathematics grades being a bit higher. The standard deviations for the high school grade variables are very close to each other. The mean GPA is 2.635 on a 4-point scale, with standard deviation 0.779.


FIGURE 16.1 Histograms of SATM (a), SATV (b), and the response variable GPA (c) for the case study described in Example 16.1. Frequency is on the vertical axis; SATM ranges from 300 to 800, SATV from 250 to 750, and GPA from 0 to 4.


                              Cumulative  Cumulative
HSM    Frequency    Percent    Frequency     Percent
-----------------------------------------------------
  2            1        0.4            1         0.4
  3            1        0.4            2         0.9
  4            4        1.8            6         2.7
  5            6        2.7           12         5.4
  6           23       10.3           35        15.6
  7           28       12.5           63        28.1
  8           36       16.1           99        44.2
  9           59       26.3          158        70.5
 10           66       29.5          224       100.0

                              Cumulative  Cumulative
HSS    Frequency    Percent    Frequency     Percent
-----------------------------------------------------
  3            1        0.4            1         0.4
  4            7        3.1            8         3.6
  5            9        4.0           17         7.6
  6           24       10.7           41        18.3
  7           42       18.8           83        37.1
  8           31       13.8          114        50.9
  9           50       22.3          164        73.2
 10           60       26.8          224       100.0

                              Cumulative  Cumulative
HSE    Frequency    Percent    Frequency     Percent
-----------------------------------------------------
  3            1        0.4            1         0.4
  4            4        1.8            5         2.2
  5            5        2.2           10         4.5
  6           23       10.3           33        14.7
  7           43       19.2           76        33.9
  8           49       21.9          125        55.8
  9           52       23.2          177        79.0
 10           47       21.0          224       100.0

FIGURE 16.2 The distributions of the high school grade variables.

Variable    N         Mean      Std Dev      Minimum      Maximum
---------------------------------------------------------------------
GPA       224    2.6352232    0.7793949    0.1200000    4.0000000
SATM      224  595.2857143   86.4014437  300.0000000  800.0000000
SATV      224  504.5491071   92.6104591  285.0000000  760.0000000
HSM       224    8.3214286    1.6387367    2.0000000   10.0000000
HSS       224    8.0892857    1.6996627    3.0000000   10.0000000
HSE       224    8.0937500    1.5078736    3.0000000   10.0000000
---------------------------------------------------------------------

FIGURE 16.3 Descriptive statistics for the computer science student case study.


Relationships between pairs of variables

The second step in our analysis is to examine the relationships between all pairs of variables. Scatterplots and correlations are our tools for studying two-variable relationships. The correlations appear in Figure 16.4. The output includes the P-value for the test of the null hypothesis that the population correlation is 0 versus the two-sided alternative for each pair. Thus, we see that the correlation between GPA and HSM is 0.44, with a P-value of 0.0001, whereas the correlation between GPA and SATV is 0.11, with a P-value of 0.087. The first is statistically significant by any reasonable standard, and the second is rather marginal.

The high school grades all have higher correlations with GPA than do the SAT scores. As we might expect, math grades have the highest correlation (r = 0.44), followed by science grades (0.33) and then English grades (0.29). The two SAT scores have a rather high correlation with each other (0.46), and the high school grades also correlate well with each other (0.45 to 0.58). The SATM score correlates well with HSM (0.45), less well with HSS (0.24), and rather poorly with HSE (0.11). The correlations of the SATV score with the three high school grades are about equal, ranging from 0.22 to 0.26.

It is important to keep in mind that by examining pairs of variables we are seeking a better understanding of the data. The fact that the correlation of a particular explanatory variable with the response variable does not achieve statistical significance does not necessarily imply that it will not be a useful (and significant) predictor in a multiple regression.
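A table like Figure 16.4 can be computed pair by pair. A sketch with scipy (our illustration; the output shown below comes from SAS), again assuming the `data` DataFrame from earlier; `pearsonr` returns r together with the two-sided P-value for the hypothesis of zero population correlation:

```python
from itertools import combinations

from scipy.stats import pearsonr

# r and the two-sided P-value (H0: population correlation is 0)
# for every pair of variables, as in Figure 16.4.
cols = ["GPA", "SATM", "SATV", "HSM", "HSS", "HSE"]
for a, b in combinations(cols, 2):
    r, p = pearsonr(data[a], data[b])
    print(f"{a:>4} vs {b:>4}: r = {r:6.3f}, P = {p:.4f}")
```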


Pearson Correlation Coefficients / Prob > |R| under Ho: Rho=0 / N = 224

          GPA       SATM      SATV      HSM       HSS       HSE

GPA     1.00000   0.25171   0.11449   0.43650   0.32943   0.28900
         0.0      0.0001    0.0873    0.0001    0.0001    0.0001

SATM    0.25171   1.00000   0.46394   0.45351   0.24048   0.10828
        0.0001     0.0      0.0001    0.0001    0.0003    0.1060

SATV    0.11449   0.46394   1.00000   0.22112   0.26170   0.24371
        0.0873    0.0001     0.0      0.0009    0.0001    0.0002

HSM     0.43650   0.45351   0.22112   1.00000   0.57569   0.44689
        0.0001    0.0001    0.0009     0.0      0.0001    0.0001

HSS     0.32943   0.24048   0.26170   0.57569   1.00000   0.57937
        0.0001    0.0003    0.0001    0.0001     0.0      0.0001

HSE     0.28900   0.10828   0.24371   0.44689   0.57937   1.00000
        0.0001    0.1060    0.0002    0.0001    0.0001     0.0

FIGURE 16.4 Correlations among the case study variables.


Numerical summaries such as correlations are useful, but plots are generally more informative when seeking to understand data. Plots tell us whether the numerical summary gives a fair representation of the data. For a multiple regression, each pair of variables should be plotted. For the six variables in our case study, this means that we should examine 15 plots. In general there are p + 1 variables in a multiple regression analysis with p explanatory variables, so that p(p + 1)/2 plots are required. Multiple regression is a complicated procedure. If we do not do the necessary preliminary work, we are in serious danger of producing useless or misleading results. We leave the task of making these plots as an exercise.
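Software can draw all of these plots at once as a scatterplot matrix. A minimal sketch (ours), assuming the `data` DataFrame from the earlier sketches:

```python
import matplotlib.pyplot as plt
from pandas.plotting import scatter_matrix

# All p(p + 1)/2 = 15 pairwise plots in one grid; the diagonal shows
# each variable's histogram.
scatter_matrix(data[["GPA", "SATM", "SATV", "HSM", "HSS", "HSE"]],
               figsize=(10, 10), diagonal="hist")
plt.show()
```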

EXERCISES

The following two exercises relate to the case study described in this section. They require use of the CSDATA file in the Datasets folder on this CD.

16.1 Use software or your calculator to make a plot of GPA versus SATM. Do the same for GPA versus SATV. Describe the general patterns. Are there any unusual values?

16.2 Make a plot of GPA versus HSM. Do the same for the other two high school grade variables. Describe the three plots. Are there any outliers or influential points?

The following five exercises use the CHEESE file in the Datasets folder on this CD. Here is a brief description of those data.

As cheddar cheese matures, a variety of chemical processes take place. The taste of matured cheese is related to the concentration of several chemicals in the final product. In a study of cheddar cheese from the LaTrobe Valley of Victoria, Australia, samples of cheese were analyzed for their chemical composition and were subjected to taste tests.2

Data for one type of cheese-manufacturing process appear below. The variable "Case" is used to number the observations from 1 to 30. "Taste" is the response variable of interest. The taste scores were obtained by combining the scores from several tasters.

Three of the chemicals whose concentrations were measured were acetic acid, hydrogen sulfide, and lactic acid. For acetic acid and hydrogen sulfide (natural) log transformations were taken. Thus the explanatory variables are the transformed concentrations of acetic acid ("Acetic") and hydrogen sulfide ("H2S") and the untransformed concentration of lactic acid ("Lactic").

Case   Taste   Acetic     H2S   Lactic

 01     12.3    4.543   3.135     0.86
 02     20.9    5.159   5.043     1.53
 03     39.0    5.366   5.438     1.57
 04     47.9    5.759   7.496     1.81
 05      5.6    4.663   3.807     0.99
 . . .

16.3 For each of the four variables in the CHEESE data set, find the mean, median, standard deviation, and interquartile range. Display each distribution by means of a stemplot and use a normal probability plot to assess normality of the data. Summarize your findings. Note that when doing regressions with these data, we do not assume that these distributions are normal. Only the residuals from our model need be (approximately) normal. The careful study of each variable to be analyzed is nonetheless an important first step in any statistical analysis.

16.4 Make a scatterplot for each pair of variables in the CHEESE data set (you will have six plots). Describe the relationships. Calculate the correlation for each pair of variables and report the P-value for the test of zero population correlation in each case.

16.5 Perform a simple linear regression analysis using Taste as the response variable and Acetic as the explanatory variable. Be sure to examine the residuals carefully. Summarize your results. Include a plot of the data with the least-squares regression line. Plot the residuals versus each of the other two chemicals. Are any patterns evident? (The concentrations of the other chemicals are lurking variables for the simple linear regression.)

16.6 Repeat the analysis of Exercise 16.5 using Taste as the response variable and H2S as the explanatory variable.

16.7 Repeat the analysis of Exercise 16.5 using Taste as the response variable and Lactic as the explanatory variable.

Multiple linear regression model

We combine the population regression equation and assumptions about variation to construct the multiple linear regression model. The statistical model for linear regression consists of the population regression line and a description of the variation of y-values about the line. The following equation expresses the idea:

DATA = FIT + RESIDUAL

The subpopulation means describe the FIT part of our statistical model. The RESIDUAL part represents the variation of observations about the means. To perform multiple linear regression, we need to know that the deviations of individual y-values about their subpopulation means are normally distributed with mean 0 and an unknown standard deviation σ that does not depend on the values of the x variables.


CONDITIONS FOR MULTIPLE LINEAR REGRESSION

In order to construct a multiple linear regression model, we need to know that:

• The mean response μy is a linear function of the explanatory variables:

μy = β0 + β1x1 + β2x2 + . . . + βpxp

• The residuals are independent and normally distributed with mean 0 and standard deviation σ. In other words, they are an SRS from the N(0, σ) distribution.

The parameters of the model are β0, β1, β2, . . . , βp, and σ.


The assumption that the subpopulation means are related to the regression coefficients β by the equation

μy = β0 + β1x1 + β2x2 + . . . + βpxp

implies that we can estimate all subpopulation means from estimates of the β's. To the extent that this equation is accurate, we have a useful tool for describing how the mean of y varies with the x's.

Estimation of the multiple regression parameters

For simple linear regression we used the principle of least squares to obtain estimators of the intercept and slope of the regression line. For multiple regression the principle is the same but more complicated. Let

b0, b1, b2, . . . , bp

denote the estimators of the parameters

β0, β1, β2, . . . , βp

For the ith observation the predicted response is ŷi, the value of y we obtain by substituting the x-values for this observation in the equation ŷ = b0 + b1x1 + b2x2 + . . . + bpxp. The ith residual, the difference between the observed and predicted response, is therefore

residuali = observed response − predicted response
          = yi − ŷi

The method of least squares chooses the values of the b's that make the sum of the squares of the residuals as small as possible. The formula for the least-squares estimates is complicated. We will be content to understand the principle on which they are based and to let software do the computations.
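To make the principle concrete, here is a minimal sketch (ours, not the textbook's) that computes the least-squares estimates directly with numpy, assuming the `data` DataFrame from the earlier sketch:

```python
import numpy as np

# Design matrix: a column of 1s (for b0) plus one column per
# explanatory variable; lstsq minimizes the sum of squared residuals.
X = np.column_stack([np.ones(len(data)),
                     data[["HSM", "HSS", "HSE"]].to_numpy()])
y = data["GPA"].to_numpy()

b, _, _, _ = np.linalg.lstsq(X, y, rcond=None)
y_hat = X @ b              # predicted responses
resid = y - y_hat          # observed minus predicted
print("b0, b1, b2, b3 =", np.round(b, 4))
print("sum of squared residuals =", np.sum(resid**2).round(4))
```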

The parameter σ2 measures the variability of the responses about the population regression equation. As in the case of simple linear regression, we estimate σ2 by an average of the squared residuals. The estimator is

s2 = Σ residuali2 / (n − p − 1) = Σ(yi − ŷi)2 / (n − p − 1)

The quantity n − p − 1 is the degrees of freedom associated with s2. The degrees of freedom equal the sample size n minus p + 1, the number of β's we must estimate to fit the model. In the simple linear regression case there is just one explanatory variable, so p = 1 and the degrees of freedom are n − 2. To estimate σ we use

s = √s2

Confidence intervals and significance tests for regression coefficients

We can obtain confidence intervals and perform significance tests for each of the regression coefficients βj as we did in simple linear regression. The standard errors of the b's have more complicated formulas, but all are multiples of s. We again rely on statistical software for the calculations.


CONFIDENCE INTERVALS AND SIGNIFICANCE TESTS FOR βj

A level C confidence interval for βj is

bj ± t* SEbj

where SEbj is the standard error of bj, and t* is the value for the t(n − p − 1) density curve with area C between −t* and t*.

To test the hypothesis H0: βj = 0, compute the t statistic

t = bj / SEbj

In terms of a random variable T having the t(n − p − 1) distribution, the P-value for a test of H0 against

Ha: βj > 0 is P(T ≥ t)

Ha: βj < 0 is P(T ≤ t)

Ha: βj ≠ 0 is 2P(T ≥ |t|)


Because regression is often used for prediction, we may wish to construct confidence intervals for a mean response and prediction intervals for a future observation from multiple regression models. The basic ideas are the same as in the simple linear regression case. In most software systems, the same commands that give confidence and prediction intervals for simple linear regression work for multiple regression. The only difference is that we specify a list of explanatory variables rather than a single variable. Modern software allows us to perform these rather complex calculations without an intimate knowledge of all of the computational details. This frees us to concentrate on the meaning and appropriate use of the results.

Analysis of variance for regression

The usual computer output for regression includes additional calculations called analysis of variance. Analysis of variance, often abbreviated ANOVA, is essential for multiple regression and for comparing several means (Chapter 15). Analysis of variance summarizes information about the sources of variation in the data. It is based on the

DATA = FIT + RESIDUAL

framework.

The total variation in the response y is expressed by the deviations yi − ȳ. If these deviations were all 0, all observations would be equal and there would be no variation in the response. There are two reasons why the individual observations yi are not all equal to their mean ȳ. First, the responses yi correspond to different values of the explanatory variables x1, x2, . . . , xp and will differ because of that. The fitted value ŷi estimates the mean response for the specific values of the xi's. The differences ŷi − ȳ reflect the variation in mean response due to differences in the values of the x variables. This variation is accounted for by the regression line, because the ŷ's lie exactly on the line. Second, individual observations will vary about their mean because of variation within the subpopulation of responses to a fixed set of values of x1, x2, . . . , xp. This variation is represented by the residuals yi − ŷi that record the scatter of the actual observations about the fitted line. The overall deviation of any y observation from the mean of the y's is the sum of these two deviations:

(yi − ȳ) = (ŷi − ȳ) + (yi − ŷi)

For deviations, this equation expresses the idea that DATA = FIT + RESIDUAL. We have several times measured variation by an average of squared deviations. If we square each of the three deviations above and then sum over all n observations, it is an algebraic fact that the sums of squares add:

Σ(yi − ȳ)2 = Σ(ŷi − ȳ)2 + Σ(yi − ŷi)2


We rewrite this equation as

SST = SSM + SSE

where

SST = Σ(yi − ȳ)2

SSM = Σ(ŷi − ȳ)2

SSE = Σ(yi − ŷi)2

The SS in each abbreviation stands for sum of squares, and the T, M, and E stand for total, model, and error, respectively. ("Error" here stands for deviations from the line, which might better be called "residual" or "unexplained variation.") The total variation, as expressed by SST, is the sum of the variation due to the straight-line model (SSM) and the variation due to deviations from this model (SSE). This partition of the variation in the data between two sources is the heart of analysis of variance.

In simple linear regression, our null hypothesis was H0: β1 = 0. This meant that there was no linear relationship between the explanatory and the response variables. For multiple regression, we test the null hypothesis H0: β1 = β2 = . . . = βp = 0 against the alternative hypothesis Ha: at least one of the βj is not 0. The null hypothesis in this case says that none of the explanatory variables are predictors of the response variable when used in the form expressed by the multiple linear regression equation. If our linear model does not help us predict values of the response variable, then our simplest alternative is to use ȳ as our predicted y-value. In this situation, we would use

sy2 = Σ(yi − ȳ)2 / (n − 1)

as our estimate of the variability of the responses about the line y = ȳ. The numerator in this expression is SST. The denominator is the total degrees of freedom, or simply DFT.

Just as the total sum of squares SST is the sum of SSM and SSE, the total degrees of freedom DFT is the sum of DFM and DFE, the degrees of freedom for the model and for the error:

DFT = DFM + DFE

The model has p explanatory variables x1, x2, . . . , xp, so the degrees of freedom for this source are DFM = p. Because DFT = n − 1, this leaves DFE = n − p − 1 as the degrees of freedom for error. For each source, the ratio of the sum of squares to the degrees of freedom is called the mean square, or simply MS. The general formula for a mean square is

MS = sum of squares / degrees of freedom

Each mean square is an average squared deviation. MST is just sy2, the sample variance that we would calculate if all of the data came from a single population. MSE is familiar to us:

MSE = s2 = Σ(yi − ŷi)2 / (n − p − 1)

It is our estimate of σ2, the variance about the population regression line.

SUMS OF SQUARES, DEGREES OF FREEDOM, AND MEAN SQUARES

Sums of squares represent variation present in the responses. They are calculated by summing squared deviations. Analysis of variance partitions the total variation between two sources.

The sums of squares are related by the formula

SST = SSM + SSE

That is, the total variation is partitioned into two parts, one due to the model and one due to deviations from the model.

Degrees of freedom are associated with each sum of squares. They are related in the same way:

DFT = DFM + DFE

To calculate mean squares, use the formula

MS = sum of squares / degrees of freedom

The ANOVA calculations are displayed in an analysis of variance table, often abbreviated ANOVA table. Here is the general form of the table for multiple regression:

Source   Degrees of freedom   Sum of squares   Mean square   F

Model    p                    Σ(ŷi − ȳ)2       SSM/DFM       MSM/MSE
Error    n − p − 1            Σ(yi − ŷi)2      SSE/DFE
Total    n − 1                Σ(yi − ȳ)2       SST/DFT

The null hypothesis H0: β1 = β2 = . . . = βp = 0 that y is not linearly related to any of the explanatory variables can be tested by comparing MSM with MSE. The ANOVA test statistic is an F statistic. When H0 is true, this statistic has an F distribution with p degrees of freedom in the numerator and n − p − 1 degrees of freedom in the denominator. These degrees of freedom are those of MSM and MSE. A typical F density curve is right-skewed. Larger values of F give evidence against H0.
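As an illustration (ours, not the textbook's), the ANOVA quantities can be assembled directly from the fit produced in the statsmodels sketch above; the code verifies that SST = SSM + SSE and computes F and its P-value:

```python
import numpy as np
from scipy.stats import f as f_dist

# ANOVA decomposition for the fit from the earlier sketch
# (GPA on HSM, HSS, HSE): SST = SSM + SSE, MS = SS/DF, F = MSM/MSE.
y = data["GPA"].to_numpy()
y_hat = fit.fittedvalues.to_numpy()

SST = np.sum((y - y.mean()) ** 2)
SSM = np.sum((y_hat - y.mean()) ** 2)
SSE = np.sum((y - y_hat) ** 2)

n, p = len(y), 3
F = (SSM / p) / (SSE / (n - p - 1))
P = f_dist.sf(F, p, n - p - 1)     # area to the right of F
print(f"SST = {SST:.4f}, SSM + SSE = {SSM + SSE:.4f}")
print(f"F = {F:.3f} on ({p}, {n - p - 1}) df, P = {P:.4f}")
```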


ANOVA for simple linear regression

In the case of simple linear regression, the hypotheses

H0: β1 = 0

Ha: β1 ≠ 0

are tested by the F statistic F = MSM/MSE. When H0 is true, this statistic has an F distribution with 1 degree of freedom in the numerator and n − 2 degrees of freedom in the denominator.

The F statistic tests the same null hypothesis as the t statistic we met in Chapter 14, so it is not surprising that the two are related. It is an algebraic fact that t2 = F in this case. For linear regression with one explanatory variable, we usually prefer the t form of the test because it more easily allows us to test one-sided alternatives and is closely related to the confidence interval for β1.


ANALYSIS OF VARIANCE F TEST

In the multiple regression model, the hypothesis

H0: β1 = β2 = . . . = βp = 0

is tested by the analysis of variance F statistic

F = MSM/MSE

The P-value is the probability that a random variable having the F(p, n − p − 1) distribution is greater than or equal to the calculated value of the F statistic.

Squared multiple correlation R2

For simple linear regression we noted that the square of the sample correlation could be written as the ratio of SSM to SST and could be interpreted as the proportion of variation in y explained by x. A similar statistic is routinely calculated for multiple regression.


THE SQUARED MULTIPLE CORRELATION

The statistic

R2 = SSM / SST = Σ(ŷi − ȳ)2 / Σ(yi − ȳ)2

is the proportion of the variation of the response variable y that is explained by the explanatory variables x1, x2, . . . , xp in a multiple linear regression.

Often, R2 is multiplied by 100 and expressed as a percent. The square root of R2, called the multiple correlation coefficient, is the correlation between the observations yi and the predicted values ŷi.
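Continuing the ANOVA sketch above (our illustration), R2 can be computed both ways to confirm that they agree:

```python
import numpy as np

# Two equivalent routes to R^2, using SST, SSM, y, and y_hat from the
# ANOVA sketch above: the proportion of variation explained, and the
# squared correlation between observed and predicted responses.
print("SSM/SST          =", round(SSM / SST, 4))
print("corr(y, y_hat)^2 =", round(np.corrcoef(y, y_hat)[0, 1] ** 2, 4))
```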

EXAMPLE 16.2 FORTUNE 500 COMPANIES

A company with a good reputation can attract highly qualified applicants for jobs, can charge premium prices for its products, and can attract investors. Is reputation, a subjective assessment, related to objective measures of corporate performance? A study designed to answer this and other questions examined the records of 154 Fortune 500 firms.3

Reputation was measured by a Fortune magazine survey, and performance was defined as profitability, the return on invested capital. We use a regression to predict the response variable profitability (PROFIT) from the explanatory variable reputation (REPUTAT). Figure 16.5 gives the SAS output, including the ANOVA table. The output matches the ANOVA table format just given and has an additional column for the P-value. Note that the label C Total is used in the output rather than Total.


Dependent Variable: PROFIT

Analysis of Variance

                    Sum of          Mean
Source      DF     Squares        Square      F Value      Prob>F

Model        1     0.18957       0.18957       36.492      0.0001
Error      152     0.78963       0.00519
C Total    153     0.97920

Root MSE     0.07208     R-Square     0.1936
Dep Mean     0.10000     Adj R-sq     0.1883
C.V.        72.07575

                     Parameter Estimates

              Parameter      Standard     T for H0:
Variable  DF   Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1  -0.147573    0.04139259        -3.565        0.0005
REPUTAT    1   0.039111    0.00647442         6.041        0.0001

FIGURE 16.5 Regression output with ANOVA table.


The calculated value of the F statistic is 36.492 and the P-value is 0.0001. There is strong evidence against the null hypothesis that there is no relationship between reputation and profitability. Note that the t statistic for REPUTAT is 6.041. The square of this t is 36.493, which agrees very well with the F statistic given on the output. Whenever numbers are rounded off in calculations such as this, there will be roundoff error.

Note that the slope b1 is positive. We conclude that companies that have better reputations also have higher profitability (F = 36.49, df = 1 and 152, P = 0.0001). We see that r2 = 0.1936; so 19% of the variability in profitability is explained by reputation.

EXERCISES

16.8 OXYGEN UPTAKE AND HEART RATE The human body takes in more oxygen when exercising than when it is at rest. To deliver the oxygen to the muscles, the heart must beat faster. Heart rate is easy to measure, but measuring oxygen uptake requires elaborate equipment. If oxygen uptake (VO2) can be accurately predicted from heart rate (HR), the predicted values can replace actually measured values for various research purposes. Unfortunately, not all human bodies are the same, so no single prediction equation works for all people. Researchers can, however, measure both HR and VO2 for one person under varying sets of exercise conditions and calculate a regression equation for predicting that person's oxygen uptake from heart rate. They can then use predicted oxygen uptakes in place of measured uptakes for this individual in later experiments. Here are data for one individual:4

HR:    94    96    95    95    94    95    94   104   104   106
VO2: 0.473 0.753 0.929 0.939 0.832 0.983 1.049 1.178 1.176 1.292

HR:   108   110   113   113   118   115   121   127   131
VO2: 1.403 1.499 1.529 1.599 1.749 1.746 1.897 2.040 2.231

(a) Plot the data. Are there any outliers or unusual points?

(b) Compute the least-squares regression line for predicting oxygen uptake from heart rate for this individual.

(c) Test the null hypothesis that the slope of the regression line is 0. Explain in words the meaning of your conclusion from this test.

(d) Calculate a 95% prediction interval for the oxygen uptake of this individual on a future occasion when his heart rate is 95. Repeat the calculation for heart rate 110.

(e) From what you have learned in (a), (b), (c), and (d) of this exercise, do you think that the researchers should use predicted VO2 in place of measured VO2 for this individual under similar experimental conditions? Explain your answer.

16.9 ANOVA FOR OXYGEN UPTAKE AND HEART RATE Return to the oxygen uptake and heart rate data given in the previous exercise.

(a) Construct the ANOVA table.

(b) What null hypothesis is tested by the ANOVA F statistic? What does this hypothesis say in practical terms?


(c) Give the degrees of freedom for the F statistic and an approximate P-value for the test of H0.

(d) Verify that the square of the t statistic that you calculated in Exercise 16.8 is equal to the F statistic in your ANOVA table. (Any difference found is due to roundoff error.)

(e) What proportion of the variation in oxygen uptake is explained by heart rate for this set of data?

The next exercise uses the CONCEPT file in the Datasets folder on this CD. The data were collected on 78 seventh-grade students in a rural midwestern school. The research concerned the relationship between the students' "self-concept" and their academic performance. The variables are OBS, a subject identification number; GPA, grade point average; IQ, score on an IQ test; AGE, age in years; SEX, female (1) or male (2); SC, overall score on the Piers-Harris Children's Self-Concept Scale; and C1 to C6, "cluster scores" for specific aspects of self-concept: C1 = behavior, C2 = school status, C3 = physical appearance, C4 = anxiety, C5 = popularity, and C6 = happiness. Here are the first 5 cases:

OBS    GPA   IQ   AGE  SEX  SC   C1  C2  C3  C4  C5  C6

001  7.940  111   13    2   67   15  17  13  13  11   9
002  8.292  107   12    2   43   12  12   7   7   6   6
003  4.643  100   13    2   52   11  10   5   8   9   7
004  7.470  107   12    2   66   14  15  11  11   9   9
005  8.882  114   12    1   58   14  15  10  12  11   6

16.10 Find the correlations between the response variable GPA and each of the explanatory variables IQ, AGE, SEX, SC, and C1 to C6. Of all the explanatory variables, IQ does the best job of explaining GPA in a simple linear regression with only one variable. How do you know this without doing all of the regressions? What percent of the variation in GPA can be explained by the straight-line relationship between GPA and IQ?

Regression on high school grades

To explore the relationship between the explanatory variables and our response variable GPA, we run several multiple regressions. The explanatory variables fall into two classes. High school grades are represented by HSM, HSS, and HSE, and standardized tests are represented by the two SAT scores. We begin our analysis by using high school grades to predict GPA. Figure 16.6 gives the multiple regression output.

The output contains an ANOVA table, some additional descriptive statistics, and information about the parameter estimates. When examining any ANOVA table, it is a good idea to first verify the degrees of freedom. This ensures that we have not made some serious error in specifying the model for the program or in entering the data. Because there are n = 224 cases, we have DFT = n − 1 = 223. The three explanatory variables give DFM = p = 3 and DFE = n − p − 1 = 224 − 3 − 1 = 220.


The ANOVA F statistic is 18.86, with a P-value of 0.0001. Under the null hypothesis

H0: β1 = β2 = β3 = 0

the F statistic has an F(3, 220) distribution. According to this distribution, the chance of obtaining an F statistic of 18.86 or larger is 0.0001. We therefore conclude that at least one of the three regression coefficients for the high school grades is different from 0 in the population regression equation.

In the descriptive statistics that follow the ANOVA table we find that Root MSE is 0.6998. This value is the square root of the MSE given in the ANOVA table and is s, the estimate of the parameter σ of our model. The value of R2 is 0.20. That is, 20% of the observed variation in the GPA scores is explained by linear regression on high school grades. Although the P-value is very small, the model does not explain very much of the variation in GPA. Remember, a small P-value does not necessarily tell us that we have a large effect, particularly when the sample size is large.


Dependent Variable: GPA

Analysis of Variance

                    Sum of          Mean
Source      DF     Squares        Square      F Value      Prob>F

Model        3    27.71233       9.23744       18.861      0.0001
Error      220   107.75046       0.48977
C Total    223   135.46279

Root MSE     0.69984     R-Square     0.2046
Dep Mean     2.63522     Adj R-sq     0.1937
C.V.        26.55711

                     Parameter Estimates

              Parameter      Standard     T for H0:
Variable  DF   Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1   0.589877    0.29424324         2.005        0.0462
HSM        1   0.168567    0.03549214         4.749        0.0001
HSS        1   0.034316    0.03755888         0.914        0.3619
HSE        1   0.045102    0.03869585         1.166        0.2451

FIGURE 16.6 Multiple regression output for regression using high school grades to predict GPA.


From the Parameter Estimates section we obtain the fitted regression equation

predicted GPA = 0.590 + 0.169HSM + 0.034HSS + 0.045HSE

Recall that the t statistics for testing the regression coefficients are obtained by dividing the estimates by their standard errors. Thus, for the coefficient of HSM we obtain the t-value given in the output by calculating

t = b / SEb = 0.168567 / 0.03549214 = 4.749

The P-values appear in the last column. Note that these P-values are for the two-sided alternatives. HSM has a P-value of 0.0001, and we conclude that the regression coefficient for this explanatory variable is significantly different from 0. The P-values for the other explanatory variables (0.36 for HSS and 0.25 for HSE) do not achieve statistical significance.

Interpretation of results

The significance tests for the individual regression coefficients seem to contradict the impression obtained by examining the correlations in Figure 16.4 (page xxx). In that display we see that the correlation between GPA and HSS is 0.33 and the correlation between GPA and HSE is 0.29. The P-values for both of these correlations are 0.0001. In other words, if we used HSS alone in a regression to predict GPA, or if we used HSE alone, we would obtain statistically significant regression coefficients.

This phenomenon is not unusual in multiple regression analysis. Part of the explanation lies in the correlations between HSM and the other two explanatory variables. These are rather high (at least compared with the other correlations in Figure 16.4). The correlation between HSM and HSS is 0.58, and that between HSM and HSE is 0.45. Thus, when we have a regression model that contains all three high school grades as explanatory variables, there is considerable overlap of the predictive information contained in these variables. The significance tests for individual regression coefficients assess the significance of each predictor variable assuming that all other predictors are included in the regression equation. Given that we use a model with HSM and HSS as predictors, the coefficient of HSE is not statistically significant. Similarly, given that we have HSM and HSE in the model, HSS does not have a significant regression coefficient. HSM, however, adds significantly to our ability to predict GPA even after HSS and HSE are already in the model.

Unfortunately, we cannot conclude from this analysis that the pair of explanatory variables HSS and HSE contribute nothing significant to our model for predicting GPA once HSM is in the model. The impact of relations among the several explanatory variables on fitting models for the response is the most important new phenomenon encountered in moving from simple linear regression to multiple regression. We can only hint at the many complicated problems that arise.


Residuals

As in simple linear regression, we should always examine the residuals as an aid to determining whether the multiple regression model is appropriate for the data. Because there are several explanatory variables, we must examine several residual plots. It is usual to plot the residuals versus the predicted values ŷ and also versus each of the explanatory variables. Look for outliers, influential observations, evidence of a curved (rather than linear) relation, and anything else unusual. Again, we leave the task of making these plots as an exercise. The plots all appear to show more or less random variation around the center value of 0.

If the deviations in the model are normally distributed, the residuals should be normally distributed. Figure 16.7 presents a normal probability plot of the residuals. The distribution appears to be approximately normal. There are many other specialized plots that help detect departures from the multiple regression model. Discussion of these, however, is more than we can undertake in this chapter.

FIGURE 16.7 Normal probability plot of the residuals from the high school grade model (residuals plotted against z-scores). There are no important deviations from normality.

Refining the model

Because the variable HSS has the largest P-value of the three explanatory variables (see Figure 16.6) and therefore appears to contribute the least to our explanation of GPA, we rerun the regression using only HSM and HSE as explanatory variables. The SAS output appears in Figure 16.8. The F statistic indicates that we can reject the null hypothesis that the regression coefficients for the two explanatory variables are both 0. The P-value is still 0.0001. The value of R2 has dropped very slightly compared with our previous run, from 0.2046 to 0.2016. Dropping HSS from the model resulted in the loss of very little explanatory power. The measure s of variation about the fitted equation (Root MSE in the printout) is nearly identical for the two regressions, another indication that dropping HSS loses very little. The t statistics for the individual regression coefficients indicate that HSM is still clearly significant (P = 0.0001), while the coefficient of HSE is larger than before (1.747 versus 1.166) and approaches the traditional 0.05 level of significance (P = 0.082).

Comparison of the fitted equations for the two multiple regression analyses tells us something more about the intricacies of this procedure. For the first run we have

predicted GPA = 0.590 + 0.169HSM + 0.034HSS + 0.045HSE

whereas the second gives us

predicted GPA = 0.624 + 0.183HSM + 0.061HSE

Eliminating HSS from the model changes the regression coefficients for all of the remaining variables and the intercept. This phenomenon occurs quite generally in multiple regression. Individual regression coefficients, their standard errors, and significance tests are meaningful only when interpreted in the context of the other explanatory variables in the model.
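A quick illustration of this phenomenon (ours, assuming the `data` DataFrame and statsmodels as in the earlier sketches): fit the model with and without HSS and compare the coefficients.

```python
import statsmodels.api as sm

# Dropping HSS changes every remaining coefficient, not just HSS's.
full = sm.OLS(data["GPA"],
              sm.add_constant(data[["HSM", "HSS", "HSE"]])).fit()
reduced = sm.OLS(data["GPA"],
                 sm.add_constant(data[["HSM", "HSE"]])).fit()
print("with HSS:   ", full.params.round(3).to_dict())
print("without HSS:", reduced.params.round(3).to_dict())
```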

EXERCISES

16.11 One model for subpopulation means for the computer science study is described in Example 16.1 (page xxx) as

μGPA = β0 + β1HSM + β2HSS + β3HSE


Dependent Variable: GPA

Analysis of Variance

                    Sum of          Mean
Source      DF     Squares        Square      F Value      Prob>F

Model        2    27.30349      13.65175       27.894      0.0001
Error      221   108.15930       0.48941
C Total    223   135.46279

Root MSE     0.69958     R-Square     0.2016
Dep Mean     2.63522     Adj R-sq     0.1943
C.V.        26.54718

                     Parameter Estimates

              Parameter      Standard     T for H0:
Variable  DF   Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1   0.624228    0.29172204         2.140        0.0335
HSM        1   0.182654    0.03195581         5.716        0.0001
HSE        1   0.060670    0.03472914         1.747        0.0820

FIGURE 16.8 Multiple regression output for regression using HSM and HSE to predict GPA.


(a) Give the model for the subpopulation mean GPA for students having high school grade scores HSM = 9 (A–), HSS = 8 (B+), and HSE = 7 (B).

(b) Using the parameter estimates given in Figure 16.6 (page 929), calculate the estimate of this subpopulation mean. Then briefly explain in words what your numerical answer means.

16.12 Use the model given in the previous exercise to do the following:

(a) For students having high school grade scores HSM = 6 (B–), HSS = 7 (B), and HSE = 8 (B+), express the subpopulation mean in terms of the parameters βj.

(b) Calculate the estimate of this subpopulation mean using the bj given in Figure 16.6 (page 929). Briefly explain the meaning of the number you obtain.

16.13 Consider the regression problem of predicting GPA from the three high school grade variables. The computer output appears in Figure 16.6 (page 929).

(a) Use information given in the output to calculate a 95% confidence interval for the regression coefficient of HSM. Explain in words what this regression coefficient means in the context of this particular model.

(b) Do the same for the explanatory variable HSE.

16.14 Computer output for the regression run for predicting GPA from the high school grade variables HSM and HSE appears in Figure 16.8 (page 932). Answer (a) and (b) of the previous exercise for this regression. Why are the results for this exercise different from those that you calculated for the previous exercise?

16.15 Consider the regression problem of predicting GPA from the three high school grade variables. The computer output appears in Figure 16.6 (page 929).

(a) Write the estimated regression equation.

(b) What is the value of s, the estimate of σ?

(c) State the H0 and Ha tested by the ANOVA F statistic for this problem. After stating the hypotheses in symbols, explain them in words.

(d) What is the distribution of the F statistic under H0? What conclusion do you draw from the F test?

(e) What percent of the variation in GPA is explained by these three high school grade variables?

Regression on SAT scores

We now turn to the problem of predicting GPA using the two SAT scores. Figure 16.9 gives the output. The fitted model is

predicted GPA = 1.289 + 0.002283SATM − 0.000025SATV

The degrees of freedom are as expected: 2, 221, and 223. The F statistic is 7.476, with a P-value of 0.0007. We conclude that the regression coefficients

for SATM and SATV are not both 0. Recall that we obtained the P-value 0.0001 when we used high school grades to predict GPA. Both multiple regression equations are highly significant, but this obscures the fact that the two models have quite different explanatory power. For the SAT regression, R2 = 0.0634, whereas for the high school grades model even with only HSM and HSE (Figure 16.8), we have R2 = 0.2016, a value more than three times as large. Stating that we have a statistically significant result is quite different from saying that an effect is large or important.

Further examination of the output in Figure 16.9 reveals that the coefficient of SATM is significant (t = 3.44, P = 0.0007), and that for SATV is not (t = −0.04, P = 0.9684). For a complete analysis we should carefully examine the residuals. Also, we might want to run the analysis with SATM as the only explanatory variable.


Dependent Variable: GPA

Analysis of Variance

                    Sum of          Mean
Source      DF     Squares        Square      F Value      Prob>F

Model        2     8.58384       4.29192        7.476      0.0007
Error      221   126.87895       0.57411
C Total    223   135.46279

Root MSE     0.75770     R-Square     0.0634
Dep Mean     2.63522     Adj R-sq     0.0549
C.V.        28.75287

                     Parameter Estimates

              Parameter        Standard     T for H0:
Variable  DF   Estimate           Error   Parameter=0    Prob > |T|

INTERCEP   1   1.288677      0.37603684         3.427        0.0007
SATM       1   0.002283      0.00066291         3.444        0.0007
SATV       1  -0.000024562   0.00061847        -0.040        0.9684

FIGURE 16.9 Multiple regression output for regression using SAT scores to predict GPA.

Regression using all variables

We have seen that either the high school grades or the SAT scores give a highly significant regression equation. The mathematics component of each of these groups of explanatory variables appears to be the key predictor. Comparing the values of R2 for the two models indicates that high school grades are better predictors than SAT scores. Can we get a better prediction equation using all of the explanatory variables together in one multiple regression?


To address this question we run the regression with all five explanatory variables. The output appears in Figure 16.10. The F statistic is 11.69, with a P-value of 0.0001, so at least one of our explanatory variables has a nonzero regression coefficient. This result is not surprising given that we have already seen that HSM and SATM are strong predictors of GPA. The value of R2 is 0.2115, not much higher than the value of 0.2046 that we found for the high school grades regression.

Examination of the t statistics and the associated P-values for the individual regression coefficients reveals that HSM is the only one that is significant (P = 0.0003). That is, only HSM makes a significant contribution when it is added to a model that already has the other four explanatory variables. Once again it is important to understand that this result does not necessarily mean that the regression coefficients for the four other explanatory variables are all 0. Many statistical software packages provide the capability for testing whether a collection of regression coefficients in a multiple regression model are all 0. We use this approach to address two interesting questions about this set of data. We did not discuss such tests in the outline that opened this section, but the basic idea is quite simple.


Dependent Variable: GPA

Analysis of Variance

                    Sum of          Mean
Source      DF     Squares        Square      F Value      Prob>F

Model        5    28.64364       5.72873       11.691      0.0001
Error      218   106.81914       0.49000
C Total    223   135.46279

Root MSE     0.70000     R-Square     0.2115
Dep Mean     2.63522     Adj R-sq     0.1934
C.V.        26.56311

                     Parameter Estimates

              Parameter      Standard     T for H0:
Variable  DF   Estimate         Error   Parameter=0    Prob > |T|

INTERCEP   1   0.326719    0.39999643         0.817        0.4149
SATM       1   0.000944    0.00068566         1.376        0.1702
SATV       1  -0.000408    0.00059189        -0.689        0.4915
HSM        1   0.145961    0.03926097         3.718        0.0003
HSS        1   0.035905    0.03779841         0.950        0.3432
HSE        1   0.055293    0.03956869         1.397        0.1637

Test: SAT   Numerator:   0.4657    DF:   2    F value:  0.9503
            Denominator: 0.489996  DF: 218    Prob>F:   0.3882

Test: HS    Numerator:   6.6866    DF:   3    F value: 13.6462
            Denominator: 0.489996  DF: 218    Prob>F:   0.0001

FIGURE 16.10 Multiple regression output for regression using all variables to predict GPA.


Figure 16.11 gives the Excel and Minitab multiple regression output for this problem. Although the format and organization of output differ among software packages, the basic results that we need are easy to find.


Minitab

The regression equation is
GPA = 0.327 + 0.146 HSM + 0.0359 HSS + 0.0553 HSE + 0.000944 SATM - 0.000408 SATV

Predictor      Coef          StDev        T        P
Constant     0.3267        0.4000       0.82     0.415
HSM          0.14596       0.03926      3.72     0.000
HSS          0.03591       0.03780      0.95     0.343
HSE          0.05529       0.03957      1.40     0.164
SATM         0.0009436     0.0006857    1.38     0.170
SATV        -0.0004078     0.0005919   -0.69     0.492

S = 0.7000    R-Sq = 21.1%    R-Sq(adj) = 19.3%

Analysis of Variance

Source        DF    SS          MS        F        P
Regression     5     28.6436    5.7287    11.69    0.000
Error        218    106.8191    0.4900
Total        223    135.4628

[The Minitab panel of Figure 16.11 is transcribed above; the Excel panel is not reproduced here.]

FIGURE 16.11 Excel and Minitab multiple regression output for regression using all variables to predict GPA.
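For comparison, a minimal sketch of the same all-variables fit in Python with statsmodels; the file name csdata.csv and its layout are assumptions, and the variable names come from the case study.

    import pandas as pd
    import statsmodels.formula.api as smf

    csdata = pd.read_csv("csdata.csv")  # file name and layout assumed

    # Regress GPA on all five explanatory variables, as in Figures 16.10 and 16.11.
    fit = smf.ols("GPA ~ HSM + HSS + HSE + SATM + SATV", data=csdata).fit()
    print(fit.summary())         # coefficients, t statistics, R-square
    print(fit.mse_resid ** 0.5)  # root MSE; about 0.70 for these data

    # Partial F test that the two SAT coefficients are both 0
    # (the "Test: SAT" line in Figure 16.10).
    print(fit.f_test("SATM = 0, SATV = 0"))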


Test for a collection of regression coefficients

In the context of the multiple regression model with all five predictors, we ask first whether or not the coefficients for the two SAT scores are both 0. In other words, do the SAT scores add any significant predictive information to that already contained in the high school grades? To be fair, we also ask the complementary question: do the high school grades add any significant predictive information to that already contained in the SAT scores?

The answers are given in the last two parts of the output in Figure 16.10. For the first test we see that F = 0.9503. Under the null hypothesis that the two SAT coefficients are 0, this statistic has an F(2, 218) distribution and the P-value is 0.39. We conclude that the SAT scores are not significant predictors of GPA in a regression that already contains the high school scores as predictor variables. Recall that the model with just SAT scores has a highly significant F statistic. We now see that whatever predictive information is in the SAT scores can also be found in the high school grades. In this sense, the SAT scores are superfluous.

The test statistic for the three high school grade variables is F = 13.6462. Under the null hypothesis that these three regression coefficients are 0, the statistic has an F(3, 218) distribution and the P-value is 0.0001. We conclude that high school grades contain useful information for predicting GPA that is not contained in SAT scores.
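As an arithmetic check, both F statistics can be recovered from the error sums of squares of the fitted models, using the form sketched earlier. For the high school grades test, the reduced model is the SAT-only regression of Figure 16.9, so the numerator is (126.87895 − 106.81914)/3 = 6.6866 and F = 6.6866/0.489996 ≈ 13.65, matching the output. For the SAT test, the reduced model uses only the high school grades; from its R² of 0.2046, its error sum of squares is about (1 − 0.2046)(135.46279) ≈ 107.75, so the numerator is about (107.75 − 106.82)/2 ≈ 0.47 and F ≈ 0.95, again in agreement.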

Of course, our statistical analysis of these data does not imply that SAT scores are less useful than high school grades for predicting college grades for all groups of students. We have studied a select group of students, computer science majors, from a specific university. Generalizations to other situations are beyond the scope of inference based on these data alone.


CHAPTER REVIEW

We can use a multiple linear regression model to describe the relationship between a single response variable y and p explanatory variables x1, x2, . . . , xp when

• the mean response μy = β0 + β1x1 + β2x2 + . . . + βpxp is a linear function of the explanatory variables, and

• the residuals are independent and normally distributed with mean 0 and standard deviation σ.

The parameters of the model are β0, β1, β2, . . . , βp, and σ. The β's are estimated by b0, b1, b2, . . . , bp, which are obtained by the method of least squares. The parameter σ is estimated by

s = √MSE = √( Σ residuali² / (n − p − 1) )


where the residuals are

residuali = yi − ŷi

A level C confidence interval for βj is

bj ± t* SEbj

where SEbj is the standard error of bj and t* is the value for the t(n − p − 1) density curve with area C between −t* and t*.
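For example, using the Figure 16.10 output for the coefficient of HSM: bj = 0.145961, SEbj = 0.03926097, and n − p − 1 = 218, so for 95% confidence t* ≈ 1.971 and the interval is 0.1460 ± (1.971)(0.0393), or roughly 0.069 to 0.223.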

The test of the hypothesis H0: βj = 0 is based on the t statistic

t = bj / SEbj

and the t(n − p − 1) distribution.

The estimate bj of βj and the test and confidence interval for βj are all based on a specific multiple linear regression model. The results of all of these procedures change if other explanatory variables are added to or deleted from the model.

The ANOVA table for a multiple linear regression gives the degrees of freedom, sum of squares, and mean squares for the model, error, and total sources of variation. The ANOVA F statistic is the ratio MSM/MSE and is used to test the null hypothesis

H0: β1 = β2 = . . . = βp = 0

If H0 is true, this statistic has an F(p, n − p − 1) distribution.

The squared multiple correlation is given by the expression

R² = SSM/SST

and is interpreted as the proportion of the variability in the response variable y that is explained by the explanatory variables x1, x2, . . . , xp in the multiple linear regression.
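As a quick check against the output in Figure 16.10: R² = SSM/SST = 28.64364/135.46279 ≈ 0.2115, the ANOVA F statistic is MSM/MSE = 5.72873/0.49000 ≈ 11.69, and s = √MSE = √0.49000 = 0.70, all agreeing with the printed values.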



CHAPTER 16 REVIEW EXERCISES

16.16 Figure 16.9 gives the output for the regression analysis in which the two SAT scores are used to predict GPA.

(a) Write the estimated regression equation.

(b) What is the value of s, the estimate of σ?

(c) State the H0 and Ha tested by the ANOVA F statistic for this problem. After stating the hypotheses in symbols, explain them in words.


(d) What is the distribution of the F statistic under H0? What conclusion do you draw from the F test?

(e) What percent of the variation in GPA is explained by these SAT score variables?

The following six exercises relate to the case study described in this chapter. They require use of the CSDATA file in the Datasets folder on this CD.
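If your software reads CSV files, a minimal Python template for these regressions might look like the sketch below; the file name csdata.csv and its layout are assumptions, and the variable names are those used in the case study.

    import pandas as pd
    import matplotlib.pyplot as plt
    import statsmodels.formula.api as smf

    csdata = pd.read_csv("csdata.csv")  # adjust to the actual CSDATA format

    # For example, for Exercise 16.17: regress GPA on the high school grades.
    fit = smf.ols("GPA ~ HSM + HSS + HSE", data=csdata).fit()
    resid = fit.resid             # stored residuals
    predicted = fit.fittedvalues  # predicted GPA

    # Plot residuals versus each predictor and versus predicted GPA.
    for var in ["HSM", "HSS", "HSE"]:
        plt.scatter(csdata[var], resid)
        plt.xlabel(var)
        plt.ylabel("Residual")
        plt.show()
    plt.scatter(predicted, resid)
    plt.xlabel("Predicted GPA")
    plt.ylabel("Residual")
    plt.show()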

16.17 Regress GPA on the three high school grade variables. Calculate and store the residuals from this regression. Plot the residuals versus each of the three predictors and versus the predicted value of GPA. Are there any unusual points or patterns in these four plots?

16.18 Use the two SAT scores in a multiple regression to predict GPA. Calculate and store the residuals. Plot the residuals versus each of the explanatory variables and versus the predicted GPA. Describe the plots.

16.19 It appears that the mathematics explanatory variables are strong predictors of GPA in the computer science study. Run a multiple regression using HSM and SATM to predict GPA.

(a) Give the fitted regression equation.

(b) State the H0 and Ha tested by the ANOVA F statistic, and explain their meaning in plain language. Report the value of the F statistic, its P-value, and your conclusion.

(c) Give 95% confidence intervals for the regression coefficients of HSM and SATM. Does either of these intervals include the point 0?

(d) Report the t statistics and P-values for the tests of the regression coefficients of HSM and SATM. What conclusions do you draw from these tests?

(e) What is the value of s, the estimate of σ?

(f) What percent of the variation in GPA is explained by HSM and SATM in your model?

16.20 How well do verbal variables predict the performance of computer science students? Perform a multiple regression analysis to predict GPA from HSE and SATV. Summarize the results and compare them with those obtained in the previous exercise. In what ways do the regression results indicate that the mathematics variables are better predictors?

16.21 The variable SEX has the value 1 for males and 2 for females. Create a data set containing the values for males only. Run a multiple regression analysis for predicting GPA from the three high school grade variables for this group. Using the case study in the text as a guide, interpret the results and state what conclusions can be drawn from this analysis. In what way (if any) do the results for males alone differ from those for all students?

16.22 Refer to the previous exercise. Perform the analysis using the data for females only. Are there any important differences between female and male students in predicting GPA?


The next two exercises require use of the CONCEPT file in the Datasets folder on this CD.

16.23 Let us look at the role of positive self-concept about one's physical appearance (variable C3). We include IQ in the model because it is a known important predictor of GPA.

(a) Report the fitted regression model

predicted GPA = b0 + b1IQ + b2C3

with its R², and report the t statistic for significance of self-concept about one's physical appearance with its P-value. Does C3 contribute significantly to explaining GPA when added to IQ? How much does adding C3 to the model raise R²?

(b) Now start with the model that includes overall self-concept SC along with IQ. Does C3 help explain GPA significantly better? Does it raise R² enough to be of practical value? In answering these questions, report the fitted regression model

predicted GPA = b0 + b1IQ + b2C3 + b3SC

with its R² and the t statistic for significance of C3 with its P-value.

(c) Explain carefully in words why the coefficient b2 for the variable C3 takes quite different values in the regressions of parts (a) and (b). Then explain simply how it can happen that C3 is a useful explanatory variable in part (a) but worthless in part (b).

16.24 A reasonable model explains GPA using IQ, C1 (behavior self-concept), and C5 (popularity self-concept). These three explanatory variables are all significant in the presence of the other two, and no other explanatory variable is significant when added to these three. (You do not have to verify these statements.) Let's use this model.

(a) What is the fitted regression model, its R², and the standard deviation s about the fitted model? What GPA does this model predict for a student with IQ 109, C1 score 13, and C5 score 8?

(b) If both C1 and C5 could be held constant, how much would GPA increase for each additional point of IQ score, according to the fitted model? Give a 95% confidence interval for the mean increase in GPA in the entire population if IQ could be increased by 1 point while holding C1 and C5 constant.

(c) Compute the residuals for this model. Plot the residuals against the predicted values and also make a normal probability plot. Which observation (value of OBS) produces the most extreme residual? Circle this observation in both of your plots of the residuals. For which individual variables does this student have very unusual values? What are these values? (Look at all the variables, not just those in the model.)

(d) Repeat part (a) with this one observation removed. How much did this one student affect the fitted model and the prediction?

The following three exercises use the CHEESE file from the Datasets folder on this CD.

16.25 Carry out a multiple regression using Acetic and H2S to predict Taste. Summarize the results of your analysis. Compare the statistical significance of Acetic in this model with its significance in the model with Acetic alone as a predictor (Exercise 16.5, page xxx). Which model do you prefer? Give a simple explanation for the fact that Acetic alone appears to be a good predictor of Taste, but with H2S in the model, it is not.

16.26 Carry out a multiple regression using H2S and Lactic to predict Taste. Comparing the results of this analysis with the simple linear regressions using each of these explanatory variables alone, it is evident that a better result is obtained by using both predictors in a model. Support this statement with explicit information obtained from your analysis.

16.27 Use the three explanatory variables Acetic, H2S, and Lactic in a multiple regression to predict Taste. Write a short summary of your results, including an examination of the residuals. Based on all of the regression analyses you have carried out on these data, which model do you prefer and why?


NOTES AND DATA SOURCES

1. Results of the study are reported in P. F. Campbell and G. P. McCabe, "Predicting the success of freshmen in a computer science major," Communications of the ACM, 27 (1984), pp. 1108–1113.

2. These data are based on experiments performed by G. T. Lloyd and E. H. Ramshaw of the CSIRO Division of Food Research, Victoria, Australia. Some results of the statistical analyses of these data are given in G. P. McCabe, L. McCabe, and A. Miller, "Analysis of taste and chemical composition of cheddar cheese, 1982–83 experiments," CSIRO Division of Mathematics and Statistics Consulting Report VT85/6, and in I. Barlow et al., "Correlations and changes in flavour and chemical parameters of cheddar cheeses during maturation," Australian Journal of Dairy Technology, 44 (1989), pp. 7–18.

3. This example is based on summaries in Charles Fombrun and Mark Shanley, "What's in a name? Reputation building and corporate strategy," Academy of Management Journal, 33 (1990), pp. 233–258.

4. Data provided by Paul Waldsmith from experiments conducted in Don Corrigan's laboratory at Purdue University.
