section 12.3

HAWKES LEARNING SYSTEMS

Students Matter. Success Counts.

Copyright © 2013 by Hawkes Learning

Systems/Quant Systems, Inc.

All rights reserved.

Section 12.3

Regression Analysis






Objectives

o Construct a prediction interval for an individual value of y.

o Construct confidence intervals for the slope and the y-intercept of a regression line.






Regression Analysis

Residual

A residual is the difference between the actual value of y from the original data and the predicted value, ŷ, found using the regression line, given by

\[ \text{Residual} = y - \hat{y} \]

where y is the observed value of the response variable and ŷ is the predicted value of y using the least-squares regression model.






Example 12.15: Calculating Residuals Using an Estimated Regression Equation

The following table gives data from a local school district on children’s ages (x) and reading levels (y). For these data, a reading level of 4.3 would indicate that the child’s reading level is 3/10 of the year through the fourth grade. The children’s ages are given in years.

Ages and Reading Levels

Age (in Years), x    6    7    8    9    10   11   12   13   14   15
Reading Level, y     1.3  2.2  3.7  4.1  4.9  5.2  6.0  7.1  8.5  9.7






Example 12.15: Calculating Residuals Using an Estimated Regression Equation (cont.)

Using a TI-83/84 Plus calculator to determine the linear regression model, we calculate the regression line to be ŷ = −3.811 + 0.865x. Note that r ≈ 0.989, which is greater than the critical value at the 0.05 level of significance, r_{0.05} = 0.632. Furthermore, the following scatter plot depicts the linear pattern of the data values. Therefore, it is appropriate to use this linear regression model to make predictions.
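For readers without a TI-83/84 Plus, the fit can be checked with a short Python sketch. It assumes the standard least-squares formulas for the slope, intercept, and correlation coefficient; the function name fit_line is illustrative only, and the critical value 0.632 is simply the one quoted above.

```python
import math

# Ages (x) and reading levels (y) from the table in Example 12.15.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
levels = [1.3, 2.2, 3.7, 4.1, 4.9, 5.2, 6.0, 7.1, 8.5, 9.7]


def fit_line(x, y):
    """Return (intercept, slope, r) for the least-squares line y-hat = b0 + b1*x."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sums of squared deviations and cross-deviations about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    syy = sum((yi - y_bar) ** 2 for yi in y)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    r = sxy / math.sqrt(sxx * syy)
    return intercept, slope, r


b0, b1, r = fit_line(ages, levels)
print(f"y-hat = {b0:.3f} + {b1:.3f}x, r = {r:.3f}")

# Compare |r| with the critical value quoted in the text for n = 10, alpha = 0.05.
print("linear relationship significant:", abs(r) > 0.632)
```

The rounded coefficients should agree with the regression line stated in the example.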







Use the regression equation to calculate an estimate, ŷ, for each value of x, and then use the estimate to calculate the residual for each value of y.







Solution

We can use a TI-83/84 Plus calculator to perform all of the necessary calculations at once. Age is the explanatory variable, x, and reading level is the response variable, y.

• Press STAT.
• Select option 1:Edit.
• Enter the ages in L1 and the reading levels in L2.
• Use the arrow keys to highlight L3 and enter the formula -3.811+0.865*L1. This will calculate the predicted y-value for each x-value.
• Highlight L4 and enter the formula L2-L3. This formula will calculate each of the residuals.

The results will be as follows.







Predicted Values and Residuals

Age (in Years), x   Reading Level, y   Predicted Value, ŷ   Residual, y − ŷ
 6                  1.3                1.379                −0.079
 7                  2.2                2.244                −0.044
 8                  3.7                3.109                 0.591
 9                  4.1                3.974                 0.126
10                  4.9                4.839                 0.061
11                  5.2                5.704                −0.504
12                  6.0                6.569                −0.569
13                  7.1                7.434                −0.334
14                  8.5                8.299                 0.201
15                  9.7                9.164                 0.536
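The same list arithmetic can be sketched in Python as a cross-check on the calculator steps. The code uses the rounded regression equation from the example; the helper name predict is an illustrative assumption.

```python
# Ages (x) and reading levels (y) from Example 12.15.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
levels = [1.3, 2.2, 3.7, 4.1, 4.9, 5.2, 6.0, 7.1, 8.5, 9.7]


def predict(x):
    """Predicted reading level from the rounded regression equation in the text."""
    return -3.811 + 0.865 * x


# Equivalent of the L3 list: predicted value for each age.
predicted = [round(predict(x), 3) for x in ages]
# Equivalent of the L4 list: residual = observed - predicted.
residuals = [round(y - y_hat, 3) for y, y_hat in zip(levels, predicted)]

for x, y, y_hat, e in zip(ages, levels, predicted, residuals):
    print(f"{x:>3} {y:>5.1f} {y_hat:>7.3f} {e:>7.3f}")
```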






Regression Analysis

Sum of Squared Errors (SSE)

The sum of squared errors (SSE) for a regression line is the sum of the squares of the residuals, given by

\[ \text{SSE} = \sum \left( y_i - \hat{y}_i \right)^2 \]

where yi is the ith observed value of the response variable and ŷi is the predicted value of yi using the least-squares regression model.






Example 12.16: Calculating the Sum of Squared Errors

Calculate the sum of squared errors, SSE, for the data on children’s ages and reading levels from the previous example.

Solution

Using the values we calculated in the previous example, we begin by squaring each error as shown in the following table.






Example 12.16: Calculating the Sum of Squared Errors (cont.)

Squared Errors

Age (in Years), x   Reading Level, y   Predicted Value, ŷ   Residual, y − ŷ   Squared Error, (y − ŷ)²
 6                  1.3                1.379                −0.079            0.006241
 7                  2.2                2.244                −0.044            0.001936
 8                  3.7                3.109                 0.591            0.349281
 9                  4.1                3.974                 0.126            0.015876
10                  4.9                4.839                 0.061            0.003721
11                  5.2                5.704                −0.504            0.254016
12                  6.0                6.569                −0.569            0.323761
13                  7.1                7.434                −0.334            0.111556
14                  8.5                8.299                 0.201            0.040401
15                  9.7                9.164                 0.536            0.287296







The last column lists the squares of the residual values. The sum of the squared errors is the sum of the values in this last column. Thus, SSE ≈ 1.394.
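As a quick check of the column total, the sum can be sketched in Python. The function name sum_of_squared_errors is illustrative, and the coefficients are the rounded ones from the example.

```python
# Ages (x) and reading levels (y) from Example 12.15.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
levels = [1.3, 2.2, 3.7, 4.1, 4.9, 5.2, 6.0, 7.1, 8.5, 9.7]


def sum_of_squared_errors(x, y, intercept, slope):
    """Sum of (y_i - y_hat_i)^2 for the line y-hat = intercept + slope * x."""
    return sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))


# Rounded regression coefficients from the text.
sse = sum_of_squared_errors(ages, levels, -3.811, 0.865)
print(f"SSE = {sse:.3f}")
```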






Regression Analysis

Standard Error of Estimate

The standard error of estimate, which is a measure of how much the sample data points deviate from the regression line, is given by

\[ S_e = \sqrt{\frac{\sum \left( y_i - \hat{y}_i \right)^2}{n - 2}} = \sqrt{\frac{\text{SSE}}{n - 2}} \]






Regression Analysis

Standard Error of Estimate (cont.)

where yi is the ith observed value of the response variable, ŷi is the predicted value of yi using the least-squares regression model, n is the number of data pairs in the sample, and SSE is the sum of squared errors.






Example 12.17: Calculating the Standard Error of Estimate Using a TI-83/84 Plus Calculator

Calculate the standard error of estimate for the data on children’s ages and reading levels from Example 12.15 (repeated in the following table).







Example 12.17: Calculating the Standard Error of Estimate Using a TI-83/84 Plus Calculator (cont.)

Solution

Begin as follows.

• Press STAT.
• Choose 1:Edit.
• Enter the age data into L1 and the reading-level data into L2.
• Press STAT.
• Choose TESTS.
• Choose option F:LinRegTTest.
• Enter L1 for the Xlist and L2 for the Ylist. The value entered for the option Freq should be 1.
• Choose ≠0 for the alternative hypothesis to test the significance of the linear relationship.
• Enter the regression equation into RegEQ if you have already calculated it. If not, you may leave this blank.
• Choose Calculate.
• Press ENTER.







The results, shown in the following screenshots, include the t-test statistic for testing the significance of the linear relationship. The calculator also gives the p-value for that hypothesis test and the number of degrees of freedom. The slope and y-intercept of the regression line are also given. Note that the regression line is given in the form y = a + bx, so a is the y-intercept and b is the slope. This is the reverse of the LinReg(ax+b) function, where a is the slope and b is the y-intercept.







The last two values given are the coefficient of determination and the correlation coefficient. The standard error of estimate is s, the third-to-last value given.







Thus, the standard error of estimate for the data on ages and reading levels is Se ≈ 0.417. Since this value is close to 0, we can conclude that the data points do not deviate very much from the regression line.
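The formula Se = √(SSE / (n − 2)) can also be evaluated directly, as in the Python sketch below. It uses the rounded regression coefficients from Example 12.15, so digits beyond the third decimal place may differ slightly from the calculator, which works with unrounded coefficients. The function name is an illustrative assumption.

```python
import math

# Ages (x) and reading levels (y) from Example 12.15.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
levels = [1.3, 2.2, 3.7, 4.1, 4.9, 5.2, 6.0, 7.1, 8.5, 9.7]


def standard_error_of_estimate(x, y, intercept, slope):
    """S_e = sqrt(SSE / (n - 2)) for the line y-hat = intercept + slope * x."""
    n = len(x)
    sse = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return math.sqrt(sse / (n - 2))


s_e = standard_error_of_estimate(ages, levels, -3.811, 0.865)
print(f"S_e = {s_e:.3f}")
```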






Prediction Interval for an Individual y-Value

Prediction Interval

A prediction interval is a confidence interval for an individual value of the response variable, y, at a given fixed value of the explanatory variable, x.







Margin of Error of a Prediction Interval for an Individual y-Value

The margin of error of a prediction interval for an individual value of the response variable, y, is given by

\[ E = t_{\alpha/2} \, S_e \sqrt{1 + \frac{1}{n} + \frac{n \left( x_0 - \bar{x} \right)^2}{n \sum x_i^2 - \left( \sum x_i \right)^2}} \]







Margin of Error of a Prediction Interval for an Individual y-Value (cont.)

where

t_{α/2} is the critical value for the level of confidence, c = 1 − α, such that the area under the t-distribution with n − 2 degrees of freedom to the right of t_{α/2} is equal to α/2,

Se is the standard error of estimate,

n is the number of data pairs in the sample,







Margin of Error of a Prediction Interval for an Individual y-Value (cont.)

x0 is the fixed value of the explanatory variable, x,

x̄ is the mean of the x-values for the data points in the sample, and

xi is the ith value of the explanatory variable.








The prediction interval for an individual value of the response variable, y, is given by

\[ \hat{y} - E < y < \hat{y} + E \quad \text{or} \quad \left( \hat{y} - E,\ \hat{y} + E \right) \]

where ŷ is the predicted value of the response variable, y, when x = x0 and E is the margin of error.






Example 12.18: Constructing a Prediction Interval for an Individual y-Value

Construct a 95% prediction interval for the reading level of a child who is 8 years old. Use the data from Example 12.15 on children’s ages and reading levels as the sample data (repeated in the following table).







Example 12.18: Constructing a Prediction Interval for an Individual y-Value (cont.)

Solution

Neither a TI-83/84 Plus calculator nor Microsoft Excel will directly calculate a prediction interval (see footnote 1), so we must calculate the margin of error by hand and use this value to construct the prediction interval.

Step 1: Find the regression equation for the sample data. We know from previous examples that the regression equation is as follows.

\[ \hat{y} = -3.811 + 0.865x \]







Step 2: Use the regression equation to calculate the point estimate, ŷ, for the given value of x. In this example, x = 8. Thus, we have the following.

\[ \hat{y} = -3.811 + 0.865x = -3.811 + 0.865(8) = 3.109 \]

Step 3: Calculate the sample statistics necessary to calculate the margin of error.

1 However, many statistical software packages, such as Minitab, will directly calculate a prediction interval.







Using a TI-83/84 Plus calculator, we can enter the values for age in L1 and the values for reading level in L2. Next, press STAT, select CALC, and then choose option 2:2-Var Stats. This will give us many of the statistics we need.

\[ \bar{x} = 10.5, \quad \sum x_i = 105, \quad \sum x_i^2 = 1185, \quad n = 10 \]







Next, recall that we found that Se ≈ 0.417442 in the previous example. This value was also found using a TI-83/84 Plus calculator. Lastly, using the t-distribution table or appropriate technology, we find the critical value for this interval for the t-distribution with n − 2 = 10 − 2 = 8 degrees of freedom.

\[ t_{\alpha/2} = t_{0.05/2} = t_{0.025} = 2.306 \]







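Step 4: Find the margin of error. Substituting the necessary statistics into the formula for the margin of error, we obtain the following.

\[
\begin{aligned}
E &= t_{\alpha/2} \, S_e \sqrt{1 + \frac{1}{n} + \frac{n \left( x_0 - \bar{x} \right)^2}{n \sum x_i^2 - \left( \sum x_i \right)^2}} \\
  &= (2.306)(0.417442) \sqrt{1 + \frac{1}{10} + \frac{10 (8 - 10.5)^2}{10(1185) - (105)^2}} \\
  &\approx 1.043793
\end{aligned}
\]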







Step 5: Subtract the margin of error from and add the margin of error to the point estimate. Subtracting the margin of error from the point estimate of ŷ = 3.109 gives us the lower endpoint for the prediction interval.

\[ \text{Lower endpoint: } \hat{y} - E = 3.109 - 1.043793 \approx 2.065 \]







By adding the margin of error to the point estimate, we obtain the upper endpoint for the prediction interval as follows.

\[ \text{Upper endpoint: } \hat{y} + E = 3.109 + 1.043793 \approx 4.153 \]







Thus, the 95% prediction interval for the individual y-value ranges from 2.065 to 4.153. The prediction interval can be written mathematically using either inequality symbols or interval notation, as shown below.

\[ 2.065 < y < 4.153 \quad \text{or} \quad (2.065,\ 4.153) \]







Thus, for an 8-year-old child, we can be 95% confident that he or she would have a reading level between 2.065 and 4.153, or be reading between the second and fourth grade levels.
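The hand calculation in Steps 2 through 5 can be retraced with a short Python sketch. The critical value 2.306 and the standard error of estimate 0.417442 are taken directly from the example rather than computed, and the function name prediction_margin is an illustrative assumption.

```python
import math


def prediction_margin(t_crit, s_e, n, x0, sum_x, sum_x2):
    """Margin of error E for a prediction interval at x = x0."""
    x_bar = sum_x / n
    radicand = 1 + 1 / n + n * (x0 - x_bar) ** 2 / (n * sum_x2 - sum_x ** 2)
    return t_crit * s_e * math.sqrt(radicand)


# Ages from Example 12.15; only the x-values enter the margin of error.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
n = len(ages)
sum_x = sum(ages)                    # 105
sum_x2 = sum(a * a for a in ages)    # 1185

x0 = 8
y_hat = -3.811 + 0.865 * x0          # point estimate from the regression line

# t_{0.025} with 8 degrees of freedom and S_e, both as quoted in the example.
E = prediction_margin(2.306, 0.417442, n, x0, sum_x, sum_x2)
lower, upper = y_hat - E, y_hat + E

print(f"E = {E:.6f}")
print(f"95% prediction interval: ({lower:.3f}, {upper:.3f})")
```

Changing x0 shows how the interval widens as the chosen age moves away from the mean age of 10.5.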






Example 12.19: Constructing Confidence Intervals for β1 and β0 Using Microsoft Excel

Construct 95% confidence intervals for the slope, β1, and the y-intercept, β0, of the regression equation for age and reading level. Use the sample data from Example 12.15 (repeated in the following table).







Example 12.19: Constructing Confidence Intervals for β1 and β0 Using Microsoft Excel (cont.)

Solution

Begin by entering the sample data into Microsoft Excel as shown in the following screenshot.







Under the Data tab, choose Data Analysis. Select Regression from the options listed. Enter the necessary information into the Regression menu as shown in the following screenshot. Click OK.







The results, shown in the following screenshot, provide an abundance of information, much of which we have discussed throughout this chapter.








Multiple R is the absolute value of the correlation coefficient, |r|.

R Square is the coefficient of determination, r².

Standard Error is the standard error of estimate, Se.

The ANOVA table will be discussed in the next section, since it is more meaningful when discussing more than one explanatory variable. However, it does contain a few of the important values we have discussed so far in this section.







The intersection of the Residual row and the SS column is the sum of squared errors, SSE.

The Coefficients column gives the values for the coefficients, that is, the y-intercept and slope, of the regression line.

The Lower 95.0% and Upper 95.0% columns give the lower and upper endpoints of the 95% confidence intervals for the y-intercept and slope.







The lower and upper endpoints of the 95% confidence intervals for the y-intercept and slope are the values we are interested in for this example.

The row labeled Intercept is the row for the values corresponding to the y-intercept. Notice that the first value in this row is b0 ≈ −3.811. The last two values in this row are the lower and upper endpoints for a 95% confidence interval for the y-intercept of the regression line, β0. Thus, the 95% confidence interval for β0 can be written as follows.

\[ -4.965 < \beta_0 < -2.657 \quad \text{or} \quad (-4.965,\ -2.657) \]

The row labeled Age is the row for the values corresponding to the slope of the regression line. It is labeled Age instead of Slope because it is possible to have more than one explanatory variable, in which case there would be a separate row for each variable, labeled with the variable’s name.







The first value in this row is b1 ≈ 0.865. The last two values in this row are the lower and upper endpoints for a 95% confidence interval for the slope of the regression line, β1. Thus, the 95% confidence interval for β1 can be written as follows.

\[ 0.759 < \beta_1 < 0.971 \quad \text{or} \quad (0.759,\ 0.971) \]
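For readers who want to check the Excel endpoints without a spreadsheet, here is a minimal Python sketch. It assumes the usual textbook standard-error formulas for simple linear regression, SE(b1) = Se / √Sxx and SE(b0) = Se √(1/n + x̄² / Sxx), which are not stated in this section, and it does not claim to reproduce Excel's internal computation. The critical value 2.306 is the one from Example 12.18, and the function name is illustrative.

```python
import math

# Ages (x) and reading levels (y) from Example 12.15.
ages = [6, 7, 8, 9, 10, 11, 12, 13, 14, 15]
levels = [1.3, 2.2, 3.7, 4.1, 4.9, 5.2, 6.0, 7.1, 8.5, 9.7]

T_CRIT = 2.306  # t_{0.025} with n - 2 = 8 degrees of freedom, from Example 12.18


def coefficient_intervals(x, y, t_crit):
    """Return ((b0_low, b0_high), (b1_low, b1_high)) for the fitted line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))

    # Unrounded least-squares estimates of the slope and y-intercept.
    b1 = sxy / sxx
    b0 = y_bar - b1 * x_bar

    # Standard error of estimate, S_e = sqrt(SSE / (n - 2)).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))

    # Assumed textbook standard errors of the two coefficients.
    se_b1 = s_e / math.sqrt(sxx)
    se_b0 = s_e * math.sqrt(1 / n + x_bar ** 2 / sxx)

    intercept_ci = (b0 - t_crit * se_b0, b0 + t_crit * se_b0)
    slope_ci = (b1 - t_crit * se_b1, b1 + t_crit * se_b1)
    return intercept_ci, slope_ci


intercept_ci, slope_ci = coefficient_intervals(ages, levels, T_CRIT)
print(f"intercept: ({intercept_ci[0]:.3f}, {intercept_ci[1]:.3f})")
print(f"slope:     ({slope_ci[0]:.3f}, {slope_ci[1]:.3f})")
```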
