statistics - site.iugaza.edu.pssite.iugaza.edu.ps/mriffi/files/2018/02/ch14.pdf · 14.1 testing the...

Copyright © 2017, 2013, 2010 Pearson Education, Inc. All Rights Reserved

STATISTICS: INFORMED DECISIONS USING DATA, Fifth Edition

Chapter 14

Inference on the Least-Squares Regression Model and Multiple Regression


14.1 Testing the Significance of the Least-Squares Regression Model

Learning Objectives

1. State the requirements of the least-squares regression model

2. Compute the standard error of the estimate

3. Verify that the residuals are normally distributed

4. Conduct inference on the slope of the least-squares regression model

5. Construct a confidence interval about the slope of the least-squares regression model


14.1.1 State the Requirements of the Least-Squares Regression Model

Requirement 1 for Inference on the Least-Squares Regression Model

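For any particular value of the explanatory variable x, the mean of the corresponding responses in the population depends linearly on x. That is,

$$\mu_{y|x} = \beta_0 + \beta_1 x$$

for some numbers β0 and β1, where μy|x is the population mean response when the value of the explanatory variable is x.

Requirement 2 for Inference on the Least-Squares Regression Model

The response variable is normally distributed with mean $\mu_{y|x} = \beta_0 + \beta_1 x$ and standard deviation σ.
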

“In Other Words”

When doing inference on the least-squares regression model, we require (1) for any explanatory variable, x, the mean of the response variable, y, depends on the value of x through a linear equation, and (2) the response variable, y, is normally distributed with a constant standard deviation, σ. The mean increases/ decreases at a constant rate depending on the slope, while the standard deviation remains constant.



A large value of σ, the population standard deviation, indicates that the data are widely dispersed about the regression line, and a small value of σ indicates that the data lie fairly close to the regression line.



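The least-squares regression model is given by

$$y_i = \beta_0 + \beta_1 x_i + \varepsilon_i$$

where

yi is the value of the response variable for the ith individual

xi is the value of the explanatory variable for the ith individual

β0 and β1 are the parameters to be estimated based on sample data

εi is a random error term that is normally distributed with mean 0 and standard deviation σ; the error terms are independent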


14.1.2 Compute the Standard Error of the Estimate

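The standard error of the estimate, se, is found using the formula

$$s_e = \sqrt{\frac{\sum (y_i - \hat{y}_i)^2}{n - 2}} = \sqrt{\frac{\sum \text{residuals}^2}{n - 2}}$$
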

Parallel Example 2: Compute the Standard Error

Compute the standard error of the estimate for the drilling data which is presented on the next slide.



Depth at Which Drilling Begins, x (in feet)    Time to Drill 5 Feet, y (in minutes)
 35                                            5.88
 50                                            5.99
 75                                            6.74
 95                                            6.10
120                                            7.47
130                                            6.93
145                                            6.42
155                                            7.97
160                                            7.92
175                                            7.62
185                                            6.89
190                                            7.90



Step 2, 3: The predicted values as well as the residuals for the 12 observations are given in the table on the next slide



Solution

Step 4: We find the sum of the squared residuals by summing the last column of the table:



CAUTION!

Be sure to divide by n − 2 when computing the standard error of the estimate.
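The computation can be sketched in a few lines of Python. This is a minimal illustration using the drilling data from the table above; the function names `least_squares` and `standard_error_of_estimate` are our own and do not come from the text.

```python
import math

# Drilling data from the slides: depth (feet) and time to drill 5 feet (minutes).
depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]

def least_squares(x, y):
    """Return (intercept b0, slope b1) of the least-squares regression line."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = s_xy / s_xx
    return y_bar - b1 * x_bar, b1

def standard_error_of_estimate(x, y):
    """s_e = sqrt(sum of squared residuals / (n - 2))."""
    b0, b1 = least_squares(x, y)
    residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
    sse = sum(e ** 2 for e in residuals)
    return math.sqrt(sse / (len(x) - 2))  # divide by n - 2, not n

b0, b1 = least_squares(depth, time)
s_e = standard_error_of_estimate(depth, time)
print(f"y-hat = {b1:.4f}x + {b0:.4f}, s_e = {s_e:.4f}")
```

Dividing by n − 2 rather than n reflects the two parameters (slope and intercept) estimated from the sample.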


14.1.3 Verify That the Residuals Are Normally Distributed

Parallel Example 4: Verifying That the Residuals Are Normally Distributed

Verify that the residuals from the drilling example are normally distributed.
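One way to check normality numerically is to correlate the sorted residuals with their expected z-scores, which is what a normal probability plot displays. The sketch below assumes the plotting position f_i = (i − 0.375)/(n + 0.25); that choice, and the helper name `pearson`, are assumptions made for illustration rather than something stated on the slides.

```python
import math
from statistics import NormalDist

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]
n = len(depth)

# Fit the least-squares line and collect the residuals in increasing order.
x_bar, y_bar = sum(depth) / n, sum(time) / n
s_xx = sum((x - x_bar) ** 2 for x in depth)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, time)) / s_xx
b0 = y_bar - b1 * x_bar
residuals = sorted(y - (b0 + b1 * x) for x, y in zip(depth, time))

# Expected z-scores for a normal probability plot (assumed plotting position).
z_scores = [NormalDist().inv_cdf((i - 0.375) / (n + 0.25)) for i in range(1, n + 1)]

def pearson(u, v):
    """Linear correlation coefficient between two equal-length lists."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = math.sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return num / den

r = pearson(residuals, z_scores)
print(f"correlation between residuals and expected z-scores: {r:.3f}")
```

A correlation close to 1 means the plotted points lie close to a straight line, which is consistent with normally distributed residuals. The value should be compared with the critical value for the sample size given in the text's table.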




14.1.4 Conduct Inference on the Slope of the Least-Squares Regression Model

Hypothesis Test Regarding the Slope Coefficient, β1

To test whether two quantitative variables are linearly related, we use the following steps provided that

1. the sample is obtained using random sampling.

2. the residuals are normally distributed with constant error variance.




Step 1: Determine the null and alternative hypotheses. The hypotheses can be structured in one of three ways:

Two-Tailed        Left-Tailed       Right-Tailed
H0: β1 = 0        H0: β1 = 0        H0: β1 = 0
H1: β1 ≠ 0        H1: β1 < 0        H1: β1 > 0



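Step 2: Select a level of significance, α.

Step 3: Compute the test statistic

$$t_0 = \frac{b_1 - \beta_1}{s_{b_1}} = \frac{b_1}{s_{b_1}}, \qquad s_{b_1} = \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}$$

which follows Student’s t-distribution with n − 2 degrees of freedom. Remember, when computing the test statistic, we assume the null hypothesis to be true, so we assume that β1 = 0. Use Table VII to determine the critical value using n − 2 degrees of freedom.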



Classical Approach

[Figures: critical regions for the two-tailed, left-tailed, and right-tailed tests]


Step 4: Compare the critical value with the test statistic.




P-value Approach

By Hand Step 3: Compute the test statistic $t_0 = b_1 / s_{b_1}$.



[Figures: P-values shown as areas under the t-distribution for the two-tailed, left-tailed, and right-tailed tests]


Technology Step 3: Use a statistical spreadsheet or calculator with statistical capabilities to obtain the P-value. The directions for obtaining the P-value using the TI-83/84 Plus graphing calculators, Minitab, Excel, and StatCrunch are in the Technology Step-by-Step in the text.




Step 4: If the P-value < α, reject the null hypothesis.




Step 5: State the conclusion.




CAUTION!

Before testing H0: β1 = 0, be sure to draw a residual plot to verify that a linear model is appropriate.




Parallel Example 5: Testing for a Linear Relation

Test the claim that there is a linear relation between drill depth and drill time at the α = 0.05 level of significance using the drilling data.




Solution

Verify the requirements:

• We assume that the experiment was randomized so that the data can be assumed to represent a random sample.

• In Parallel Example 4 we confirmed that the residuals were normally distributed by constructing a normal probability plot.

• To verify the requirement of constant error variance, we plot the residuals against the explanatory variable, drill depth.



[Figure: residuals plotted against drill depth]

There is no discernible pattern.




Step 1: We want to determine whether a linear relation exists between drill depth and drill time without regard to the sign of the slope. This is a two-tailed test with

H0: β1 = 0 versus H1: β1 ≠ 0

Step 2: The level of significance is α = 0.05.



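Step 3: The slope of the least-squares regression line is b1 = 0.0116 and its standard error is sb1 = 0.003, so the test statistic is

$$t_0 = \frac{b_1}{s_{b_1}} = \frac{0.0116}{0.003} = 3.867$$
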

Solution: Classical Approach

Step 3, cont’d: Since this is a two-tailed test, we determine the critical t-values at the α = 0.05 level of significance with n − 2 = 12 − 2 = 10 degrees of freedom to be −t0.025 = −2.228 and t0.025 = 2.228.

Step 4: Since the value of the test statistic, 3.867, is greater than 2.228, we reject the null hypothesis.




Solution: P-value Approach

Step 3: Since this is a two-tailed test, the P-value is the sum of the area under the t-distribution with 12 − 2 = 10 degrees of freedom to the left of −t0 = −3.867 and to the right of t0 = 3.867. Using Table VII we find that with 10 degrees of freedom, the value 3.867 is between 3.581 and 4.144 corresponding to right-tail areas of 0.0025 and 0.001, respectively. Thus, the P-value is between 0.002 and 0.005.

Step 4: Since the P-value is less than the level of significance, 0.05, we reject the null hypothesis.




Step 5: There is sufficient evidence at the α = 0.05 level of significance to conclude that a linear relation exists between drill depth and drill time.
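The test in Parallel Example 5 can be reproduced with the sketch below. It uses only the Python standard library, so the two-tailed P-value is obtained by numerically integrating the t density with Simpson's rule; that numerical method and the helper names `t_pdf` and `two_tailed_p` are our own choices, not part of the text. The critical value 2.228 is taken from the slides. Because the code works with unrounded values, it gives t0 ≈ 3.85, whereas the slides obtain 3.867 from the rounded values 0.0116 and 0.003.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]
n = len(depth)
T_CRIT = 2.228  # t_0.025 with 10 degrees of freedom, as given on the slides

# Least-squares fit and standard error of the estimate.
x_bar, y_bar = sum(depth) / n, sum(time) / n
s_xx = sum((x - x_bar) ** 2 for x in depth)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, time)) / s_xx
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(depth, time)) / (n - 2))

# Test statistic under H0: beta1 = 0.
s_b1 = s_e / math.sqrt(s_xx)
t0 = b1 / s_b1

def t_pdf(t, df):
    """Density of Student's t-distribution with df degrees of freedom."""
    c = math.gamma((df + 1) / 2) / (math.sqrt(df * math.pi) * math.gamma(df / 2))
    return c * (1 + t * t / df) ** (-(df + 1) / 2)

def two_tailed_p(t, df, steps=2000):
    """P(|T| > |t|), using Simpson's rule on [0, |t|] (steps must be even)."""
    a = abs(t)
    h = a / steps
    total = t_pdf(0, df) + t_pdf(a, df)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * t_pdf(i * h, df)
    central_area = 2 * total * h / 3  # area between -|t| and |t|
    return 1 - central_area

p_value = two_tailed_p(t0, n - 2)
print(f"t0 = {t0:.3f}, P-value = {p_value:.4f}, reject H0: {abs(t0) > T_CRIT}")
```

Both approaches agree: the test statistic exceeds the critical value, and the P-value falls in the range 0.002 to 0.005 found from Table VII.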


14.1.5 Construct a Confidence Interval about the Slope of the Least-Squares Regression Model

Confidence Intervals for the Slope of the Regression Line

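A (1 − α) • 100% confidence interval for the slope of the true regression line, β1, is given by the following formulas:

$$\text{Lower bound} = b_1 - t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}, \qquad \text{Upper bound} = b_1 + t_{\alpha/2} \cdot \frac{s_e}{\sqrt{\sum (x_i - \bar{x})^2}}$$

Here, tα/2 is computed with n − 2 degrees of freedom.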



Note: The confidence interval formula for β1 can be computed only if the data are randomly obtained, the residuals are normally distributed, and there is constant error variance.



Parallel Example 7: Constructing a Confidence Interval for the Slope of the True Regression Line

Construct a 95% confidence interval for the slope of the least-squares regression line for the drilling example.



Solution

The requirements for the usage of the confidence interval formula were verified in previous examples.




Since t0.025 = 2.228 for 10 degrees of freedom, we have

Lower bound = 0.0116 − 2.228 • 0.003 = 0.0049

Upper bound = 0.0116 + 2.228 • 0.003 = 0.0183.

We are 95% confident that the mean increase in the time it takes to drill 5 feet for each additional foot of depth at which the drilling begins is between 0.005 and 0.018 minutes.
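As a rough check of this interval, the bounds can be recomputed from the raw drilling data. The sketch below is illustrative only: the variable names are ours, and the critical value 2.228 is copied from the slides rather than computed.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]
n = len(depth)
T_CRIT = 2.228  # t_0.025 with n - 2 = 10 degrees of freedom (from the slides)

x_bar, y_bar = sum(depth) / n, sum(time) / n
s_xx = sum((x - x_bar) ** 2 for x in depth)
b1 = sum((x - x_bar) * (y - y_bar) for x, y in zip(depth, time)) / s_xx
b0 = y_bar - b1 * x_bar
s_e = math.sqrt(sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(depth, time)) / (n - 2))

# Confidence interval for the slope: b1 +/- t * s_e / sqrt(sum (x_i - x_bar)^2).
margin = T_CRIT * s_e / math.sqrt(s_xx)
lower, upper = b1 - margin, b1 + margin
print(f"95% CI for the slope: ({lower:.4f}, {upper:.4f})")
```

Small differences in the last digit relative to the slides arise because the slides round b1 and its standard error before computing the bounds.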


14.2 Confidence and Prediction Intervals

Learning Objectives

1. Construct confidence intervals for a mean response

2. Construct prediction intervals for an individual response


Introduction

Confidence intervals for a mean response are intervals constructed about the predicted value of y, at a given level of x, that are used to measure the accuracy of the mean response of all the individuals in the population.

Prediction intervals for an individual response are intervals constructed about the predicted value of y that are used to measure the accuracy of a single individual’s predicted value.


14.2.1 Construct Confidence Intervals for a Mean Response

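A (1 − α) • 100% confidence interval for the mean response of y at x = x* is given by

$$\hat{y} \pm t_{\alpha/2} \cdot s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

where ŷ is the predicted value at x*, x* is the given value of the explanatory variable, n is the number of observations, and tα/2 is the critical value with n − 2 degrees of freedom.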



Parallel Example 1: Constructing a Confidence Interval for a Mean Response

Construct a 95% confidence interval about the predicted mean time to drill 5 feet for all drillings started at a depth of 110 feet.



Solution

We are 95% confident that the mean time to drill 5 feet for all drillings started at a depth of 110 feet is between 6.45 and 7.15 minutes.
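The interval above can be reproduced from the drilling data with a short sketch. The function name `mean_response_interval` is hypothetical, and the critical value 2.228 (10 degrees of freedom) is taken from Section 14.1 of the slides.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]
T_CRIT = 2.228  # t_0.025 with 10 degrees of freedom (from the slides)

def mean_response_interval(x, y, x_star, t_crit):
    """Confidence interval for the mean response at x = x_star."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y_hat = b0 + b1 * x_star
    margin = t_crit * s_e * math.sqrt(1 / n + (x_star - x_bar) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin

lower, upper = mean_response_interval(depth, time, x_star=110, t_crit=T_CRIT)
print(f"95% CI for the mean response at 110 feet: ({lower:.2f}, {upper:.2f})")
```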


14.2.2 Construct Prediction Intervals for an Individual Response

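A (1 − α) • 100% prediction interval for an individual response of y at x = x* is given by

$$\hat{y} \pm t_{\alpha/2} \cdot s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x_i - \bar{x})^2}}$$

where ŷ is the predicted value at x*, x* is the given value of the explanatory variable, n is the number of observations, and tα/2 is the critical value with n − 2 degrees of freedom.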



Parallel Example 2: Constructing a Prediction Interval for an Individual Response

Construct a 95% prediction interval about the predicted time to drill 5 feet for a single drilling started at a depth of 110 feet.



Solution

We are 95% confident that the time to drill 5 feet for a random drilling started at a depth of 110 feet is between 5.59 and 8.01 minutes.
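A prediction interval differs from the confidence interval for a mean response only by the extra 1 under the square root, which accounts for the variability of a single observation. The sketch below illustrates this with the drilling data; the function name `prediction_interval` is our own, and 2.228 is the critical value quoted on the slides.

```python
import math

depth = [35, 50, 75, 95, 120, 130, 145, 155, 160, 175, 185, 190]
time = [5.88, 5.99, 6.74, 6.10, 7.47, 6.93, 6.42, 7.97, 7.92, 7.62, 6.89, 7.90]
T_CRIT = 2.228  # t_0.025 with 10 degrees of freedom (from the slides)

def prediction_interval(x, y, x_star, t_crit):
    """Prediction interval for an individual response at x = x_star."""
    n = len(x)
    x_bar, y_bar = sum(x) / n, sum(y) / n
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / s_xx
    b0 = y_bar - b1 * x_bar
    s_e = math.sqrt(sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y)) / (n - 2))
    y_hat = b0 + b1 * x_star
    # The leading 1 is what widens this interval relative to the mean-response interval.
    margin = t_crit * s_e * math.sqrt(1 + 1 / n + (x_star - x_bar) ** 2 / s_xx)
    return y_hat - margin, y_hat + margin

lower, upper = prediction_interval(depth, time, x_star=110, t_crit=T_CRIT)
print(f"95% prediction interval at 110 feet: ({lower:.2f}, {upper:.2f})")
```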


14.3 Introduction to Multiple Regression

Learning Objectives

1. Obtain the correlation matrix

2. Use technology to find a multiple regression equation

3. Interpret the coefficients of a multiple regression equation

4. Determine R² and adjusted R²

5. Perform an F-test for lack of fit

6. Test individual regression coefficients for significance

7. Construct confidence and prediction intervals


14.3.1 Obtain the Correlation Matrix

The multiple regression model is given by

$$y_i = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki} + \varepsilon_i$$

where

– yi is the value of the response variable for the ith individual

– x1i is the ith observation for the first explanatory variable, x2i is the ith observation for the second explanatory variable, and so on

– β0, β1,…, βk are the parameters to be estimated based on sample data

– εi is a random error term that is normally distributed with mean 0 and standard deviation σ

– The error terms are independent, and i = 1,…, n, where n is the sample size.



A correlation matrix shows the linear correlation between each pair of variables under consideration in a multiple regression model.



Multicollinearity exists between two explanatory variables if they have a high linear correlation.



CAUTION!

If two explanatory variables in the regression model are highly correlated with each other, watch out for strange results in the regression output.



Parallel Example 1: Constructing a Correlation Matrix

As cheese ages, various chemical processes take place that determine the taste of the final product. The table below gives the concentrations of various chemicals in 30 samples of mature cheddar cheese and a subjective measure of taste for each sample.

Source: Moore, David S., and George P. McCabe (1989)

Obs   Taste   ln(Acetic)   ln(H2S)   Lactic
  1    12.3      4.543       3.135     0.86
  2    20.9      5.159       5.043     1.53
  3    39.0      5.366       5.438     1.57
  4    47.9      5.759       7.496     1.81
  5     5.6      4.663       3.807     0.99
  6    25.9      5.697       7.601     1.09
  7    37.3      5.892       8.726     1.29
  8    21.9      6.078       7.966     1.78
  9    18.1      4.898       3.850     1.29
 10    21.0      5.242       4.174     1.58
 11    34.9      5.740       6.142     1.68
 12    57.2      6.446       7.908     1.90
 13     0.7      4.477       2.996     1.06
 14    25.9      5.236       4.942     1.30
 15    54.9      6.151       6.752     1.52
 16    40.9      6.365       9.588     1.74
 17    15.9      4.787       3.912     1.16
 18     6.4      5.412       4.700     1.49
 19    18.0      5.247       6.174     1.63
 20    38.9      5.438       9.064     1.99
 21    14.0      4.564       4.949     1.15
 22    15.2      5.298       5.220     1.33
 23    32.0      5.455       9.242     1.44
 24    56.7      5.855      10.199     2.01
 25    16.8      5.366       3.664     1.31
 26    11.6      6.043       3.219     1.46
 27    26.5      6.458       6.962     1.72
 28     0.7      5.328       3.912     1.25
 29    13.4      5.802       6.685     1.08
 30     5.5      6.176       4.787     1.25



Solution

The following correlation matrix is from MINITAB:

Correlations: taste, Acetic, H2S, Lactic

          taste   Acetic     H2S
Acetic    0.550
H2S       0.756    0.618
Lactic    0.704    0.604   0.645
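For readers without MINITAB, a correlation matrix like the one above can be computed directly from the cheese data. The following is a minimal standard-library sketch; the names `cheese`, `pearson`, and `corr` are chosen here for illustration.

```python
import math

# Cheese data from the slides: (taste, ln(Acetic), ln(H2S), Lactic) for 30 samples.
cheese = [
    (12.3, 4.543, 3.135, 0.86), (20.9, 5.159, 5.043, 1.53), (39.0, 5.366, 5.438, 1.57),
    (47.9, 5.759, 7.496, 1.81), (5.6, 4.663, 3.807, 0.99), (25.9, 5.697, 7.601, 1.09),
    (37.3, 5.892, 8.726, 1.29), (21.9, 6.078, 7.966, 1.78), (18.1, 4.898, 3.850, 1.29),
    (21.0, 5.242, 4.174, 1.58), (34.9, 5.740, 6.142, 1.68), (57.2, 6.446, 7.908, 1.90),
    (0.7, 4.477, 2.996, 1.06), (25.9, 5.236, 4.942, 1.30), (54.9, 6.151, 6.752, 1.52),
    (40.9, 6.365, 9.588, 1.74), (15.9, 4.787, 3.912, 1.16), (6.4, 5.412, 4.700, 1.49),
    (18.0, 5.247, 6.174, 1.63), (38.9, 5.438, 9.064, 1.99), (14.0, 4.564, 4.949, 1.15),
    (15.2, 5.298, 5.220, 1.33), (32.0, 5.455, 9.242, 1.44), (56.7, 5.855, 10.199, 2.01),
    (16.8, 5.366, 3.664, 1.31), (11.6, 6.043, 3.219, 1.46), (26.5, 6.458, 6.962, 1.72),
    (0.7, 5.328, 3.912, 1.25), (13.4, 5.802, 6.685, 1.08), (5.5, 6.176, 4.787, 1.25),
]
names = ["taste", "Acetic", "H2S", "Lactic"]
columns = {name: [row[i] for row in cheese] for i, name in enumerate(names)}

def pearson(u, v):
    """Linear correlation coefficient between two equal-length lists."""
    u_bar, v_bar = sum(u) / len(u), sum(v) / len(v)
    num = sum((a - u_bar) * (b - v_bar) for a, b in zip(u, v))
    den = math.sqrt(sum((a - u_bar) ** 2 for a in u) * sum((b - v_bar) ** 2 for b in v))
    return num / den

# Correlation for every pair of variables, keyed by (row name, column name).
corr = {(a, b): pearson(columns[a], columns[b]) for a in names for b in names}

for a in names[1:]:
    row = "  ".join(f"{corr[(a, b)]:.3f}" for b in names[: names.index(a)])
    print(f"{a:<8}{row}")
```

The correlations among the three explanatory variables (roughly 0.6 each) are the entries to inspect when looking for multicollinearity.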


14.3.2 Use Technology to Find a Multiple Regression Equation

2. Draw residual plots and a boxplot of the residuals to assess the adequacy of the model.



Solution

2. None of the residual plots show any discernible pattern, and the boxplot does not show any outliers. Therefore, the linear model is appropriate.
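The slides rely on statistical software for the fit itself. As a hedged illustration of what that software does, the sketch below solves the normal equations (XᵀX)b = Xᵀy by Gaussian elimination using only the Python standard library. The helper `solve` and the variable names are our own; production code would normally use a numerical library instead.

```python
# Cheese data from the slides: (taste, ln(Acetic), ln(H2S), Lactic) for 30 samples.
cheese = [
    (12.3, 4.543, 3.135, 0.86), (20.9, 5.159, 5.043, 1.53), (39.0, 5.366, 5.438, 1.57),
    (47.9, 5.759, 7.496, 1.81), (5.6, 4.663, 3.807, 0.99), (25.9, 5.697, 7.601, 1.09),
    (37.3, 5.892, 8.726, 1.29), (21.9, 6.078, 7.966, 1.78), (18.1, 4.898, 3.850, 1.29),
    (21.0, 5.242, 4.174, 1.58), (34.9, 5.740, 6.142, 1.68), (57.2, 6.446, 7.908, 1.90),
    (0.7, 4.477, 2.996, 1.06), (25.9, 5.236, 4.942, 1.30), (54.9, 6.151, 6.752, 1.52),
    (40.9, 6.365, 9.588, 1.74), (15.9, 4.787, 3.912, 1.16), (6.4, 5.412, 4.700, 1.49),
    (18.0, 5.247, 6.174, 1.63), (38.9, 5.438, 9.064, 1.99), (14.0, 4.564, 4.949, 1.15),
    (15.2, 5.298, 5.220, 1.33), (32.0, 5.455, 9.242, 1.44), (56.7, 5.855, 10.199, 2.01),
    (16.8, 5.366, 3.664, 1.31), (11.6, 6.043, 3.219, 1.46), (26.5, 6.458, 6.962, 1.72),
    (0.7, 5.328, 3.912, 1.25), (13.4, 5.802, 6.685, 1.08), (5.5, 6.176, 4.787, 1.25),
]

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):  # back substitution
        x[r] = (m[r][n] - sum(m[r][c] * x[c] for c in range(r + 1, n))) / m[r][r]
    return x

# Design matrix rows: [1, ln(Acetic), ln(H2S), Lactic]; response: taste.
X = [[1.0, acetic, h2s, lactic] for _, acetic, h2s, lactic in cheese]
y = [taste for taste, *_ in cheese]

k = len(X[0])
xtx = [[sum(row[i] * row[j] for row in X) for j in range(k)] for i in range(k)]
xty = [sum(row[i] * yi for row, yi in zip(X, y)) for i in range(k)]
b0, b1, b2, b3 = solve(xtx, xty)
print(f"taste = {b0:.2f} + {b1:.3f} Acetic + {b2:.3f} H2S + {b3:.3f} Lactic")
```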


14.3.3 Interpret the Coefficients of a Multiple Regression Equation

Parallel Example 3: Interpreting Regression Coefficients

Interpret the regression coefficients for the least-squares regression equation found in Parallel Example 2.



Solution

• Since b1 = 0.328, for every 1-unit increase in the natural logarithm of acetic acid concentration, the cheese’s taste score is predicted to increase by 0.328 on average, assuming that the hydrogen sulfide and lactic acid concentrations remain unchanged.

• Since b2 = 3.912, for every 1-unit increase in the natural logarithm of hydrogen sulfide concentration, the cheese’s taste score is predicted to increase by 3.912 on average, assuming that the acetic acid and lactic acid concentrations remain unchanged.

• Since b3 = 19.671, for every 1-unit increase in lactic acid concentration, the cheese’s taste score is predicted to increase by 19.671 on average, assuming that the hydrogen sulfide and acetic acid concentrations remain unchanged.



If the change in the mean value of the response variable y associated with a 1-unit change in one explanatory variable depends on the value of a second explanatory variable, there is interaction between the two explanatory variables. When interaction exists between two explanatory variables, x1 and x2, we introduce a term with the variable x1x2 in the regression model as an explanatory variable.



An indicator (or dummy) variable is a qualitative explanatory variable in a multiple regression model that takes on the value 0 or 1.



In general, if there are c categories for a qualitative explanatory variable, the regression model will require c − 1 indicator variables, each taking on a value of 0 or 1.
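The c − 1 rule can be illustrated with a small sketch. The variable `milk` and its three categories are made up for this example and are not part of the cheese data; `indicator_columns` is likewise a hypothetical helper.

```python
def indicator_columns(values, baseline):
    """Encode a qualitative variable with c categories as c - 1 indicator (0/1) columns.

    The baseline category gets no column of its own: it is the case where
    every indicator equals 0.
    """
    categories = sorted(set(values))
    others = [c for c in categories if c != baseline]
    return {c: [1 if v == c else 0 for v in values] for c in others}

# Hypothetical qualitative explanatory variable with c = 3 categories.
milk = ["cow", "goat", "sheep", "cow", "goat"]
columns = indicator_columns(milk, baseline="cow")
print(columns)
```

Using all c indicators together with an intercept would make the columns sum to the intercept column, which is why one category is left out as the baseline.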


14.3.4 Determine R² and Adjusted R²




The value of R² never decreases when one more explanatory variable is added to the model, whether or not that variable is a useful predictor.



CAUTION!

Never use R² to compare regression models with a different number of explanatory variables. Rather, use the adjusted R².



Parallel Example 4: Coefficient of Determination

For the regression model obtained in Parallel Example 2, determine the coefficient of determination and the adjusted R².



Regression Analysis: taste versus Acetic, H2S, Lactic

The regression equation is
taste = −28.9 + 0.33 Acetic + 3.91 H2S + 19.7 Lactic

Predictor      Coef   SE Coef       T       P
Constant     −28.88     19.74   −1.46   0.155
Acetic        0.328     4.460    0.07   0.942
H2S           3.912     1.248    3.13   0.004
Lactic       19.671     8.629    2.28   0.031

S = 10.1307   R-Sq = 65.2%   R-Sq(adj) = 61.2%

Analysis of Variance

Source            DF       SS       MS       F       P
Regression         3   4994.5   1664.8   16.22   0.000
Residual Error    26   2668.4    102.6
Total             29   7662.9



Solution

From the MINITAB output, R² = 65.2% and the adjusted R² = 61.2%.
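Both quantities can be recovered from the sums of squares in the Analysis of Variance table. The sketch below assumes the usual definitions R² = SSR/SST and adjusted R² = 1 − [(n − 1)/(n − k − 1)](1 − R²); the variable names are ours.

```python
# Sums of squares from the MINITAB Analysis of Variance table.
ss_regression = 4994.5
ss_total = 7662.9
n = 30  # number of cheese samples
k = 3   # explanatory variables: Acetic, H2S, Lactic

r_squared = ss_regression / ss_total
adjusted_r_squared = 1 - (n - 1) / (n - k - 1) * (1 - r_squared)

print(f"R-Sq = {r_squared:.1%}, R-Sq(adj) = {adjusted_r_squared:.1%}")
```

The adjustment penalizes each additional explanatory variable, so the adjusted value is the one to compare across models of different sizes.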


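14.3.5 Perform an F-Test for Lack of Fit

The test statistic is

$$F_0 = \frac{\text{mean square due to regression}}{\text{mean square error}} = \frac{MSR}{MSE}$$

with k degrees of freedom in the numerator and n − k − 1 degrees of freedom in the denominator, where k is the number of explanatory variables and n is the sample size.

The test statistic can also be written in terms of the coefficient of determination:

$$F_0 = \frac{R^2}{1 - R^2} \cdot \frac{n - (k + 1)}{k}$$

where

R² is the coefficient of determination

k is the number of explanatory variables

n is the sample size.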



Decision Rule for Testing H0: β1 = β2 = ··· = βk = 0

If the P-value is less than the level of significance, α, reject the null hypothesis. Otherwise, do not reject the null hypothesis.




The null hypothesis states that there is no linear relation between the explanatory variables and the response variable. The alternative hypothesis states that there is a linear relation between at least one explanatory variable and the response variable.



Parallel Example 5: Inference on the Regression Model

Test H0: β1 = β2 = β3 = 0 versus H1: at least one βi ≠ 0 for the multiple regression model for the cheese taste data.



Solution

We must first determine whether it is reasonable to believe that the residuals are normally distributed with no outliers.









Solution

Although there appears to be one outlier, the sample size is large enough for this to not be of great concern. We look at the P-value associated with the F-test statistic from the MINITAB output.

Since the P-value < 0.001, we reject H0 and conclude that at least one of the regression coefficients is different from zero.
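The F-test statistic behind this P-value can be recomputed from the Analysis of Variance table in two equivalent ways, as the ratio of mean squares and from R². This is a minimal sketch with variable names of our own choosing; it does not compute the P-value, which requires the F-distribution.

```python
# Values from the MINITAB Analysis of Variance table for the cheese data.
ss_regression, ss_error, ss_total = 4994.5, 2668.4, 7662.9
n, k = 30, 3  # sample size and number of explanatory variables

# Form 1: ratio of mean squares, with k and n - k - 1 degrees of freedom.
ms_regression = ss_regression / k
ms_error = ss_error / (n - k - 1)
f_from_ms = ms_regression / ms_error

# Form 2: the same statistic expressed through the coefficient of determination.
r_squared = ss_regression / ss_total
f_from_r2 = r_squared / (1 - r_squared) * (n - (k + 1)) / k

print(f"F0 = {f_from_ms:.2f} (mean squares), {f_from_r2:.2f} (from R-squared)")
```

The degrees of freedom used here, 3 and 26, match the DF column of the MINITAB output.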



CAUTION!

If we reject the null hypothesis that all the slope coefficients are zero, then we are saying that at least one of the slopes is different from zero, not that they all are different from zero.


14.3.6 Test Individual Regression Coefficients for Significance

Parallel Example 6: Testing the Significance of Individual Predictor Variables

Test the following hypotheses for the cheese taste data:

a) H0: β1 = 0 versus H1: β1 ≠ 0

b) H0: β2 = 0 versus H1: β2 ≠ 0

c) H0: β3 = 0 versus H1: β3 ≠ 0









Solution

We will again use the MINITAB output.

a) The test statistic for acetic acid is 0.07 with a P-value of 0.942 so we fail to reject H0.

b) The test statistic for hydrogen sulfide is 3.13 with a P-value of 0.004 so we reject H0.

c) The test statistic for lactic acid is 2.28 with a P-value of 0.031 so we reject H0.

We conclude that the natural logarithm of hydrogen sulfide concentration and lactic acid concentration are useful predictors for taste, but the natural logarithm of acetic acid concentration is not.
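Each test statistic in the MINITAB output is the coefficient divided by its standard error. The sketch below reproduces the T column and the decisions; the level of significance α = 0.05 is an assumption, since the example does not state it explicitly.

```python
# (Coef, SE Coef, P) for each predictor, copied from the MINITAB output.
output = {
    "Acetic": (0.328, 4.460, 0.942),
    "H2S": (3.912, 1.248, 0.004),
    "Lactic": (19.671, 8.629, 0.031),
}
alpha = 0.05  # assumed level of significance

results = {}
for name, (coef, se_coef, p_value) in output.items():
    t0 = coef / se_coef  # test statistic for H0: beta_i = 0
    results[name] = (round(t0, 2), p_value < alpha)
    decision = "reject H0" if p_value < alpha else "do not reject H0"
    print(f"{name:<7} t0 = {t0:5.2f}  P = {p_value:.3f}  {decision}")
```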



We refit the model using the natural logarithm of hydrogen sulfide concentration and lactic acid concentration to obtain


14.3.7 Construct Confidence and Prediction Intervals

Parallel Example 7: Constructing Confidence and Prediction Intervals

a) Construct a 95% confidence interval for the mean taste score of all cheddar cheeses whose natural logarithm of hydrogen sulfide concentration is 5.5 and whose lactic acid concentration is 1.75.

b) Construct a 95% prediction interval for the taste score of an individual cheddar cheese whose natural logarithm of hydrogen sulfide concentration is 5.5 and whose lactic acid concentration is 1.75.



Solution

Predicted Values for New Observations

New Obs     Fit   SE Fit           95% CI           95% PI
      1   28.92     3.34   (22.07, 35.76)   (7.40, 50.43)

Values of Predictors for New Observations

New Obs    H2S   Lactic
      1   5.50     1.75



Solution

Based on the MINITAB output, we are 95% confident that the mean taste score of all cheddar cheeses with ln(hydrogen sulfide) = 5.5 and a lactic acid concentration of 1.75 is between 22.07 and 35.76. We are 95% confident that the taste score of an individual cheddar cheese with ln(hydrogen sulfide) = 5.5 and a lactic acid concentration of 1.75 will be between 7.40 and 50.43.
