Multiple Regression continued… STAT E-150 Statistical Methods

Upload: florence-higgins

Post on 23-Dec-2015


Multiple Regression continued…

STAT E-150 Statistical Methods


When we discussed simple linear regression, we briefly introduced prediction intervals and confidence intervals:

Confidence Intervals and Prediction Intervals 

Let $x^*$ be a specific value of x. The predicted value of y is $\hat{y} = b_0 + b_1 x^*$.

We can create two different intervals:

a prediction interval for an individual value of y at $x^*$

a confidence interval for the mean predicted value at $x^*$


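The basic format for an interval at a specific value $x^*$ of x is

$\hat{y} \pm t^* \cdot SE$

where $t^*$ is the critical value from the t-distribution with $n - 2$ degrees of freedom and $s_e$ (used below) is the standard error of the regression.

When we want to find a mean predicted value,

$SE_{\hat{\mu}} = s_e \sqrt{\frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$

When we want to find an individual predicted value,

$SE_{\hat{y}} = s_e \sqrt{1 + \frac{1}{n} + \frac{(x^* - \bar{x})^2}{\sum (x - \bar{x})^2}}$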
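The interval format on this slide can be sketched in a few lines of Python. This is a minimal illustration, assuming the usual simple-linear-regression standard errors (spelled out in the comments). The function name `regression_intervals`, the five data points, and the choice to pass the critical value `t_star` in by hand are all assumptions made for the example; the data are invented and are not the birthweight data used in the course.

```python
import math
from statistics import mean

def regression_intervals(x, y, x_star, t_star):
    """Return (y_hat, confidence_interval, prediction_interval) at x_star.

    t_star is the critical t-value for n - 2 degrees of freedom. It is
    passed in (for example from a t-table) so that only the standard
    library is needed.
    """
    n = len(x)
    x_bar, y_bar = mean(x), mean(y)
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    # Least-squares slope and intercept.
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
    b0 = y_bar - b1 * x_bar
    # Standard error of the regression: sqrt(SSE / (n - 2)).
    sse = sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(x, y))
    s_e = math.sqrt(sse / (n - 2))
    y_hat = b0 + b1 * x_star
    # Shared term: 1/n + (x* - x_bar)^2 / sum((x - x_bar)^2).
    core = 1 / n + (x_star - x_bar) ** 2 / sxx
    se_mean = s_e * math.sqrt(core)       # SE for the mean response
    se_indiv = s_e * math.sqrt(1 + core)  # SE for an individual response
    ci = (y_hat - t_star * se_mean, y_hat + t_star * se_mean)
    pi = (y_hat - t_star * se_indiv, y_hat + t_star * se_indiv)
    return y_hat, ci, pi

# Invented illustration data (NOT the data from the slides).
ages = [15, 16, 17, 18, 19]
weights = [2500, 2800, 3000, 3300, 3400]
T_STAR = 3.182  # t* for 95% confidence with 5 - 2 = 3 degrees of freedom

y_hat, ci, pi = regression_intervals(ages, weights, 16, T_STAR)
print(y_hat, ci, pi)
```

The only difference between the two intervals is the extra `1 +` under the square root, which is why the prediction interval is always the wider of the two.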


Let us return to our earlier discussion of the age of adolescent mothers and the weight of their babies. We found that there was a linear relationship between these variables:

weight = 245.15 age – 1163.45

 

How can we use this model to make predictions?


Suppose we want to predict the weight of a baby born to a mother who is 16 years old. When we analyze the data, we can choose to save the predicted values, the confidence interval and the prediction interval for each predictor value. The results will appear in the datasheet:

The datasheet shows these columns: x-value, predicted y-value, 95% confidence interval, 95% prediction interval.


What weight is expected for a baby of a 16 year old mother? 2759 g
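As a quick check, the predicted value can be reproduced by substituting age = 16 into the fitted equation weight = 245.15 age – 1163.45. The function name `predicted_weight` is an illustrative choice, not something from the course materials.

```python
def predicted_weight(age):
    """Predicted birthweight in grams from the fitted line in the slides."""
    return 245.15 * age - 1163.45

weight_16 = predicted_weight(16)
print(round(weight_16, 2))  # about 2758.95, which rounds to the 2759 g above
```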


What is the prediction interval estimate for the weight of a baby of a 16 year old mother? 2251.24 to 3266.66 g

What does it tell you? We are 95% confident that the birthweight of a baby born to a 16 year old mother is between 2251.24 and 3266.66 g.


What is the confidence interval estimate for the mean weight of babies of 16 year old mothers? 2575.59 to 2942.31 g

What does it tell you? We are 95% confident that the mean birthweight of babies born to 16 year old mothers is between 2575.59 and 2942.31 g.


The 95% confidence interval is (2575.59, 2942.31)

The 95% prediction interval is (2251.24, 3266.66)

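Which interval is wider? Why?

The prediction interval is wider, because means vary less than individual values: a single baby's weight varies around the mean weight for its mother's age, so the prediction interval must cover that extra variability in addition to the uncertainty in the estimated mean.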
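The comparison can be made concrete with the two intervals reported above. The short sketch below (helper names `midpoint` and `width` are illustrative) shows that both intervals are centered on the same predicted value and differ only in width.

```python
ci = (2575.59, 2942.31)  # 95% confidence interval from the slides
pi = (2251.24, 3266.66)  # 95% prediction interval from the slides

def midpoint(interval):
    return (interval[0] + interval[1]) / 2

def width(interval):
    return interval[1] - interval[0]

ci_mid, pi_mid = midpoint(ci), midpoint(pi)
ci_width, pi_width = width(ci), width(pi)

# Both midpoints equal the predicted weight (about 2758.95 g);
# the prediction interval is roughly 2.8 times as wide.
print(ci_mid, pi_mid, ci_width, pi_width, pi_width / ci_width)
```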


In the data concerning body fat percentages in men, the predictor variables were waist and height, and we found a regression equation which we can now use to make predictions:

%BodyFat = 1.773 waist – .601 height – 3.110

We can find prediction intervals and confidence intervals as we did when we used a single predictor.


Suppose we want to predict the body fat percentage associated with a waist size of 34 inches and a height of 6 feet. We can proceed as we did with a single predictor, by entering these values in the data window, and then saving the results of the linear regression analysis.


When you scroll to the right, you will see these results:

What is the predicted body fat %? 13.874%
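The prediction can be approximated by hand from the regression equation %BodyFat = 1.773 waist – .601 height – 3.110. The sketch below assumes height is measured in inches, so 6 feet is entered as 72; the function name is illustrative. Using the rounded coefficients gives about 13.90, slightly different from the 13.874 reported by the software, presumably because the software uses unrounded coefficients.

```python
def predicted_body_fat(waist_in, height_in):
    """Predicted body fat percentage from the rounded coefficients."""
    return 1.773 * waist_in - 0.601 * height_in - 3.110

height_in = 6 * 12  # assumption: height is recorded in inches
pred = predicted_body_fat(34, height_in)
print(round(pred, 3))  # close to, but not exactly, the reported 13.874
```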


What is the prediction interval? What does it tell you?

The 95% prediction interval is (5.05, 22.69)


We are 95% confident that a man who is 6 feet tall and has a 34 inch waist will have a body fat percentage between 5.05 and 22.69.


What is the confidence interval? What does it tell you?

The 95% confidence interval is (13.10, 14.65)


We are 95% confident that the mean body fat percentage for men who are 6 feet tall and have a 34 inch waist is between 13.10 and 14.65.


Models with Categorical Predictors

Categorical (or qualitative) variables can also be included in multiple regression models. These variables are coded as numbers so that we can employ the methods we have discussed. These coded values are called indicator variables or dummy variables.

They are often coded using 0 and 1, where

0 = absence, or "no"

1 = presence, or "yes"
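A minimal sketch of this coding in Python follows. The category labels and the list of colleges are hypothetical, and `indicator` is an illustrative helper rather than a function from any statistics package.

```python
def indicator(value, category):
    """Return 1 if the observation belongs to the category, otherwise 0."""
    return 1 if value == category else 0

# Hypothetical observations of a categorical variable.
college_types = ["single-sex", "coed", "coed", "single-sex"]

# Indicator (dummy) variable: 1 = single-sex, 0 = coeducational.
x3 = [indicator(t, "single-sex") for t in college_types]
print(x3)  # [1, 0, 0, 1]
```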


Example: One way colleges measure success is by graduation rates. The Education Trust publishes 6-year graduation rates along with other college characteristics on its website, www.collegeresults.org.


Here is a sample of the data, which represents a random sample of 22 colleges selected from the 1037 colleges in the United States with enrollments under 5000 students:


We define these variables:

y = 6-year graduation rate

x1 = median SAT score of students accepted to the college

x2 = student-related expense per full-time student (in dollars)

x3 = 1 if the college is single-sex, 0 if it is coeducational


The regression model is y = β0 + β1x1 + β2x2 + β3x3 + ε

For single-sex colleges:

Rate = β0 + β1 SAT + β2 Expense + β3(1) + ε = β0 + β1 SAT + β2 Expense + β3 + ε

For coeducational colleges:

Rate = β0 + β1 SAT + β2 Expense + β3(0) + ε = β0 + β1 SAT + β2 Expense + ε

In either case, the slopes are determined using data from both types of colleges.

For single-sex colleges, the intercept is β0 + β3:

Rate = β0 + β1 SAT + β2 Expense + β3(1) + ε = (β0 + β3) + β1 SAT + β2 Expense + ε

For coeducational colleges, the intercept is β0:

Rate = β0 + β1 SAT + β2 Expense + β3(0) + ε = β0 + β1 SAT + β2 Expense + ε

In other words, the coefficient of the indicator variable represents the difference in intercepts for the regression lines for the two types of colleges.

What are the hypotheses?

H0: β1 = β2 = β3 = 0

Ha: The coefficients are not all zero

Here is part of the SPSS analysis:

ANOVA (Dependent Variable: y; Predictors: (Constant), x3, x1, x2)

| Model | Sum of Squares | df | Mean Square | F | Sig. |
|---|---|---|---|---|---|
| Regression | .795 | 3 | .265 | 37.164 | .000 |
| Residual | .128 | 18 | .007 | | |
| Total | .923 | 21 | | | |

Coefficients (Dependent Variable: y). B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient.

| Model | B | Std. Error | Beta | t | Sig. |
|---|---|---|---|---|---|
| (Constant) | -.391 | .198 | | -1.977 | .064 |
| x1 | .001 | .000 | .608 | 3.305 | .004 |
| x2 | 6.969E-6 | .000 | .297 | 1.547 | .139 |
| x3 | .125 | .059 | .209 | 2.102 | .050 |


What is your conclusion?

Since F is large (37.164) and the p-value is close to 0 (reported as .000), the null hypothesis is rejected.

We can conclude that there is a linear relationship between the 6-year graduation rate and at least one of the predictors: the median SAT score, the student-related expense per full-time student, and whether the college is single-sex or coeducational.
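The entries in the ANOVA table are linked by simple arithmetic, which the following sketch reproduces from the rounded sums of squares. The variable names are illustrative. Because the inputs are rounded to three decimals, the computed F (about 37.3) differs slightly from the 37.164 that SPSS reports. The R² value is not shown in the output above; it is computed here from the same sums of squares.

```python
# Values copied from the ANOVA table (rounded as printed by SPSS).
ss_reg, df_reg = 0.795, 3    # Regression
ss_res, df_res = 0.128, 18   # Residual

ms_reg = ss_reg / df_reg     # mean square for regression
ms_res = ss_res / df_res     # mean square for residual
f_stat = ms_reg / ms_res     # F = MS(Regression) / MS(Residual)

# Proportion of variability in y explained by the model.
r_squared = ss_reg / (ss_reg + ss_res)

print(round(ms_reg, 3), round(ms_res, 3), round(f_stat, 2), round(r_squared, 3))
```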


What is the regression equation?

y = .001x1 + .00000697x2 + .125x3 – .391


For single-sex colleges:

y = .001x1 + .00000697x2 + .125(1) – .391

y = .001x1 + .00000697x2 – .266


For coed colleges:

y = .001x1 + .00000697x2 – .391


What is the meaning of the coefficient β3?

We can interpret the value .125 as the “correction” we would make to the predicted graduation rate to incorporate the difference associated with having only male or only female students.


We can also interpret the value .125 as the difference in intercepts for the two different types of colleges: at the same SAT score and expense, the predicted graduation rate is .125 higher for single-sex colleges.
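The role of the indicator coefficient can be seen by evaluating the fitted equation for both types of college at the same SAT score and expense. In the sketch below the coefficients are the rounded values from the SPSS output, while the SAT score of 1100 and the expense of 9000 dollars are hypothetical inputs chosen only for illustration. Since the SAT coefficient is rounded to .001, the predicted rates themselves are rough; the gap between them is what matters.

```python
B0, B_SAT, B_EXPENSE, B_SINGLE_SEX = -0.391, 0.001, 6.969e-6, 0.125

def predicted_rate(sat, expense, single_sex):
    """Predicted 6-year graduation rate from the rounded coefficients."""
    x3 = 1 if single_sex else 0  # indicator: 1 = single-sex, 0 = coed
    return B0 + B_SAT * sat + B_EXPENSE * expense + B_SINGLE_SEX * x3

sat, expense = 1100, 9000  # hypothetical college, for illustration only
rate_single = predicted_rate(sat, expense, single_sex=True)
rate_coed = predicted_rate(sat, expense, single_sex=False)

# The gap equals the coefficient of x3 whatever SAT and expense are.
gap = rate_single - rate_coed
single_sex_intercept = B0 + B_SINGLE_SEX  # -.391 + .125 = -.266
print(round(gap, 3), round(single_sex_intercept, 3))
```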


Interaction and Collinearity

If the change in the mean y-value associated with a 1-unit increase in one predictor variable depends on the value of a second predictor variable, there is interaction between the two predictor variables. If we represent the variables as x1 and x2, the interaction can be modeled by including their product, x1x2, as a predictor variable.


The regression model for two predictor variables would now include a cross-product term:

Y = β0 + β1x1 + β2x2 + β3x1x2 + ε

where

β1 + β3x2 represents the change in the mean of Y for every one-unit increase in x1, keeping x2 fixed

β2 + β3x1 represents the change in the mean of Y for every one-unit increase in x2, keeping x1 fixed

If you find that there is a linear association, be sure to check whether the coefficient of the interaction term is significant.
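A small numerical sketch may help show what the cross-product term does. The coefficient values below are invented purely for illustration (they do not come from any data set in these slides); the point is that the change in the mean of Y for a one-unit increase in x1 equals β1 + β3x2 and therefore depends on x2.

```python
# Hypothetical coefficients, chosen only to illustrate interaction.
b0, b1, b2, b3 = 1.0, 2.0, 0.5, 0.3

def mean_y(x1, x2):
    """Mean response for the model with a cross-product term."""
    return b0 + b1 * x1 + b2 * x2 + b3 * x1 * x2

def effect_of_x1(x2):
    """Change in mean Y when x1 increases by one unit, with x2 held fixed."""
    return mean_y(1, x2) - mean_y(0, x2)

# With x2 = 0 the effect is b1; with x2 = 10 it is b1 + b3 * 10.
print(effect_of_x1(0), effect_of_x1(10))
```

If b3 were 0, `effect_of_x1` would return the same value for every x2, which is the no-interaction case.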


We determine collinearity by examining a correlation matrix:

What is the correlation between

Pct BF and Height? -.029 Is this value significant? No; p = .322

Pct BF and Waist? .824 Is this value significant? Yes; p < .001 (reported by SPSS as .000)

Height and Waist? .187 Is this value significant? Yes; p = .002

It is important to note that this information only refers to the pair of variables in question, without regard to the influences of other variables.

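Correlations

| Statistic | Variable | Height | Waist |
|---|---|---|---|
| Pearson Correlation | Pct BF | -.029 | .824 |
| | Height | 1.000 | .187 |
| | Waist | .187 | 1.000 |
| Sig. (1-tailed) | Pct BF | .322 | .000 |
| | Height | . | .002 |
| | Waist | .002 | . |
| N | Pct BF | 250 | 250 |
| | Height | 250 | 250 |
| | Waist | 250 | 250 |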


Another way to assess collinearity:

VIF is the Variance Inflation Factor, which indicates whether a predictor has a strong linear relationship with the other predictors. There is reason for concern if the largest VIF is greater than 5.

The Tolerance statistic is the reciprocal of the VIF. There is a serious problem if this value is less than .2.

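Coefficients (Dependent Variable: Pct BF). B and Std. Error are the unstandardized coefficients; Beta is the standardized coefficient; Tolerance and VIF are the collinearity statistics.

| Model | B | Std. Error | Beta | t | Sig. | Tolerance | VIF |
|---|---|---|---|---|---|---|---|
| (Constant) | -3.110 | 7.687 | | -.405 | .686 | | |
| Waist | 1.773 | .072 | .859 | 24.768 | .000 | .965 | 1.036 |
| Height | -.601 | .110 | -.190 | -5.470 | .000 | .965 | 1.036 |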
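The Tolerance and VIF values in the table can be reproduced from the Height and Waist correlation of .187 reported in the correlation matrix. The sketch below relies on a fact that holds when there are exactly two predictors: the R² from regressing one predictor on the other is the square of their correlation, so Tolerance = 1 – r² and VIF = 1 / Tolerance. The variable names are illustrative.

```python
r_height_waist = 0.187            # correlation from the correlation matrix

r_squared = r_height_waist ** 2   # R^2 of one predictor regressed on the other
tolerance = 1 - r_squared         # Tolerance = 1 - R^2
vif = 1 / tolerance               # VIF is the reciprocal of Tolerance

print(round(tolerance, 3), round(vif, 3))  # 0.965 1.036

# Rules of thumb stated in the slides.
concern = vif > 5 or tolerance < 0.2
print(concern)  # False: no sign of problematic collinearity here
```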