
Page 1: Multicollinearity

Multicollinearity

Page 2: Multicollinearity

Overview

• This set of slides
  – What is multicollinearity?
  – What are its effects on regression analyses?

• Next set of slides
  – How to detect multicollinearity?
  – How to reduce some types of multicollinearity?

Page 3: Multicollinearity

Multicollinearity

• Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated.

• Regression analyses most often take place on data obtained from observational studies.

• In observational studies, multicollinearity happens more often than not.
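The slides use Minitab throughout; as a rough Python counterpart, a pairwise correlation matrix of the predictors is the quickest first screen. A minimal sketch on simulated data (not from these slides):

import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
n = 100
x1 = rng.normal(size=n)
x2 = 0.9 * x1 + 0.2 * rng.normal(size=n)  # constructed to be highly correlated with x1
x3 = rng.normal(size=n)                   # constructed to be roughly uncorrelated

predictors = pd.DataFrame({"x1": x1, "x2": x2, "x3": x3})
print(predictors.corr().round(3))  # large off-diagonal entries flag collinear pairs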

Page 4: Multicollinearity

Types of multicollinearity

• Structural multicollinearity
  – a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x

• Sample-based multicollinearity
  – a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which you collect the data
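A small sketch of the structural case, assuming only NumPy: a squared term built from an uncentered predictor is nearly collinear with that predictor, and centering before squaring largely removes the artifact.

import numpy as np

x = np.linspace(1, 10, 50)
print(np.corrcoef(x, x**2)[0, 1])    # close to 1: x and x^2 are near-collinear
xc = x - x.mean()
print(np.corrcoef(xc, xc**2)[0, 1])  # near 0: centering breaks the collinearity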

Page 5: Multicollinearity

Example

[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress]

n = 20 hypertensive individuals

p-1 = 6 predictor variables

Page 6: Multicollinearity

Example

           BP     Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465    0.402
Stress   0.164  0.368   0.034  0.018    0.312   0.506

Blood pressure (BP) is the response.
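The table above can be reproduced with pandas; the sketch below assumes the 20 subjects' data sit in a hypothetical whitespace-delimited file bloodpress.txt whose columns carry the names used in these slides.

import pandas as pd

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
cols = ["BP", "Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]
print(df[cols].corr().round(3))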

Page 7: Multicollinearity

What is the effect on regression analyses if the predictors are perfectly uncorrelated?

Page 8: Multicollinearity

Example

x1  x2   y
 2   5  52
 2   5  51
 2   7  49
 2   7  46
 4   5  50
 4   5  48
 4   7  44
 4   7  43

Pearson correlation of x1 and x2 = 0.000
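The eight observations are small enough to key in directly and confirm the zero correlation:

import numpy as np

x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4])
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7])
y = np.array([52, 51, 49, 46, 50, 48, 44, 43])
print(np.corrcoef(x1, x2)[0, 1])  # exactly 0.0: a balanced two-level design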

Page 9: Multicollinearity

Example

[Scatterplot matrix of y, x1, and x2]

Page 10: Multicollinearity

Regress y on x1

The regression equation is
y = 48.8 - 0.63 x1

Predictor     Coef  SE Coef      T      P
Constant    48.750    4.025  12.11  0.000
x1          -0.625    1.273  -0.49  0.641

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1   3.13   3.13  0.24  0.641
Error        6  77.75  12.96
Total        7  80.88

Page 11: Multicollinearity

Regress y on x2

The regression equation is
y = 55.1 - 1.38 x2

Predictor     Coef  SE Coef     T      P
Constant    55.125    7.119  7.74  0.000
x2          -1.375    1.170 -1.17  0.285

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1  15.13  15.13  1.38  0.285
Error        6  65.75  10.96
Total        7  80.88

Page 12: Multicollinearity

Regress y on x1 and x2

The regression equation is
y = 57.0 - 0.63 x1 - 1.38 x2

Predictor     Coef  SE Coef     T      P
Constant    57.000    8.486  6.72  0.001
x1          -0.625    1.251 -0.50  0.639
x2          -1.375    1.251 -1.10  0.322

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x1       1    3.13
x2       1   15.13

Page 13: Multicollinearity

Regress y on x2 and x1

The regression equation is
y = 57.0 - 1.38 x2 - 0.63 x1

Predictor     Coef  SE Coef     T      P
Constant    57.000    8.486  6.72  0.001
x2          -1.375    1.251 -1.10  0.322
x1          -0.625    1.251 -0.50  0.639

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x2       1   15.13
x1       1    3.13
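A sketch reproducing Pages 12 and 13 with statsmodels, whose Type I ANOVA table gives the sequential sums of squares: with these orthogonal predictors, neither the slopes nor the Seq SS depend on the order of entry.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.DataFrame({
    "x1": [2, 2, 2, 2, 4, 4, 4, 4],
    "x2": [5, 5, 7, 7, 5, 5, 7, 7],
    "y": [52, 51, 49, 46, 50, 48, 44, 43],
})
for formula in ("y ~ x1 + x2", "y ~ x2 + x1"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.params.round(3).to_dict())      # same slopes either way
    print(sm.stats.anova_lm(fit)["sum_sq"].round(2))   # Seq SS 3.13 and 15.13 either way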

Page 14: Multicollinearity

Summary of results

Model              b1      se(b1)  b2      se(b2)  Seq SS
x1 only            -0.625  1.273   ---     ---     SSR(X1) = 3.13
x2 only            ---     ---     -1.375  1.170   SSR(X2) = 15.13
x1, x2 (in order)  -0.625  1.251   -1.375  1.251   SSR(X2|X1) = 15.13
x2, x1 (in order)  -0.625  1.251   -1.375  1.251   SSR(X1|X2) = 3.13

Page 15: Multicollinearity

If predictors are perfectly uncorrelated, then…

• You get the same slope estimates regardless of the first-order regression model used.

• That is, the effect on the response ascribed to a predictor doesn’t depend on the other predictors in the model.

Page 16: Multicollinearity

If predictors are perfectly uncorrelated, then…

• The sum of squares SSR(X1) is the same as the sequential sum of squares SSR(X1|X2).

• The sum of squares SSR(X2) is the same as the sequential sum of squares SSR(X2|X1).

• That is, the marginal contribution of one predictor variable in reducing the error sum of squares doesn’t depend on the other predictors in the model.

Page 17: Multicollinearity

Do we see similar results for “real data” with nearly uncorrelated predictors?

Page 18: Multicollinearity

Example

           BP     Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465    0.402
Stress   0.164  0.368   0.034  0.018    0.312   0.506

Page 19: Multicollinearity

Example

[Scatterplot matrix of BP, BSA, and Stress]

Page 20: Multicollinearity

Regress BP on x1 = Stress

The regression equation is
BP = 113 + 0.0240 Stress

Predictor      Coef  SE Coef      T      P
Constant    112.720    2.193  51.39  0.000
Stress      0.02399  0.03404   0.70  0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS     MS     F      P
Regression   1   15.04  15.04  0.50  0.490
Error       18  544.96  30.28
Total       19  560.00

Page 21: Multicollinearity

Regress BP on x2 = BSA

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

Page 22: Multicollinearity

Regress BP on x1 = Stress and x2 = BSA

The regression equation is
BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor      Coef  SE Coef     T      P
Constant     44.245    9.261  4.78  0.000
Stress      0.02166  0.01697  1.28  0.219
BSA          34.334    4.611  7.45  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
Stress   1   15.04
BSA      1  417.07

Page 23: Multicollinearity

Regress BP on x2 = BSA and x1 = Stress

The regression equation is
BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor      Coef  SE Coef     T      P
Constant     44.245    9.261  4.78  0.000
BSA          34.334    4.611  7.45  0.000
Stress      0.02166  0.01697  1.28  0.219

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Stress   1   12.26

Page 24: Multicollinearity

Summary of results

Model              b1      se(b1)  b2      se(b2)  Seq SS
x1 only            0.0240  0.0340  ---     ---     SSR(X1) = 15.04
x2 only            ---     ---     34.443  4.690   SSR(X2) = 419.86
x1, x2 (in order)  0.0217  0.0170  34.334  4.611   SSR(X2|X1) = 417.07
x2, x1 (in order)  0.0217  0.0170  34.334  4.611   SSR(X1|X2) = 12.26
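The same order-swapping check on the real data, again assuming the hypothetical bloodpress.txt file from earlier; with the nearly uncorrelated Stress and BSA, the slopes and sequential sums of squares barely move.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
for formula in ("BP ~ Stress + BSA", "BP ~ BSA + Stress"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.params.round(4).to_dict())
    print(sm.stats.anova_lm(fit)["sum_sq"].round(2))  # Seq SS shift only slightly with order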

Page 25: Multicollinearity

If predictors are nearly uncorrelated, then…

• You get similar slope estimates regardless of the first-order regression model used.

• The sum of squares SSR(X1) is similar to the sequential sum of squares SSR(X1|X2).

• The sum of squares SSR(X2) is similar to the sequential sum of squares SSR(X2|X1).

Page 26: Multicollinearity

What happens if the predictor variables are highly correlated?

Page 27: Multicollinearity

Example

           BP     Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465    0.402
Stress   0.164  0.368   0.034  0.018    0.312   0.506

Page 28: Multicollinearity

Example

[Scatterplot matrix of BP, BSA, and Weight]

Page 29: Multicollinearity

Regress BP on x1 = Weight

The regression equation is
BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00

Page 30: Multicollinearity

Regress BP on x2 = BSA

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

Page 31: Multicollinearity

Regress BP on x1 = Weight and x2 = BSA

The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor     Coef  SE Coef     T      P
Constant     5.653    9.392  0.60  0.555
Weight      1.0387   0.1927  5.39  0.000
BSA          5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81

Page 32: Multicollinearity

Regress BP on x2 = BSA and x1 = Weight

The regression equation is
BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor     Coef  SE Coef     T      P
Constant     5.653    9.392  0.60  0.555
BSA          5.831    6.063  0.96  0.350
Weight      1.0387   0.1927  5.39  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Weight   1   88.43

Page 33: Multicollinearity

Effect #1 of multicollinearity

When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model.

Variables in model    b1     b2
x1                    1.20   ----
x2                    ----   34.4
x1, x2                1.04   5.83
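A sketch of Effect #1 under the same bloodpress.txt assumption: the Weight slope drops from about 1.20 to 1.04 once the correlated BSA joins the model.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
for formula in ("BP ~ Weight", "BP ~ BSA", "BP ~ Weight + BSA"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, fit.params.round(3).to_dict())  # slopes shift as predictors are added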

Page 34: Multicollinearity

Even correlated predictors not in the model can have an impact!

• Regression of territory sales on territory population, per capita income, etc.

• Against expectation, the coefficient of territory population was estimated to be negative.

• The competitor's market penetration, which was strongly positively correlated with territory population, was not included in the model.

• But the competitor kept sales down in territories with large populations.

Page 35: Multicollinearity

Effect #2 of multicollinearity

When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.

Variables in model    se(b1)  se(b2)
x1                    0.093   ----
x2                    ----    4.69
x1, x2                0.193   6.06
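Effect #2 is commonly quantified with variance inflation factors, which these slides have not yet introduced; statsmodels provides them directly, so here is a sketch under the same data-file assumption (a VIF well above 1 signals inflated standard errors).

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
X = sm.add_constant(df[["Weight", "BSA"]])
for i, name in enumerate(X.columns):
    if name != "const":  # the intercept's VIF is not meaningful
        print(name, round(variance_inflation_factor(X.values, i), 2))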

Page 36: Multicollinearity

Why not effects #1 and #2?

[3D scatterplot of y vs x1 vs x2]

Page 37: Multicollinearity

Why not effects #1 and #2?

[3D scatterplot of BP vs BSA vs Stress]

Page 38: Multicollinearity

Why effects #1 and #2?

[3D scatterplot of BP vs BSA vs Weight]

Page 39: Multicollinearity

Effect #3 of multicollinearity

When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies, depending on which other variables are already in the model.

SSR(X1) = 505.47    SSR(X1|X2) = 88.43
SSR(X2) = 419.86    SSR(X2|X1) = 2.81
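Effect #3 in code, same data-file assumption: with the correlated Weight and BSA, the sequential (Type I) sums of squares now swing sharply with entry order.

import pandas as pd
import statsmodels.api as sm
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
for formula in ("BP ~ Weight + BSA", "BP ~ BSA + Weight"):
    fit = smf.ols(formula, data=df).fit()
    print(formula)
    print(sm.stats.anova_lm(fit)["sum_sq"].round(2))  # e.g. BSA: 419.86 alone, 2.81 after Weight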

Page 40: Multicollinearity

What is the effect on estimating the mean response or predicting a new response?

[Scatterplot of Weight versus BSA, with the new observation at (BSA, Weight) = (2, 92) marked]

Page 41: Multicollinearity

Effect #4 of multicollinearity on estimating the mean or predicting Y

High multicollinearity among the predictor variables does not prevent good, precise predictions of the response (within the scope of the model).

Weight   Fit    SE Fit   95.0% CI            95.0% PI
92       112.7  0.402    (111.85, 113.54)    (108.94, 116.44)

BSA   Fit    SE Fit   95.0% CI            95.0% PI
2     114.1  0.624    (112.76, 115.38)    (108.06, 120.08)

BSA  Weight   Fit    SE Fit   95.0% CI            95.0% PI
2    92       112.8  0.448    (111.93, 113.83)    (109.08, 116.68)
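A sketch of these predictions with statsmodels, same data-file assumption: get_prediction reports both the confidence interval for the mean response (the mean_ci_* columns) and the prediction interval for a new response (the obs_ci_* columns).

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
fit = smf.ols("BP ~ BSA + Weight", data=df).fit()
new = pd.DataFrame({"BSA": [2.0], "Weight": [92.0]})
print(fit.get_prediction(new).summary_frame(alpha=0.05).round(2))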

Page 42: Multicollinearity

What is the effect on tests of the individual slopes?

The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00

Page 43: Multicollinearity

What is the effect on tests of the individual slopes?

The regression equation is
BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00

Page 44: Multicollinearity

What is the effect on tests of the individual slopes?

The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor     Coef  SE Coef     T      P
Constant     5.653    9.392  0.60  0.555
Weight      1.0387   0.1927  5.39  0.000
BSA          5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81

Page 45: Multicollinearity

Effect #5 of multicollinearity on slope tests

When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model.

Variables in model    b2     se(b2)  t     P-value
x2                    34.4   4.7     7.34  0.000
x1, x2                5.83   6.1     0.96  0.350
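Effect #5 in code, same data-file assumption: the t-test for the BSA slope flips from highly significant to insignificant once Weight enters the model.

import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("bloodpress.txt", sep=r"\s+")  # hypothetical local file
for formula in ("BP ~ BSA", "BP ~ BSA + Weight"):
    fit = smf.ols(formula, data=df).fit()
    print(formula, "p-value for BSA slope:", round(fit.pvalues["BSA"], 3))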

Page 46: Multicollinearity

The major impacts on model use

• In the presence of multicollinearity, it is still okay to use an estimated regression model to predict y or to estimate μY within the scope of the model.

• In the presence of multicollinearity, however, we can no longer interpret a slope coefficient as the change in the mean response for each additional unit increase in xk when all the other predictors are held constant.