Multicollinearity
Overview
• This set of slides
  – What is multicollinearity?
  – What are its effects on regression analyses?
• Next set of slides
  – How to detect multicollinearity?
  – How to reduce some types of multicollinearity?
Multicollinearity
• Multicollinearity exists when two or more of the predictors in a regression model are moderately or highly correlated.
• Regression analyses most often take place on data obtained from observational studies.
• In observational studies, multicollinearity happens more often than not.
Types of multicollinearity
• Structural multicollinearity – a mathematical artifact caused by creating new predictors from other predictors, such as creating the predictor x² from the predictor x (see the sketch below).
• Sample-based multicollinearity – a result of a poorly designed experiment, reliance on purely observational data, or the inability to manipulate the system on which the data are collected.
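As a quick illustration of the structural kind, a predictor and its square are almost always strongly correlated. Here is a minimal Python sketch with made-up data (not the example used later in these slides):

```python
# Structural multicollinearity: x and x**2 are correlated by construction.
import numpy as np

x = np.linspace(1, 10, 20)   # an illustrative positive predictor
x_sq = x ** 2                # a new predictor created from x itself

# The correlation is close to 1, so x and x_sq are highly collinear.
print(np.corrcoef(x, x_sq)[0, 1])
```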
Example
[Scatterplot matrix of BP, Age, Weight, BSA, Duration, Pulse, and Stress]
n = 20 hypertensive individuals
p-1 = 6 predictor variables
Example
```
            BP    Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465     0.402
Stress   0.164  0.368   0.034  0.018     0.312  0.506
```
Blood pressure (BP) is the response.
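A table like this can be reproduced in a few lines of Python. The sketch below assumes the blood-pressure data sit in a CSV file with the column names shown; the file name is an assumption, not something given in the slides:

```python
# Sketch: pairwise Pearson correlations among BP and the six predictors.
import pandas as pd

bp = pd.read_csv("bloodpress.csv")   # assumed file name
cols = ["BP", "Age", "Weight", "BSA", "Duration", "Pulse", "Stress"]
print(bp[cols].corr().round(3))      # its lower triangle matches the table above
```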
What is the effect on regression analyses if the predictors are perfectly uncorrelated?
Example
```
x1  x2   y
 2   5  52
 2   5  51
 2   7  49
 2   7  46
 4   5  50
 4   5  48
 4   7  44
 4   7  43
```
Pearson correlation of x1 and x2 = 0.000
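The three fits on the following slides can be reproduced with statsmodels. The sketch below uses the eight observations as transcribed above; if the transcription differs from the original dataset, the printed numbers will differ from the slide output, but the key property holds regardless: with r(x1, x2) = 0, each slope estimate is the same whether or not the other predictor is in the model.

```python
# Fit y on x1, on x2, and on both; with perfectly uncorrelated predictors
# the slope estimates agree across all three models.
import numpy as np
import statsmodels.api as sm

x1 = np.array([2, 2, 2, 2, 4, 4, 4, 4], dtype=float)
x2 = np.array([5, 5, 7, 7, 5, 5, 7, 7], dtype=float)
y  = np.array([52, 51, 49, 46, 50, 48, 44, 43], dtype=float)

print(np.corrcoef(x1, x2)[0, 1])   # exactly 0.0: a balanced design

fit1  = sm.OLS(y, sm.add_constant(x1)).fit()                         # y on x1
fit2  = sm.OLS(y, sm.add_constant(x2)).fit()                         # y on x2
fit12 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()  # y on x1 and x2

print(fit1.params[1], fit12.params[1])   # b1: identical in both models
print(fit2.params[1], fit12.params[2])   # b2: identical in both models
```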
Example
[Scatterplot matrix of y, x1, and x2]
Regress y on x1

```
The regression equation is
y = 48.8 - 0.63 x1

Predictor     Coef  SE Coef      T      P
Constant    48.750    4.025  12.11  0.000
x1          -0.625    1.273  -0.49  0.641

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1   3.13   3.13  0.24  0.641
Error        6  77.75  12.96
Total        7  80.88
```
Regress y on x2
```
The regression equation is
y = 55.1 - 1.38 x2

Predictor     Coef  SE Coef      T      P
Constant    55.125    7.119   7.74  0.000
x2          -1.375    1.170  -1.17  0.285

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   1  15.13  15.13  1.38  0.285
Error        6  65.75  10.96
Total        7  80.88
```
Regress y on x1 and x2
```
The regression equation is
y = 57.0 - 0.63 x1 - 1.38 x2

Predictor     Coef  SE Coef      T      P
Constant    57.000    8.486   6.72  0.001
x1          -0.625    1.251  -0.50  0.639
x2          -1.375    1.251  -1.10  0.322

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x1       1    3.13
x2       1   15.13
```
Regress y on x2 and x1

```
The regression equation is
y = 57.0 - 1.38 x2 - 0.63 x1

Predictor     Coef  SE Coef      T      P
Constant    57.000    8.486   6.72  0.001
x2          -1.375    1.251  -1.10  0.322
x1          -0.625    1.251  -0.50  0.639

Analysis of Variance
Source      DF     SS     MS     F      P
Regression   2  18.25   9.13  0.73  0.528
Error        5  62.63  12.53
Total        7  80.88

Source  DF  Seq SS
x2       1   15.13
x1       1    3.13
```
Summary of results

| Model | b1 | se(b1) | b2 | se(b2) | Seq SS |
|---|---|---|---|---|---|
| x1 only | -0.625 | 1.273 | --- | --- | SSR(X1) = 3.13 |
| x2 only | --- | --- | -1.375 | 1.170 | SSR(X2) = 15.13 |
| x1, x2 (in order) | -0.625 | 1.251 | -1.375 | 1.251 | SSR(X2\|X1) = 15.13 |
| x2, x1 (in order) | -0.625 | 1.251 | -1.375 | 1.251 | SSR(X1\|X2) = 3.13 |
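The order-invariance of the sequential sums of squares can be checked directly. Here is a sketch using statsmodels' formula interface, where a Type I ANOVA table reports the sequential SS (same transcribed data and caveat as in the earlier sketch):

```python
# With uncorrelated predictors, the sequential (Type I) sums of squares do
# not depend on the order in which the predictors enter the model.
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

df = pd.DataFrame({
    "x1": [2, 2, 2, 2, 4, 4, 4, 4],
    "x2": [5, 5, 7, 7, 5, 5, 7, 7],
    "y":  [52, 51, 49, 46, 50, 48, 44, 43],
})

print(anova_lm(smf.ols("y ~ x1 + x2", df).fit(), typ=1))  # SSR(X1), then SSR(X2|X1)
print(anova_lm(smf.ols("y ~ x2 + x1", df).fit(), typ=1))  # SSR(X2), then SSR(X1|X2)
```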
If predictors are perfectly uncorrelated, then…
• You get the same slope estimates regardless of the first-order regression model used.
• That is, the effect on the response ascribed to a predictor doesn’t depend on the other predictors in the model.
If predictors are perfectly uncorrelated, then…
• The sum of squares SSR(X1) is the same as the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is the same as the sequential sum of squares SSR(X2|X1).
• That is, the marginal contribution of one predictor variable in reducing the error sum of squares doesn’t depend on the other predictors in the model.
Do we see similar results for “real data” with nearly uncorrelated predictors?
Example
```
            BP    Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465     0.402
Stress   0.164  0.368   0.034  0.018     0.312  0.506
```
Example
[Scatterplot matrix of BP, BSA, and Stress]
Regress BP on x1 = Stress
```
The regression equation is
BP = 113 + 0.0240 Stress

Predictor      Coef  SE Coef      T      P
Constant    112.720    2.193  51.39  0.000
Stress      0.02399  0.03404   0.70  0.490

S = 5.502   R-Sq = 2.7%   R-Sq(adj) = 0.0%

Analysis of Variance
Source      DF      SS     MS     F      P
Regression   1   15.04  15.04  0.50  0.490
Error       18  544.96  30.28
Total       19  560.00
```
Regress BP on x2 = BSA
```
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
```
Regress BP on x1 = Stress and x2 = BSA
```
The regression equation is
BP = 44.2 + 0.0217 Stress + 34.3 BSA

Predictor     Coef  SE Coef     T      P
Constant    44.245    9.261  4.78  0.000
Stress     0.02166  0.01697  1.28  0.219
BSA         34.334    4.611  7.45  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
Stress   1   15.04
BSA      1  417.07
```
Regress BP on x2 = BSA and x1 = Stress
```
The regression equation is
BP = 44.2 + 34.3 BSA + 0.0217 Stress

Predictor     Coef  SE Coef     T      P
Constant    44.245    9.261  4.78  0.000
BSA         34.334    4.611  7.45  0.000
Stress     0.02166  0.01697  1.28  0.219

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  432.12  216.06  28.72  0.000
Error       17  127.88    7.52
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Stress   1   12.26
```
Summary of results

| Model | b1 | se(b1) | b2 | se(b2) | Seq SS |
|---|---|---|---|---|---|
| x1 only | 0.0240 | 0.0340 | --- | --- | SSR(X1) = 15.04 |
| x2 only | --- | --- | 34.443 | 4.690 | SSR(X2) = 419.86 |
| x1, x2 (in order) | 0.0217 | 0.0170 | 34.334 | 4.611 | SSR(X2\|X1) = 417.07 |
| x2, x1 (in order) | 0.0217 | 0.0170 | 34.334 | 4.611 | SSR(X1\|X2) = 12.26 |
If predictors are nearly uncorrelated, then…
• You get similar slope estimates regardless of the first-order regression model used.
• The sum of squares SSR(X1) is similar to the sequential sum of squares SSR(X1|X2).
• The sum of squares SSR(X2) is similar to the sequential sum of squares SSR(X2|X1).
What happens if the predictor variables are highly correlated?
Example
```
            BP    Age  Weight    BSA  Duration  Pulse
Age      0.659
Weight   0.950  0.407
BSA      0.866  0.378   0.875
Duration 0.293  0.344   0.201  0.131
Pulse    0.721  0.619   0.659  0.465     0.402
Stress   0.164  0.368   0.034  0.018     0.312  0.506
```
Example
[Scatterplot matrix of BP, BSA, and Weight]
Regress BP on x1 = Weight
```
The regression equation is
BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00
```
Regress BP on x2 = BSA
```
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
```
Regress BP on x1 = Weight and x2 = BSA
```
The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
Weight     1.0387   0.1927  5.39  0.000
BSA         5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81
```
Regress BP on x2 = BSA and x1 = Weight
```
The regression equation is
BP = 5.65 + 5.83 BSA + 1.04 Weight

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
BSA         5.831    6.063  0.96  0.350
Weight     1.0387   0.1927  5.39  0.000

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
BSA      1  419.86
Weight   1   88.43
```
Effect #1 of multicollinearity
When predictor variables are correlated, the regression coefficient of any one variable depends on which other predictor variables are included in the model.
| Variables in model | b1 | b2 |
|---|---|---|
| x1 | 1.20 | ---- |
| x2 | ---- | 34.4 |
| x1, x2 | 1.04 | 5.83 |
Even correlated predictors not in the model can have an impact!
• Regression of territory sales on territory population, per capita income, etc.
• Against expectation, the coefficient of territory population turned out to be negative.
• The competitor's market penetration, which was strongly positively correlated with territory population, was not included in the model.
• But the competitor kept sales down in territories with large populations (see the simulation sketch below).
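This sign flip is ordinary omitted-variable bias, and it is easy to reproduce. A simulation sketch with hypothetical numbers (not the actual sales data from the example):

```python
# Omitted-variable sketch: sales rise with population but fall with the
# competitor's penetration, which is itself positively tied to population.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)
n = 200
population = rng.normal(size=n)
competitor = 0.9 * population + 0.3 * rng.normal(size=n)   # strongly correlated
sales = 1.0 * population - 2.0 * competitor + 0.5 * rng.normal(size=n)

# Leaving the competitor out flips the population coefficient: its expected
# value is 1.0 - 2.0 * 0.9 = -0.8, negative "against expectation".
short = sm.OLS(sales, sm.add_constant(population)).fit()
print(short.params[1])
```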
Effect #2 of multicollinearity
When predictor variables are correlated, the precision of the estimated regression coefficients decreases as more predictor variables are added to the model.
| Variables in model | se(b1) | se(b2) |
|---|---|---|
| x1 | 0.093 | ---- |
| x2 | ---- | 4.69 |
| x1, x2 | 0.193 | 6.06 |
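Effects #1 and #2 are easy to see together in a small simulation. The sketch below uses made-up correlated predictors, since the raw blood-pressure data are not reproduced in this transcript:

```python
# With a nearly collinear second predictor, b1 both shifts (Effect #1) and
# loses precision (Effect #2) once x2 is added to the model.
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 20
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)           # x2 nearly collinear with x1
y = 1.0 * x1 + 0.5 * x2 + 0.5 * rng.normal(size=n)

fit1  = sm.OLS(y, sm.add_constant(x1)).fit()
fit12 = sm.OLS(y, sm.add_constant(np.column_stack([x1, x2]))).fit()

print(fit1.params[1],  fit1.bse[1])    # b1 and se(b1) with x1 alone
print(fit12.params[1], fit12.bse[1])   # b1 changes and se(b1) inflates
```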
Why not effects #1 and #2?
[3D scatterplot of y vs x1 vs x2]
Why not effects #1 and #2?
[3D scatterplot of BP vs BSA vs Stress]
Why effects #1 and #2?
[3D scatterplot of BP vs BSA vs Weight]
Effect #3 of multicollinearity
When predictor variables are correlated, the marginal contribution of any one predictor variable in reducing the error sum of squares varies, depending on which other variables are already in the model.
SSR(X1) = 505.47, but SSR(X1|X2) = 88.43
SSR(X2) = 419.86, but SSR(X2|X1) = 2.81
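The same order dependence shows up in a sequential (Type I) ANOVA. A simulation sketch with highly correlated predictors (made-up data, same caveat as above):

```python
# With correlated predictors, whichever variable enters first soaks up most
# of the regression sum of squares; the one entered second adds little.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from statsmodels.stats.anova import anova_lm

rng = np.random.default_rng(2)
n = 20
x1 = rng.normal(size=n)
x2 = 0.95 * x1 + 0.1 * rng.normal(size=n)
df = pd.DataFrame({"x1": x1, "x2": x2,
                   "y": x1 + 0.5 * x2 + 0.5 * rng.normal(size=n)})

print(anova_lm(smf.ols("y ~ x1 + x2", df).fit(), typ=1))  # SSR(x2|x1) is small
print(anova_lm(smf.ols("y ~ x2 + x1", df).fit(), typ=1))  # SSR(x1|x2) is small
```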
What is the effect on estimating the mean response or predicting a new response?

[Scatterplot of Weight versus BSA, with the point (BSA, Weight) = (2, 92) marked]
Effect #4 of multicollinearity on estimating mean or predicting Y

```
Weight    Fit  SE Fit  95.0% CI            95.0% PI
92      112.7   0.402  (111.85, 113.54)    (108.94, 116.44)

BSA       Fit  SE Fit  95.0% CI            95.0% PI
2       114.1   0.624  (112.76, 115.38)    (108.06, 120.08)

BSA  Weight    Fit  SE Fit  95.0% CI            95.0% PI
2    92      112.8   0.448  (111.93, 113.83)    (109.08, 116.68)
```

High multicollinearity among predictor variables does not prevent good, precise predictions of the response (within the scope of the model).
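For completeness, here is how such fitted values, confidence intervals, and prediction intervals could be obtained with statsmodels; the file name and DataFrame layout are assumptions, since the slides do not reproduce the raw data:

```python
# Predicting BP at (BSA, Weight) = (2, 92): even though Weight and BSA are
# highly correlated, the prediction itself remains precise.
import pandas as pd
import statsmodels.formula.api as smf

bp = pd.read_csv("bloodpress.csv")   # assumed file name
fit = smf.ols("BP ~ BSA + Weight", bp).fit()

new = pd.DataFrame({"BSA": [2.0], "Weight": [92.0]})
pred = fit.get_prediction(new)
# summary_frame reports the fitted mean, its 95% CI, and the 95% PI.
print(pred.summary_frame(alpha=0.05))
```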
What is the effect on tests of individual slopes?

```
The regression equation is
BP = 45.2 + 34.4 BSA

Predictor     Coef  SE Coef     T      P
Constant    45.183    9.392  4.81  0.000
BSA         34.443    4.690  7.34  0.000

S = 2.790   R-Sq = 75.0%   R-Sq(adj) = 73.6%

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   1  419.86  419.86  53.93  0.000
Error       18  140.14    7.79
Total       19  560.00
```
What is the effect on tests of individual slopes?

```
The regression equation is
BP = 2.21 + 1.20 Weight

Predictor      Coef  SE Coef      T      P
Constant      2.205    8.663   0.25  0.802
Weight      1.20093  0.09297  12.92  0.000

S = 1.740   R-Sq = 90.3%   R-Sq(adj) = 89.7%

Analysis of Variance
Source      DF      SS      MS       F      P
Regression   1  505.47  505.47  166.86  0.000
Error       18   54.53    3.03
Total       19  560.00
```
What is the effect on tests of individual slopes?
```
The regression equation is
BP = 5.65 + 1.04 Weight + 5.83 BSA

Predictor    Coef  SE Coef     T      P
Constant    5.653    9.392  0.60  0.555
Weight     1.0387   0.1927  5.39  0.000
BSA         5.831    6.063  0.96  0.350

Analysis of Variance
Source      DF      SS      MS      F      P
Regression   2  508.29  254.14  83.54  0.000
Error       17   51.71    3.04
Total       19  560.00

Source  DF  Seq SS
Weight   1  505.47
BSA      1    2.81
```
Effect #5 of multicollinearity on slope tests
When predictor variables are correlated, hypothesis tests for βk = 0 may yield different conclusions depending on which predictor variables are in the model.
| Variables in model | b2 | se(b2) | t | P-value |
|---|---|---|---|---|
| x2 | 34.4 | 4.7 | 7.34 | 0.000 |
| x1, x2 | 5.83 | 6.1 | 0.96 | 0.350 |
The major impacts on model use
• In the presence of multicollinearity, it is okay to use an estimated regression model to predict y or estimate μY within the scope of the model.
• In the presence of multicollinearity, we can no longer interpret a slope coefficient as the change in the mean response for each additional unit increase in xk when all the other predictors are held constant.