outliers and influential data points
![Page 1: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/1.jpg)
Outliers and influential data points
![Page 2: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/2.jpg)
No outliers?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]
![Page 3: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/3.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]
![Page 4: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/4.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70; two fitted lines:]
y = 1.73 + 5.12 x
y = 2.96 + 5.04 x
![Page 5: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/5.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]
![Page 6: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/6.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70; two fitted lines:]
y = 1.73 + 5.12 x
y = 2.47 + 4.93 x
![Page 7: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/7.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]
![Page 8: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/8.jpg)
An outlier? Influential?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70; two fitted lines:]
y = 1.73 + 5.12 x
y = 8.51 + 3.32 x
![Page 9: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/9.jpg)
Impact on regression analyses
• Not every outlier strongly influences the estimated regression function.
• Always determine whether the estimated regression function is unduly influenced by one or a few cases.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.
![Page 10: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/10.jpg)
The hat matrix H
![Page 11: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/11.jpg)
The hat matrix H
The regression model:
$$Y = X\beta + \varepsilon, \qquad E(Y) = X\beta$$
Least squares estimates:
$$b = (X'X)^{-1}X'y$$
Fitted values:
$$\hat{y} = Xb = X(X'X)^{-1}X'y = Hy$$
![Page 12: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/12.jpg)
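$$
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
  = \begin{bmatrix} 8 \\ 15 \\ 10 \\ 7 \end{bmatrix},
\qquad
X = \begin{bmatrix} 1 & x_{11} & x_{12} \\ 1 & x_{21} & x_{22} \\ 1 & x_{31} & x_{32} \\ 1 & x_{41} & x_{42} \end{bmatrix}
  = \begin{bmatrix} 1 & 4.2 & 4 \\ 1 & 6.5 & 6.5 \\ 1 & 3 & 3.5 \\ 1 & 3 & 2.8 \end{bmatrix}
$$

$$
H = X(X'X)^{-1}X' =
\begin{bmatrix}
 0.411 &  0.202 & -0.058 &  0.444 \\
 0.202 &  0.931 &  0.0198 & -0.152 \\
-0.058 &  0.0198 &  0.994 &  0.044 \\
 0.444 & -0.152 &  0.044 &  0.664
\end{bmatrix}
$$

$$
\hat{y} = Hy = H \begin{bmatrix} 8 \\ 15 \\ 10 \\ 7 \end{bmatrix}
        = \begin{bmatrix} 8.85 \\ 14.71 \\ 10.08 \\ 6.36 \end{bmatrix}
$$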
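The four-case example on this slide can be reproduced numerically. The sketch below assumes NumPy is available and uses the X and y values as read from the slide; the variable names are illustrative only.

```python
import numpy as np

# Design matrix (intercept column plus two predictors) and response,
# as read from the slide's worked example: n = 4 cases, p = 3 parameters.
X = np.array([
    [1.0, 4.2, 4.0],
    [1.0, 6.5, 6.5],
    [1.0, 3.0, 3.5],
    [1.0, 3.0, 2.8],
])
y = np.array([8.0, 15.0, 10.0, 7.0])

# H = X (X'X)^{-1} X'. Solving the linear system avoids forming the inverse.
H = X @ np.linalg.solve(X.T @ X, X.T)

# Fitted values are a linear combination of the observed responses.
y_hat = H @ y

print(np.round(H, 3))
print(np.round(y_hat, 2))
print(round(float(np.trace(H)), 4))  # the diagonal sums to p
```

With only four cases and three parameters every leverage is large, which is why the fitted values sit so close to the observed ones.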
![Page 13: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/13.jpg)
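$$
H = \begin{bmatrix}
h_{11} & h_{12} & h_{13} & h_{14} \\
h_{21} & h_{22} & h_{23} & h_{24} \\
h_{31} & h_{32} & h_{33} & h_{34} \\
h_{41} & h_{42} & h_{43} & h_{44}
\end{bmatrix},
\qquad
y = \begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
$$

$$
\hat{y} = Hy =
\begin{bmatrix}
h_{11} & h_{12} & h_{13} & h_{14} \\
h_{21} & h_{22} & h_{23} & h_{24} \\
h_{31} & h_{32} & h_{33} & h_{34} \\
h_{41} & h_{42} & h_{43} & h_{44}
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\
h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\
h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\
h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4
\end{bmatrix}
$$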
![Page 14: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/14.jpg)
Identifying outlying Y values
![Page 15: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/15.jpg)
Identifying outlying Y values
• Residuals
• Standardized residuals– also called internally studentized residuals
• Deleted residuals
• Deleted t residuals– also called studentized deleted residuals– also called externally studentized residuals
![Page 16: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/16.jpg)
Residuals
Ordinary residuals defined for each observation, i = 1, …, n:
$$e_i = y_i - \hat{y}_i$$
Using matrix notation:
$$e = y - \hat{y} = y - X(X'X)^{-1}X'y$$
$$e = y - Hy = (I - H)y$$
![Page 17: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/17.jpg)
Variance of the residuals
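Residual vector:
$$e = y - Hy = (I - H)y$$
Variance matrix:
$$\mathrm{Var}(e) = \sigma^2 (I - H)$$
Variance of the ith residual:
$$\mathrm{Var}(e_i) = \sigma^2 (1 - h_{ii})$$
Estimated standard deviation of the ith residual:
$$s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$$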
![Page 18: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/18.jpg)
Standardized residuals
Standardized residuals defined for each observation, i = 1, …, n:
$$e_i^{*} = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$
Standardized residuals quantify how large the residuals are in standard deviation units.
Standardized residuals larger than 2 or smaller than -2 suggest that the y values are unusual.
![Page 19: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/19.jpg)
An outlying y value?
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]
![Page 20: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/20.jpg)
| x | y | FITS1 | HI1 | s(e) | RESI1 | SRES1 |
|---|---|---|---|---|---|---|
| 0.10000 | -0.0716 | 3.4614 | 0.176297 | 4.27561 | -3.5330 | -0.82635 |
| 0.45401 | 4.1673 | 5.2446 | 0.157454 | 4.32424 | -1.0774 | -0.24916 |
| 1.09765 | 6.5703 | 8.4869 | 0.127014 | 4.40166 | -1.9166 | -0.43544 |
| 1.27936 | 13.8150 | 9.4022 | 0.119313 | 4.42103 | 4.4128 | 0.99818 |
| 2.20611 | 11.4501 | 14.0706 | 0.086145 | 4.50352 | -2.6205 | -0.58191 |
| ... | | | | | | |
| 8.70156 | 46.5475 | 46.7904 | 0.140453 | 4.36765 | -0.2429 | -0.05561 |
| 9.16463 | 45.7762 | 49.1230 | 0.163492 | 4.30872 | -3.3468 | -0.77679 |
| 4.00000 | 40.0000 | 23.1070 | 0.050974 | 4.58936 | 16.8930 | 3.68110 |

S = 4.711

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 4.00 | 40.00 | 23.11 | 1.06 | 16.89 | 3.68R |

R denotes an observation with a large standardized residual.
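The flagged row can be checked by hand from the quantities in the output above. The following is a minimal sketch using only the standard library; the function name is hypothetical, and the inputs (residual 16.8930, leverage 0.050974, S = 4.711) are the values printed for observation 21.

```python
import math

def standardized_residual(residual, leverage, s):
    """Return (s(e_i), e_i / s(e_i)), where s(e_i) = s * sqrt(1 - h_ii).

    s is the root mean squared error (Minitab's S), so s**2 is the MSE.
    """
    se = s * math.sqrt(1.0 - leverage)
    return se, residual / se

# Observation 21 from the output above.
se_21, sres_21 = standardized_residual(residual=16.8930, leverage=0.050974, s=4.711)
print(round(se_21, 3), round(sres_21, 2))
```

Small differences from the printed SRES1 in later decimal places are expected, because S is reported to only three decimals.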
![Page 21: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/21.jpg)
Deleted residuals
If observed yi is extreme, it may “pull” the fitted equation towards itself, thereby yielding a small ordinary residual.
Delete the ith case, estimate the regression function using the remaining n-1 cases, and use the x values of the ith case to predict its response.
Deleted residual:
$$d_i = y_i - \hat{y}_{i(i)}$$
![Page 22: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/22.jpg)
Deleted t residuals
A deleted t residual is just a standardized deleted residual:
$$t_i = \frac{d_i}{s(d_i)} = \frac{d_i}{\sqrt{MSE_{(i)} / (1 - h_{ii})}}$$
The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.
![Page 23: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/23.jpg)
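[Figure: scatterplot of y versus x; x axis 0 to 10, y axis 0 to 15; two fitted lines:]
y = 0.6 + 1.55 x (fitted without the point at x = 10)
y = 3.82 - 0.13 x (fitted to all four points)

| x | y | RESI1 | TRES1 |
|---|---|---|---|
| 1 | 2.1 | -1.59 | -1.7431 |
| 2 | 3.8 | 0.24 | 0.1217 |
| 3 | 5.2 | 1.77 | 1.6361 |
| 10 | 2.1 | -0.42 | -19.7990 |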
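The deleted t residuals for this four-point data set can be reproduced by literally deleting each case and refitting. This is a sketch for simple linear regression only (p = 2), written with the standard library; the helper names are made up for illustration, and the leverage uses the simple-regression closed form h_ii = 1/n + (x_i - x̄)² / Sxx.

```python
import math

def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def deleted_t_residuals(xs, ys):
    """t_i = d_i / sqrt(MSE_(i) / (1 - h_ii)), refitting without case i."""
    n, p = len(xs), 2
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    result = []
    for i in range(n):
        xs_i, ys_i = xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:]
        b0, b1 = fit_line(xs_i, ys_i)
        d_i = ys[i] - (b0 + b1 * xs[i])                # deleted residual
        sse_i = sum((y - (b0 + b1 * x)) ** 2 for x, y in zip(xs_i, ys_i))
        mse_i = sse_i / (n - 1 - p)                    # MSE with case i removed
        h_ii = 1.0 / n + (xs[i] - xbar) ** 2 / sxx     # leverage in the full data
        result.append(d_i / math.sqrt(mse_i / (1.0 - h_ii)))
    return result

xs = [1.0, 2.0, 3.0, 10.0]
ys = [2.1, 3.8, 5.2, 2.1]
t = deleted_t_residuals(xs, ys)
print([round(v, 4) for v in t])
```

The ordinary residual of the fourth point is small (-0.42) because the point drags the line toward itself; the deleted t residual exposes it.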
![Page 24: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/24.jpg)
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70; two fitted lines:]
y = 1.73 + 5.12 x
y = 2.96 + 5.04 x
| Row | x | y | RESI1 | SRES1 | TRES1 |
|---|---|---|---|---|---|
| 1 | 0.10000 | -0.0716 | -3.5330 | -0.82635 | -0.81916 |
| 2 | 0.45401 | 4.1673 | -1.0774 | -0.24916 | -0.24291 |
| 3 | 1.09765 | 6.5703 | -1.9166 | -0.43544 | -0.42596 |
| ... | | | | | |
| 19 | 8.70156 | 46.5475 | -0.2429 | -0.05561 | -0.05413 |
| 20 | 9.16463 | 45.7762 | -3.3468 | -0.77679 | -0.76837 |
| 21 | 4.00000 | 40.0000 | 16.8930 | 3.68110 | 6.69012 |
![Page 25: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/25.jpg)
Identifying outlying X values
![Page 26: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/26.jpg)
Identifying outlying X values
• Use the diagonal elements, hii, of the hat matrix H to identify outlying X values.
• The hii are called leverages.
![Page 27: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/27.jpg)
Properties of the leverages (hii)
• The hii is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.
• The hii is a number between 0 and 1, inclusive.
• The sum of the hii equals p, the number of parameters.
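These three properties are easy to check numerically in the simple linear regression case, where the leverage has the closed form h_ii = 1/n + (x_i - x̄)² / Sxx. The sketch below assumes that setting (p = 2) and borrows the x values of the four-point deleted t residual example; the function name is illustrative.

```python
def leverages(xs):
    """Leverages h_ii for simple linear regression with an intercept."""
    n = len(xs)
    xbar = sum(xs) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    # Each leverage grows with the squared distance of x_i from the mean.
    return [1.0 / n + (x - xbar) ** 2 / sxx for x in xs]

h = leverages([1.0, 2.0, 3.0, 10.0])
print([round(v, 2) for v in h])
print(round(sum(h), 6))  # equals p = 2 for intercept plus slope
```

The case farthest from the mean of x (here x = 10) receives by far the largest leverage.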
![Page 28: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/28.jpg)
[Figure: dotplot for x; axis 0 to 9; sample mean = 4.751]

h(1,1) = 0.176, h(11,11) = 0.048, h(20,20) = 0.163

HI1: 0.176297, 0.157454, 0.127014, 0.119313, 0.086145, 0.077744, 0.065028, 0.061276, 0.048147, 0.049628, 0.049313, 0.051829, 0.055760, 0.069311, 0.072580, 0.109616, 0.127489, 0.141136, 0.140453, 0.163492, 0.050974

Sum of HI1 = 2.0000
![Page 29: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/29.jpg)
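Properties of the leverages (hii)

$$
\hat{y} = Hy =
\begin{bmatrix}
h_{11} & h_{12} & h_{13} & h_{14} \\
h_{21} & h_{22} & h_{23} & h_{24} \\
h_{31} & h_{32} & h_{33} & h_{34} \\
h_{41} & h_{42} & h_{43} & h_{44}
\end{bmatrix}
\begin{bmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{bmatrix}
=
\begin{bmatrix}
h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\
h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\
h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\
h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4
\end{bmatrix}
$$

If the ith case is outlying in terms of its X values, it has a large leverage value hii, and therefore exercises substantial leverage in determining the fitted value.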
![Page 30: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/30.jpg)
Using leverages to identify outlying X values
Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value …
$$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$
… or if it is greater than 0.99.
![Page 31: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/31.jpg)
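[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

| x | y | HI1 |
|---|---|---|
| 14.00 | 68.00 | 0.357535 |

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 14.0 | 68.00 | 71.449 | 1.620 | -3.449 | -1.59 X |

X denotes an observation whose X value gives it large influence.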
![Page 32: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/32.jpg)
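[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

| x | y | HI2 |
|---|---|---|
| 13.00 | 15.00 | 0.311532 |

Unusual Observations

| Obs | x | y | Fit | SE Fit | Residual | St Resid |
|---|---|---|---|---|---|---|
| 21 | 13.0 | 15.00 | 51.66 | 5.83 | -36.66 | -4.23RX |

R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.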
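The flagging rule applied to the two leverage values reported above can be written as a one-line predicate. This is only a sketch of the rule as the slides state it (3 times p/n, or greater than 0.99); the function and parameter names are hypothetical and are not Minitab's.

```python
def flag_high_leverage(h_ii, n, p, multiple=3.0, absolute=0.99):
    """True if h_ii exceeds multiple * (p / n) or the absolute cutoff."""
    return h_ii > multiple * p / n or h_ii > absolute

n, p = 21, 2
cutoff = 3.0 * p / n
print(round(cutoff, 3))

# Leverages of the added 21st point in the two examples above.
print(flag_high_leverage(0.357535, n, p))
print(flag_high_leverage(0.311532, n, p))
```

Both added points exceed the cutoff, which matches the X flag in each Unusual Observations listing.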
![Page 33: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/33.jpg)
Identifying influential cases
![Page 34: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/34.jpg)
Influence
• A case is influential if its exclusion causes major changes in the estimated regression function.
![Page 35: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/35.jpg)
Identifying influential cases
• Difference in fits, DFITS
• Cook’s distance measure
![Page 36: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/36.jpg)
DFITS
The difference in fits …
$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$
… represents the number of standard deviations that the fitted value increases or decreases when the ith case is included.
![Page 37: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/37.jpg)
DFITS
A case is influential if the absolute value of its DFITS value is …
… greater than 1 for small to medium data sets
… greater than $2\sqrt{p/n}$ for large data sets
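DFITS can be obtained from a deleted t residual and a leverage without refitting, using DFITS_i = t_i · sqrt(h_ii / (1 - h_ii)). The sketch below uses the standard library; the function names are illustrative, and the inputs are the deleted t residual (6.69012) and leverage (0.050974) listed earlier for the outlying case at x = 4, y = 40.

```python
import math

def dfits_from_t(t_i, h_ii):
    """DFITS_i = t_i * sqrt(h_ii / (1 - h_ii))."""
    return t_i * math.sqrt(h_ii / (1.0 - h_ii))

def dfits_cutoff_large(n, p):
    """Guideline for large data sets: 2 * sqrt(p / n)."""
    return 2.0 * math.sqrt(p / n)

dfits_21 = dfits_from_t(t_i=6.69012, h_ii=0.050974)
cutoff = dfits_cutoff_large(n=21, p=2)
print(round(dfits_21, 3), round(cutoff, 3))
```

With 21 cases the "greater than 1" guideline is the more natural one, and this case exceeds it despite its low leverage because its deleted t residual is so large.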
![Page 38: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/38.jpg)
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

| x | y | DFIT1 |
|---|---|---|
| 14.00 | 68.00 | -1.23841 |
![Page 39: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/39.jpg)
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

| x | y | DFIT2 |
|---|---|---|
| 13.00 | 15.00 | -11.4670 |
![Page 40: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/40.jpg)
Cook’s distance
Cook’s distance measure …
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p\,MSE}$$
… considers the influence of the ith case on all n fitted values.
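The definition can be applied directly: refit without case i and sum the squared shifts in all n fitted values. The following sketch does this for simple linear regression (p = 2) with the standard library, reusing the four-point data set from the deleted t residual example; the helper names are hypothetical.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b1 = sxy / sxx
    return ybar - b1 * xbar, b1

def cooks_distances(xs, ys):
    """D_i = sum_j (yhat_j - yhat_j(i))^2 / (p * MSE), by explicit refitting."""
    n, p = len(xs), 2
    b0, b1 = fit_line(xs, ys)
    fitted = [b0 + b1 * x for x in xs]
    mse = sum((y - f) ** 2 for y, f in zip(ys, fitted)) / (n - p)
    distances = []
    for i in range(n):
        # Refit with case i removed, then predict at all n x values.
        c0, c1 = fit_line(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        shift = sum((f - (c0 + c1 * x)) ** 2 for x, f in zip(xs, fitted))
        distances.append(shift / (p * mse))
    return distances

xs = [1.0, 2.0, 3.0, 10.0]
ys = [2.1, 3.8, 5.2, 2.1]
D = cooks_distances(xs, ys)
print([round(d, 3) for d in D])
```

Refitting n times is wasteful for large data sets, but it mirrors the definition term by term, which is the point of the sketch.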
![Page 41: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/41.jpg)
Cook’s distance
• Relate Di to the F(p, n-p) distribution.
• If Di is greater than the 50th percentile, F(0.50, p, n-p), then the ith case has lots of influence.
![Page 42: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/42.jpg)
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

F(0.50, 2, 19) = 0.7191

| x | y | COOK1 |
|---|---|---|
| 14.00 | 68.00 | 0.701960 |
![Page 43: Outliers and influential data points](https://reader033.vdocument.in/reader033/viewer/2022061421/56812ec9550346895d9468e8/html5/thumbnails/43.jpg)
[Figure: scatterplot of y versus x; x axis 0 to 14, y axis 0 to 70]

F(0.50, 2, 19) = 0.7191

| x | y | COOK2 |
|---|---|---|
| 13.00 | 15.00 | 4.04801 |
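The 50th percentile quoted on the last two slides can be checked without statistical tables. When the numerator degrees of freedom equal 2, the F distribution has the closed-form upper tail P(F > x) = (1 + 2x/d)^(-d/2), where d is the denominator degrees of freedom, so its quantiles can be solved for directly. The sketch below relies on that special case only (it does not apply for other values of p), and the function name is illustrative.

```python
def f_quantile_df1_2(prob, df2):
    """Quantile of F(2, df2): solve (1 + 2x/df2) ** (-df2/2) = 1 - prob for x."""
    return (df2 / 2.0) * ((1.0 - prob) ** (-2.0 / df2) - 1.0)

median_f = f_quantile_df1_2(prob=0.50, df2=19)
print(round(median_f, 4))

# Cook's distances reported for the added point in the two examples.
for label, d_i in [("COOK1", 0.701960), ("COOK2", 4.04801)]:
    print(label, d_i > median_f)
```

By this guideline the point at (14, 68) falls just short of the 50th percentile, while the point at (13, 15) is far beyond it.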