Outliers and influential data points
No outliers?
[Figure: scatterplot of y versus x.]
An outlier? Influential?
[Figure: scatterplot of y versus x.]
An outlier? Influential?
[Figure: scatterplot of y versus x with two fitted lines.]
y = 1.73 + 5.12 x
y = 2.96 + 5.04 x
An outlier? Influential?
[Figure: scatterplot of y versus x.]
An outlier? Influential?
[Figure: scatterplot of y versus x with two fitted lines.]
y = 1.73 + 5.12 x
y = 2.47 + 4.93 x
An outlier? Influential?
[Figure: scatterplot of y versus x.]
An outlier? Influential?
[Figure: scatterplot of y versus x with two fitted lines.]
y = 1.73 + 5.12 x
y = 8.51 + 3.32 x
Impact on regression analyses
• Not every outlier strongly influences the estimated regression function.
• Always determine whether the estimated regression function is unduly influenced by one or a few cases.
• Simple plots for simple linear regression.
• Summary measures for multiple linear regression.
The hat matrix H
Least squares estimates: $b = (X'X)^{-1}X'y$
The regression model: $Y = X\beta + \varepsilon$, with $E(Y) = X\beta$
Fitted values: $\hat{y} = Xb = X(X'X)^{-1}X'y$
$\hat{y} = Hy$
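$$y = \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix}
\qquad
X = \begin{pmatrix} 1 & x_{11} & x_{21} \\ 1 & x_{12} & x_{22} \\ 1 & x_{13} & x_{23} \\ 1 & x_{14} & x_{24} \end{pmatrix} = \begin{pmatrix} 1 & 4.2 & 4 \\ 1 & 6.5 & 6.5 \\ 1 & 3 & 3.5 \\ 1 & 3 & 2.8 \end{pmatrix}$$

$$H = X(X'X)^{-1}X' = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix}$$

$$\hat{y} = Hy = \begin{pmatrix} 0.411 & 0.202 & -0.058 & 0.444 \\ 0.202 & 0.931 & 0.020 & -0.152 \\ -0.058 & 0.020 & 0.994 & 0.044 \\ 0.444 & -0.152 & 0.044 & 0.664 \end{pmatrix} \begin{pmatrix} 8 \\ 15 \\ 10 \\ 7 \end{pmatrix} = \begin{pmatrix} 8.85 \\ 14.71 \\ 10.08 \\ 6.36 \end{pmatrix}$$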
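The four-observation example can be reproduced numerically. The sketch below is one possible way to do it in plain Python; the helper names (`transpose`, `matmul`, `inverse`) are illustrative and not part of the original slides, and the data values are the ones read off the example (y = 8, 15, 10, 7 with two predictors).

```python
# Data as read off the four-observation example
# (columns of X: intercept, first predictor, second predictor).
X = [[1.0, 4.2, 4.0],
     [1.0, 6.5, 6.5],
     [1.0, 3.0, 3.5],
     [1.0, 3.0, 2.8]]
y = [8.0, 15.0, 10.0, 7.0]


def transpose(A):
    return [list(col) for col in zip(*A)]


def matmul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def inverse(A):
    """Invert a small square matrix by Gauss-Jordan elimination with partial pivoting."""
    n = len(A)
    # Augment A with the identity matrix.
    M = [list(row) + [float(i == j) for j in range(n)] for i, row in enumerate(A)]
    for c in range(n):
        pivot = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[pivot] = M[pivot], M[c]
        pv = M[c][c]
        M[c] = [v / pv for v in M[c]]
        for r in range(n):
            if r != c:
                f = M[r][c]
                M[r] = [vr - f * vc for vr, vc in zip(M[r], M[c])]
    return [row[n:] for row in M]


Xt = transpose(X)
H = matmul(matmul(X, inverse(matmul(Xt, X))), Xt)             # H = X (X'X)^{-1} X'
y_hat = [sum(h * yj for h, yj in zip(row, y)) for row in H]   # y_hat = H y
leverages = [H[i][i] for i in range(len(y))]                  # diagonal elements h_ii

for row in H:
    print("  ".join(f"{h:7.3f}" for h in row))
print("fitted values:", [round(v, 2) for v in y_hat])
```

The diagonal of `H` should add up to the number of parameters (3 here), which gives a quick sanity check on the computation.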
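$$H = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix}$$

$$\hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix}$$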
Identifying outlying Y values
• Residuals
• Standardized residuals (also called internally studentized residuals)
• Deleted residuals
• Deleted t residuals (also called studentized deleted residuals or externally studentized residuals)
Residuals
Ordinary residuals defined for each observation, i = 1, …, n:
$e_i = y_i - \hat{y}_i$
Using matrix notation:
$e = y - \hat{y} = y - X(X'X)^{-1}X'y$
$e = y - Hy = (I - H)y$
Variance of the residuals
Residual vector: $e = y - Hy = (I - H)y$
Variance matrix: $Var(e) = \sigma^2 (I - H)$
Variance of the ith residual: $Var(e_i) = \sigma^2 (1 - h_{ii})$
Estimated standard deviation of the ith residual: $s(e_i) = \sqrt{MSE\,(1 - h_{ii})}$
Standardized residuals
Standardized residuals defined for each observation, i = 1, …, n:
$$e_i^* = \frac{e_i}{s(e_i)} = \frac{e_i}{\sqrt{MSE\,(1 - h_{ii})}}$$
Standardized residuals quantify how large the residuals are in standard deviation units.
Standardized residuals larger than 2 or smaller than -2 suggest that the y values are unusual.
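As a rough illustration of the formula, the sketch below computes standardized residuals for the small four-point data set that appears later in these slides (x = 1, 2, 3, 10). It assumes simple linear regression, where the leverage has the closed form h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)²; the variable names are illustrative only.

```python
import math

# Four-point data set used later in the slides.
x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n, p = len(x), 2  # p = number of parameters (intercept and slope)

# Least squares fit for simple linear regression.
x_bar = sum(x) / n
y_bar = sum(y) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y)) / sxx
b0 = y_bar - b1 * x_bar

# Ordinary residuals, leverages (simple-regression closed form), and MSE.
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
mse = sum(e * e for e in residuals) / (n - p)

# Standardized residual: e_i / sqrt(MSE * (1 - h_ii)).
std_residuals = [e / math.sqrt(mse * (1.0 - h)) for e, h in zip(residuals, leverages)]

# Observations (1-based) whose standardized residual exceeds 2 in absolute value.
flagged = [i + 1 for i, r in enumerate(std_residuals) if abs(r) > 2]

for xi, e, r in zip(x, residuals, std_residuals):
    print(f"x = {xi:5.1f}  residual = {e:6.2f}  standardized = {r:7.4f}")
print("flagged:", flagged)
```

For this data set none of the standardized residuals exceeds 2 in absolute value, even for the point at x = 10, which is one reason the deleted residuals discussed next are useful.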
An outlying y value?
[Figure: scatterplot of y versus x.]
      x        y    FITS1       HI1     s(e)    RESI1     SRES1
0.10000  -0.0716   3.4614  0.176297  4.27561  -3.5330  -0.82635
0.45401   4.1673   5.2446  0.157454  4.32424  -1.0774  -0.24916
1.09765   6.5703   8.4869  0.127014  4.40166  -1.9166  -0.43544
1.27936  13.8150   9.4022  0.119313  4.42103   4.4128   0.99818
2.20611  11.4501  14.0706  0.086145  4.50352  -2.6205  -0.58191
...
8.70156  46.5475  46.7904  0.140453  4.36765  -0.2429  -0.05561
9.16463  45.7762  49.1230  0.163492  4.30872  -3.3468  -0.77679
4.00000  40.0000  23.1070  0.050974  4.58936  16.8930   3.68110
S = 4.711
Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  4.00  40.00  23.11    1.06     16.89     3.68R
R denotes an observation with a large standardized residual
Deleted residuals
If observed yi is extreme, it may “pull” the fitted equation towards itself, thereby yielding a small ordinary residual.
Delete the ith case, estimate the regression function using remaining n-1 cases, and use the x values to predict the response for the ith case.
Deleted residual: $d_i = y_i - \hat{y}_{i(i)}$
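The delete-and-refit procedure can be sketched directly. The example below uses the four-point data set from these slides (x = 1, 2, 3, 10) and an illustrative helper `fit_line`; it also compares the result with the well-known shortcut d_i = e_i / (1 - h_ii), which is not stated on the slides but avoids refitting.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n = len(x)

# Deleted residual by definition: drop case i, refit on the remaining
# n - 1 cases, then predict the response at x_i.
deleted_residuals = []
for i in range(n):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    b0_i, b1_i = fit_line(xs, ys)
    deleted_residuals.append(y[i] - (b0_i + b1_i * x[i]))

# Shortcut using the full fit only: d_i = e_i / (1 - h_ii).
b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
shortcut = [
    (yi - (b0 + b1 * xi)) / (1.0 - (1.0 / n + (xi - x_bar) ** 2 / sxx))
    for xi, yi in zip(x, y)
]

for xi, d, s in zip(x, deleted_residuals, shortcut):
    print(f"x = {xi:5.1f}  deleted residual = {d:9.4f}  shortcut = {s:9.4f}")
```

For the point at x = 10 the ordinary residual is only -0.42, while the deleted residual is about -14, which shows how strongly that point pulls the fitted line towards itself.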
Deleted t residuals
A deleted t residual is just a standardized deleted residual:
$$t_i = \frac{d_i}{s(d_i)} = \frac{e_i}{\sqrt{MSE_{(i)}\,(1 - h_{ii})}}$$
The deleted t residuals follow a t distribution with ((n-1)-p) degrees of freedom.
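A minimal sketch of the calculation, again for the four-point data set in these slides (x = 1, 2, 3, 10): MSE_(i) is obtained here by actually refitting without case i, and the helper names are illustrative. With n = 4 and p = 2 each deleted fit has only (n - 1) - p = 1 degree of freedom.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


def sse(xs, ys):
    """Error sum of squares of the least squares line fitted to (xs, ys)."""
    b0, b1 = fit_line(xs, ys)
    return sum((yi - (b0 + b1 * xi)) ** 2 for xi, yi in zip(xs, ys))


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
residuals = [yi - (b0 + b1 * xi) for xi, yi in zip(x, y)]
leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

# Deleted t residual: t_i = e_i / sqrt(MSE_(i) * (1 - h_ii)),
# where MSE_(i) comes from the fit that excludes case i.
t_residuals = []
for i in range(n):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    mse_i = sse(xs, ys) / (n - 1 - p)
    t_residuals.append(residuals[i] / math.sqrt(mse_i * (1.0 - leverages[i])))

for xi, e, t in zip(x, residuals, t_residuals):
    print(f"x = {xi:5.1f}  RESI = {e:6.2f}  TRES = {t:9.4f}")
```

The values should agree with the RESI1 and TRES1 columns reported for this data set in the slides.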
[Figure: scatterplot of y versus x with two fitted lines.]
y = 0.6 + 1.55 x
y = 3.82 - 0.13 x

  x    y  RESI1     TRES1
  1  2.1  -1.59   -1.7431
  2  3.8   0.24    0.1217
  3  5.2   1.77    1.6361
 10  2.1  -0.42  -19.7990
[Figure: scatterplot of y versus x with two fitted lines.]
y = 1.73 + 5.12 x
y = 2.96 + 5.04 x

Row        x        y    RESI1     SRES1     TRES1
  1  0.10000  -0.0716  -3.5330  -0.82635  -0.81916
  2  0.45401   4.1673  -1.0774  -0.24916  -0.24291
  3  1.09765   6.5703  -1.9166  -0.43544  -0.42596
...
 19  8.70156  46.5475  -0.2429  -0.05561  -0.05413
 20  9.16463  45.7762  -3.3468  -0.77679  -0.76837
 21  4.00000  40.0000  16.8930   3.68110   6.69012
Identifying outlying X values
• Use the diagonal elements, hii, of the hat matrix H to identify outlying X values.
• The hii are called leverages.
Properties of the leverages (hii)
• The hii is a measure of the distance between the X values for the ith case and the means of the X values for all n cases.
• The hii is a number between 0 and 1, inclusive.
• The sum of the hii equals p, the number of parameters.
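These properties are easy to check numerically. The sketch below assumes simple linear regression (p = 2), where the leverage has the closed form h_ii = 1/n + (x_i - x̄)² / Σ(x_j - x̄)², and uses the x values of the four-point data set from these slides; the variable names are illustrative.

```python
# x values of the four-point data set (simple linear regression, p = 2).
x = [1.0, 2.0, 3.0, 10.0]
n, p = len(x), 2

x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)

# Closed-form leverage for simple linear regression:
# it grows with the squared distance of x_i from the mean of x.
leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

for xi, h in zip(x, leverages):
    print(f"x = {xi:5.1f}  distance from mean = {abs(xi - x_bar):4.1f}  h_ii = {h:.2f}")

print("all between 0 and 1:", all(0.0 <= h <= 1.0 for h in leverages))
print("sum of leverages:", round(sum(leverages), 6), "(p =", p, ")")
```

The point farthest from the mean of x (x = 10) receives by far the largest leverage, and the leverages add up to p.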
[Figure: dotplot for x, with axis from 0 to 9; sample mean = 4.751.]
h(1,1) = 0.176    h(11,11) = 0.048    h(20,20) = 0.163

HI1
0.176297  0.157454  0.127014  0.119313  0.086145  0.077744  0.065028
0.061276  0.048147  0.049628  0.049313  0.051829  0.055760  0.069311
0.072580  0.109616  0.127489  0.141136  0.140453  0.163492  0.050974

Sum of HI1 = 2.0000
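$$\hat{y} = Hy = \begin{pmatrix} h_{11} & h_{12} & h_{13} & h_{14} \\ h_{21} & h_{22} & h_{23} & h_{24} \\ h_{31} & h_{32} & h_{33} & h_{34} \\ h_{41} & h_{42} & h_{43} & h_{44} \end{pmatrix} \begin{pmatrix} y_1 \\ y_2 \\ y_3 \\ y_4 \end{pmatrix} = \begin{pmatrix} h_{11}y_1 + h_{12}y_2 + h_{13}y_3 + h_{14}y_4 \\ h_{21}y_1 + h_{22}y_2 + h_{23}y_3 + h_{24}y_4 \\ h_{31}y_1 + h_{32}y_2 + h_{33}y_3 + h_{34}y_4 \\ h_{41}y_1 + h_{42}y_2 + h_{43}y_3 + h_{44}y_4 \end{pmatrix}$$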
Properties of the leverages (hii)
If the ith case is outlying in terms of its X values, it has a large leverage value hii, and therefore exercises substantial leverage in determining the fitted value.
Using leverages to identify outlying X values
Minitab flags any observation whose leverage value, hii, is more than 3 times the mean leverage value …
$$\bar{h} = \frac{\sum_{i=1}^{n} h_{ii}}{n} = \frac{p}{n}$$
… or if it’s greater than 0.99.
[Figure: scatterplot of y versus x.]
$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

    x      y       HI1
14.00  68.00  0.357535

Unusual Observations
Obs     x      y     Fit  SE Fit  Residual  St Resid
 21  14.0  68.00  71.449   1.620    -3.449    -1.59 X
X denotes an observation whose X value gives it large influence.
[Figure: scatterplot of y versus x.]
$$3\left(\frac{p}{n}\right) = 3\left(\frac{2}{21}\right) = 0.286$$

    x      y       HI2
13.00  15.00  0.311532

Unusual Observations
Obs     x      y    Fit  SE Fit  Residual  St Resid
 21  13.0  15.00  51.66    5.83    -36.66    -4.23RX
R denotes an observation with a large standardized residual.
X denotes an observation whose X value gives it large influence.
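The flagging rule described above can be written as a small check. The function name `flag_high_leverage` and its parameter names are hypothetical; the leverage values are the ones reported in the slides for the two 21-observation examples, plus one unremarkable leverage from the earlier HI1 column for contrast.

```python
def flag_high_leverage(h, n, p, multiple=3.0, absolute_cutoff=0.99):
    """Return True if leverage h exceeds `multiple` times the mean leverage p/n,
    or exceeds the absolute cutoff."""
    return h > multiple * p / n or h > absolute_cutoff


n, p = 21, 2
threshold = 3.0 * p / n
print(f"3(p/n) = {threshold:.3f}")

# Leverage values taken from the slides.
examples = {
    "x = 14, y = 68 (HI1)": 0.357535,
    "x = 13, y = 15 (HI2)": 0.311532,
    "first case of the earlier HI1 column": 0.176297,
}
flags = {label: flag_high_leverage(h, n, p) for label, h in examples.items()}
for label, h in examples.items():
    print(f"{label}: h = {h:.6f}  flagged = {flags[label]}")
```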
Identifying influential cases
Influence
• A case is influential if its exclusion causes major changes in the estimated regression function.
Identifying influential cases
• Difference in fits, DFITS
• Cook’s distance measure
DFITS
$$DFITS_i = \frac{\hat{y}_i - \hat{y}_{i(i)}}{\sqrt{MSE_{(i)}\,h_{ii}}} = t_i \sqrt{\frac{h_{ii}}{1 - h_{ii}}}$$
The difference in fits …
… represents the number of standard deviations that the fitted value increases or decreases when the ith case is included.
DFITS
A case is influential if the absolute value of its DFIT value is …
… greater than 1 for small to medium data sets
… greater than $2\sqrt{p/n}$ for large data sets
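The two forms of DFITS (the definition and the shortcut in terms of the deleted t residual) can be compared numerically. The sketch below uses the four-point data set from these slides (x = 1, 2, 3, 10); helper names are illustrative, and with n = 4 the "small to medium data set" cutoff of 1 is the one applied.

```python
import math


def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
fitted = [b0 + b1 * xi for xi in x]
leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]

dfits_definition, dfits_shortcut = [], []
for i in range(n):
    xs = x[:i] + x[i + 1:]
    ys = y[:i] + y[i + 1:]
    b0_i, b1_i = fit_line(xs, ys)
    # MSE of the fit that excludes case i.
    mse_i = sum((yj - (b0_i + b1_i * xj)) ** 2 for xj, yj in zip(xs, ys)) / (n - 1 - p)
    h = leverages[i]
    # Definition: change in the i-th fitted value, scaled by sqrt(MSE_(i) * h_ii).
    dfits_definition.append((fitted[i] - (b0_i + b1_i * x[i])) / math.sqrt(mse_i * h))
    # Shortcut: deleted t residual times sqrt(h_ii / (1 - h_ii)).
    t_i = (y[i] - fitted[i]) / math.sqrt(mse_i * (1.0 - h))
    dfits_shortcut.append(t_i * math.sqrt(h / (1.0 - h)))

influential = [i + 1 for i, d in enumerate(dfits_definition) if abs(d) > 1]
for xi, d in zip(x, dfits_definition):
    print(f"x = {xi:5.1f}  DFITS = {d:10.4f}")
print("cases with |DFITS| > 1:", influential)
```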
[Figure: scatterplot of y versus x.]
$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

    x      y     DFIT1
14.00  68.00  -1.23841
[Figure: scatterplot of y versus x.]
$$2\sqrt{\frac{p}{n}} = 2\sqrt{\frac{2}{21}} = 0.62$$

    x      y     DFIT2
13.00  15.00  -11.4670
Cook’s distance
$$D_i = \frac{\sum_{j=1}^{n} \left(\hat{y}_j - \hat{y}_{j(i)}\right)^2}{p \cdot MSE}$$
Cook’s distance measure …
… considers the influence of the ith case on all n fitted values.
Cook’s distance
• Relate Di to the F(p, n-p) distribution.
• If Di is greater than the 50th percentile, F(0.50, p, n-p), then the ith case has lots of influence.
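A rough sketch of both steps follows. Cook's distance is computed by definition (refitting without each case) for the four-point data set from these slides and compared with the standard one-fit shortcut D_i = e_i² h_ii / (p · MSE · (1 - h_ii)²), which is not stated on the slides. For the percentile, the sketch relies on a closed form for the median of an F distribution that holds only when the numerator degrees of freedom equal 2 (as in simple linear regression); for general p a statistics library would be needed. Function names are illustrative.

```python
def fit_line(xs, ys):
    """Least squares intercept and slope for simple linear regression."""
    n = len(xs)
    x_bar = sum(xs) / n
    y_bar = sum(ys) / n
    sxx = sum((xi - x_bar) ** 2 for xi in xs)
    b1 = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(xs, ys)) / sxx
    return y_bar - b1 * x_bar, b1


def f_median_numerator_df_2(denominator_df):
    """Median of F(2, denominator_df); closed form valid only for numerator df = 2."""
    return (denominator_df / 2.0) * (2.0 ** (2.0 / denominator_df) - 1.0)


x = [1.0, 2.0, 3.0, 10.0]
y = [2.1, 3.8, 5.2, 2.1]
n, p = len(x), 2

b0, b1 = fit_line(x, y)
x_bar = sum(x) / n
sxx = sum((xi - x_bar) ** 2 for xi in x)
fitted = [b0 + b1 * xi for xi in x]
residuals = [yi - fi for yi, fi in zip(y, fitted)]
leverages = [1.0 / n + (xi - x_bar) ** 2 / sxx for xi in x]
mse = sum(e * e for e in residuals) / (n - p)

cooks_definition, cooks_shortcut = [], []
for i in range(n):
    b0_i, b1_i = fit_line(x[:i] + x[i + 1:], y[:i] + y[i + 1:])
    # Definition: squared change in ALL n fitted values when case i is deleted.
    change = sum((fitted[j] - (b0_i + b1_i * x[j])) ** 2 for j in range(n))
    cooks_definition.append(change / (p * mse))
    h = leverages[i]
    cooks_shortcut.append(residuals[i] ** 2 * h / (p * mse * (1.0 - h) ** 2))

cutoff_small = f_median_numerator_df_2(n - p)   # F(0.50, 2, 2) for the four-point data
cutoff_slides = f_median_numerator_df_2(19)     # F(0.50, 2, 19), the slides' n = 21 case

for xi, d in zip(x, cooks_definition):
    print(f"x = {xi:5.1f}  D = {d:8.4f}  exceeds median: {d > cutoff_small}")
print(f"F(0.50, 2, 19) = {cutoff_slides:.4f}")
# Cook's distances reported in the slides for the two 21-observation examples.
print("COOK1 = 0.701960 exceeds cutoff:", 0.701960 > cutoff_slides)
print("COOK2 = 4.04801 exceeds cutoff:", 4.04801 > cutoff_slides)
```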
[Figure: scatterplot of y versus x.]
F(0.50, 2, 19) = 0.7191

    x      y     COOK1
14.00  68.00  0.701960
[Figure: scatterplot of y versus x.]
F(0.50, 2, 19) = 0.7191

    x      y    COOK2
13.00  15.00  4.04801