36-401 Modern Regression HW #8 SolutionsDUE: 11/10/2017 at 3PM
Problem 1 [25 points]
(a)
This is still a linear regression model—the simplest possible one. That being the case, the solution we derivedbefore
β = (XTX)−1XTY
still holds. The design matrix is
X =
11...1
,
so
β = (XTX)−1︸ ︷︷ ︸1/n
XTY︸ ︷︷ ︸=∑
Yi
= Y .
(b)
H = X(XTX)−1XT
=
11...1
(n)−1 (1 1 · · · 1)
=
1n
1n · · · 1
n...
. . . 1n
.... . .
...1n
1n · · · 1
n
So
hii = 1n
for all i = 1, . . . , n.
1
(c)
Y = HY
=
1n
∑ni=1 Yi...
1n
∑ni=1 Yi
=
Y...Y
(d)
ti =Yi − Yi(−i)
si
=
Yi − 1n−1
∑1≤j≤nj 6=i
Yi
√Var(Yi − Yi(−i))
=
Yi − 1n−1
(n∑j=1
Yj − Yi
)√Var(Yi) + Var(Yi(−i))
=Yi − 1
n−1(nY − Yi
)√σ2 + σ2
n−1
=Yi − 1
n−1(nY − Yi
)√nσ2
n−1
=nn−1 (Yi − Y )√
nσ2
n−1
(e)
limYi→∞
ti = limYi→∞
nn−1 (Yi − Y )√
nσ2
n−1
=nn−1 (limYi→∞ Yi − Y )√
nσ2
n−1
=∞
2
Problem 2 [20 points]
H = HHh11 h12 · · · h1nh21 h22 · · · h2n...
. . ....
hn1 · · · · · · hnn
=
h11 h12 · · · h1nh21 h22 · · · h2n...
. . ....
hn1 · · · · · · hnn
h11 h12 · · · h1nh21 h22 · · · h2n...
. . ....
hn1 · · · · · · hnn
So for each diagonal element of H we have,
hii = h2ii +
∑j 6=i
h2ij .
We have expressed hii as a sum of squares, so
hii ≥ 0.
Furthermore,hii ≥ h2
ii,
sohii ≤ 1.
3
Problem 3 [20 points]
Comments omitted. One of the points in data set four produces NaN’s because it has leverage 1. See equation(2) of Lecture Notes 20.attach(anscombe)names(anscombe)
## [1] "x1" "x2" "x3" "x4" "y1" "y2" "y3" "y4"
4 6 8 10 12 14
45
67
89
1011
X
Y
4 6 8 10 12 14−
2−
10
1
X
Res
idua
ls
4 6 8 10 12 14
−2
−1
01
X
Stu
dent
ized
Res
idua
ls
2 4 6 8 10
46
810
1214
X
Coo
k's
Dis
tanc
e
Figure 1: Data Set 1
4
4 6 8 10 12 14
34
56
78
9
X
Y
4 6 8 10 12 14
−2.
0−
1.5
−1.
0−
0.5
0.0
0.5
1.0
X
Res
idua
ls
4 6 8 10 12 14
−2.
0−
1.5
−1.
0−
0.5
0.0
0.5
1.0
X
Stu
dent
ized
Res
idua
ls
2 4 6 8 10
46
810
1214
X
Coo
k's
Dis
tanc
e
Figure 2: Data Set 2
5
4 6 8 10 12 14
68
1012
X
Y
4 6 8 10 12 14
−1
01
23
X
Res
idua
ls
4 6 8 10 12 14
020
040
060
080
010
0012
00
X
Stu
dent
ized
Res
idua
ls
2 4 6 8 10
46
810
1214
X
Coo
k's
Dis
tanc
e
Figure 3: Data Set 3
6
8 10 12 14 16 18
68
1012
X
Y
8 10 12 14 16 18
−1.
5−
0.5
0.0
0.5
1.0
1.5
X
Res
idua
ls
8 10 12 14 16 18
−1.
5−
1.0
−0.
50.
00.
51.
01.
5
X
Stu
dent
ized
Res
idua
ls
2 4 6 8 10
810
1214
1618
X
Coo
k's
Dis
tanc
e
Figure 4: Data Set 4
7
Problem 4 [25 points]
Discussions omitted
(i)
y
20 40 60 80 100 140 180 220
0.0
0.2
0.4
0.6
0.8
1.0
2040
6080
100
age
tri
150
250
350
450
0.0 0.2 0.4 0.6 0.8 1.0
140
180
220
150 250 350 450
chol
(ii)
model <- lm(y ~ age + tri + chol, data = data)
8
(iii)
20 40 60 80 100
−1
01
23
45
Age
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
2040
6080
100
Age
Coo
k's
Dis
tanc
e
150 200 250 300 350 400 450
−1
01
23
45
Triglycerides
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
150
200
250
300
350
400
450
Triglycerides
Coo
k's
Dis
tanc
e
140 160 180 200 220 240
−1
01
23
45
Cholesterol
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
140
160
180
200
220
240
Cholesterol
Coo
k's
Dis
tanc
e
9
(iv)
model <- lm(y ~ age + tri + chol + I(age^2) + I(tri^2) + I(chol^2), data = data)
20 40 60 80 100
010
2030
4050
Age
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
2040
6080
100
Age
Coo
k's
Dis
tanc
e
150 200 250 300 350 400 450
010
2030
4050
Triglycerides
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
150
200
250
300
350
400
450
Triglycerides
Coo
k's
Dis
tanc
e
140 160 180 200 220 240
010
2030
4050
Cholesterol
Stu
dent
ized
Res
idua
ls
5 10 15 20 25
140
160
180
200
220
240
Cholesterol
Coo
k's
Dis
tanc
e
10
which(rstudent(model) > 30)
## 25## 25
(v)
data2 <- data[-25,]
model <- lm(y ~ age + tri + chol + I(age^2) + I(tri^2) + I(chol^2), data = data2)
20 30 40 50 60 70 80
−3
−2
−1
01
2
Age
Stu
dent
ized
Res
idua
ls
5 10 15 20
2030
4050
6070
80
Age
Coo
k's
Dis
tanc
e
150 200 250 300 350 400 450
−3
−2
−1
01
2
Triglycerides
Stu
dent
ized
Res
idua
ls
5 10 15 20
150
200
250
300
350
400
450
Triglycerides
Coo
k's
Dis
tanc
e
11
140 160 180 200 220 240
−3
−2
−1
01
2
Cholesterol
Stu
dent
ized
Res
idua
ls
5 10 15 20
140
160
180
200
220
240
Cholesterol
Coo
k's
Dis
tanc
e
library(knitr)kable(summary(model)$coefficients)
Estimate Std. Error t value Pr(>|t|)(Intercept) -0.0953012 0.0001105 -862.4135881 0.0000000age 0.0000417 0.0000015 28.0644703 0.0000000tri 0.0001036 0.0000003 385.7563999 0.0000000chol 0.0001242 0.0000012 103.1582206 0.0000000I(ageˆ2) 0.0001032 0.0000000 6932.5827948 0.0000000I(triˆ2) 0.0000000 0.0000000 -1.1114373 0.2818524I(cholˆ2) 0.0000000 0.0000000 -0.2724321 0.7885711
kable(confint(model, parm = 2:6, level = 1 - 0.05 / 6))
0.417 % 99.583 %age 0.0000373 0.0000462tri 0.0001028 0.0001044chol 0.0001206 0.0001278I(ageˆ2) 0.0001032 0.0001033I(triˆ2) 0.0000000 0.0000000
12
Problem 5 [10 points]
d = read.table("http://stat.cmu.edu/~larry/secretdata.txt")y = d[,1]x1 = d[,2]x2 = d[,3]x3 = d[,4]x4 = d[,5]
(a)
y
−1.0 −0.5 0.0 0.5 −0.5 0.0 0.5
−3
−1
12
−1.
00.
00.
5
x1
x2
−0.
50.
00.
5
−0.
50.
00.
5
x3
−3 −2 −1 0 1 2 −0.5 0.0 0.5 −1.5 −0.5 0.5 1.5
−1.
5−
0.5
0.5
1.5
x4
Figure 5: Secret Data
(b)
model5 <- lm(y ~ x1 + x2 + x3 + x4)
Summarize the fitted model.
13
14
15
Figure 6: The above residual plot is a spoof of this scene from The Simpsons
Appendix
quartz(height = 12, width = 16)par(mar = c(4,4.5,2,2) + 0.1)par(oma = c(1.5,1,2,1) + 0.1)par(mfrow=c(2,2))par(bg = "azure")
out <- lm(y ~ x1+x2+x3+x4)
plot(x1, residuals(out), col = NA, pch = 19, axes = FALSE, ylab = "Residuals",font.lab = 3, xlab = "", xlim = range(x4), cex.lab=2)
axis(side = 1, at = seq(-2.5,2.5,0.25), labels = as.character(seq(-2.5,2.5,0.25)),font = 5, cex.axis = 1.25)
axis(side = 2, at = seq(-3,2.5,0.5), labels = as.character(seq(-3,2.5,0.5)),font = 5, cex.axis = 1.25)
abline(h = seq(-3,2.5,0.5), col = "gray75", lty = 2)abline(v = seq(-2.5,2.5,0.25), col = "gray80", lty = 2)
16
abline(0,0, lty = 2, col = "gray45")points(x1, residuals(out), col = addTrans("orange",120), pch = 19)points(x1, residuals(out), col = "orange")panel.smooth(x1, residuals(out), col = "orange",cex = 1, lwd = 1.2,
col.smooth = "seagreen", span = 2/3, iter = 3)
plot(x2, residuals(out), col = NA, pch = 19, axes = FALSE, ylab = "Residuals",font.lab = 3, xlab = "", xlim = range(x4), cex.lab=2)
axis(side = 1, at = seq(-2.5,2.5,0.25), labels = as.character(seq(-2.5,2.5,0.25)),font = 5, cex.axis = 1.25)
axis(side = 2, at = seq(-3,2.5,0.5), labels = as.character(seq(-3,2.5,0.5)),font = 5, cex.axis = 1.25)
abline(h = seq(-3,2.5,0.5), col = "gray75", lty = 2)abline(v = seq(-2.5,2.5,0.25), col = "gray80", lty = 2)abline(0,0, lty = 2, col = "gray45")points(x2, residuals(out), col = addTrans("orange",120), pch = 19)points(x2, residuals(out), col = "orange")panel.smooth(x2, residuals(out), col = "orange",cex = 1, lwd = 1.2,
col.smooth = "seagreen", span = 2/3, iter = 3)
plot(x3, residuals(out), col = NA, pch = 19, axes = FALSE, ylab = "Residuals",font.lab = 3, xlab = "", xlim = range(x4), cex.lab=2)
axis(side = 1, at = seq(-2.5,2.5,0.25), labels = as.character(seq(-2.5,2.5,0.25)),font = 5, cex.axis = 1.25)
axis(side = 2, at = seq(-3,2.5,0.5), labels = as.character(seq(-3,2.5,0.5)),font = 5, cex.axis = 1.25)
abline(h = seq(-3,2.5,0.5), col = "gray75", lty = 2)abline(v = seq(-2.5,2.5,0.25), col = "gray80", lty = 2)abline(0,0, lty = 2, col = "gray45")points(x3, residuals(out), col = addTrans("orange",120), pch = 19)points(x3, residuals(out), col = "orange")panel.smooth(x3, residuals(out), col = "orange",cex = 1, lwd = 1.2,
col.smooth = "seagreen", span = 2/3, iter = 3)
plot(x4, residuals(out), col = NA, pch = 19, axes = FALSE, ylab = "Residuals",font.lab = 3, xlab = "", xlim = range(x4), cex.lab=2)
axis(side = 1, at = seq(-2.5,2.5,0.25), labels = as.character(seq(-2.5,2.5,0.25)),font = 5, cex.axis = 1.25)
axis(side = 2, at = seq(-3,2.5,0.5), labels = as.character(seq(-3,2.5,0.5)),font = 5, cex.axis = 1.25)
abline(h = seq(-3,2.5,0.5), col = "gray75", lty = 2)abline(v = seq(-2.5,2.5,0.25), col = "gray80", lty = 2)abline(0,0, lty = 2, col = "gray45")points(x4, residuals(out), col = addTrans("orange",120), pch = 19)points(x4, residuals(out), col = "orange")panel.smooth(x4, residuals(out), col = "orange",cex = 1, lwd = 1.2,
col.smooth = "seagreen", span = 2/3, iter = 3)mtext("Linear Regression Residuals vs. Predictors", side = 3, line = -1.2,
outer = TRUE, font = 3, cex = 2)quartz.save(file = "residuals.png", type = "png")graphics.off()
17
quartz(height = 12, width = 16)par(mar = c(7,4.5,2,2) + 0.1)par(bg = "azure")par(mfrow=c(1,1))plot(out, which = 1, col = NA, pch = 19, axes = FALSE,add.smooth = FALSE,
caption = "", font.lab = 3, sub.caption = "", cex.lab = 2, labels.id = NA)axis(side = 1, at = seq(-3,2.5,0.25), labels = as.character(seq(-3,2.5,0.25)),
font = 5, cex.axis = 1.5)axis(side = 2, at = seq(-3.5,2.5,0.5), labels = as.character(seq(-3.5,2.5,0.5)),
font = 5, cex.axis = 1.5)abline(h = seq(-3,2.5,0.5), col = "gray75", lty = 2)abline(v = seq(-2.5,2.5,0.125), col = "gray80", lty = 2)abline(0,0, lty = 2, col = "gray45")points(fitted(out), residuals(out), col = addTrans("orange",120), pch = 19,
cex = 0.8)points(fitted(out), residuals(out), col = "orange", cex = 0.8)quartz.save(file = "homer.png", type = "png")
18