
Linear Models HW2

1. For the data in problem 2 of Assignment 1, fit a regression model with the durable press rating (i.e., press) as the response and the four other variables as predictors. Present the output.

data1 <- read.table("http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/wrinkle.txt", header = TRUE)

For the EDA of this data set, see problem 2 of Assignment 1.

a. What percentage of variation in the response is explained by these predictors?

fit1 <- lm(press ~ ., data = data1)
summary(fit1)

##
## Call:
## lm(formula = press ~ ., data = data1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.07876 -0.63939 -0.08531  0.36236  1.65332
##
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.912212   0.875484  -1.042   0.3074
## HCHO         0.160726   0.066166   2.429   0.0227 *
## catalyst     0.219783   0.034062   6.452 9.33e-07 ***
## temp         0.011226   0.004973   2.257   0.0330 *
## time         0.101974   0.058735   1.736   0.0948 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8365 on 25 degrees of freedom
## Multiple R-squared:  0.6924, Adjusted R-squared:  0.6432
## F-statistic: 14.07 on 4 and 25 DF,  p-value: 3.845e-06

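Following the problem statement, press is taken as the response and the other four variables as predictors in a linear model. The estimated model is

$\widehat{\text{press}} = -0.912 + 0.161\,\text{HCHO} + 0.220\,\text{catalyst} + 0.011\,\text{temp} + 0.102\,\text{time}$ (1)

Its $R^2$ is

library(dplyr)  # provides %>% and mutate(), both used in this solution
summary(fit1)$r.squared %>% round(4)

## [1] 0.6924

That is, in the model above the four predictors explain about 69% of the variation in press.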
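As a cross-check on what `r.squared` reports, the sketch below recomputes $R^2 = 1 - RSS/TSS$ by hand. It runs on simulated data rather than on wrinkle.txt; the data frame `sim` and its columns `x1`, `x2`, `y` are hypothetical names introduced only for illustration.

```r
# Minimal sketch on simulated data (not the wrinkle data): R^2 = 1 - RSS/TSS.
set.seed(1)
sim <- data.frame(x1 = rnorm(30), x2 = rnorm(30))
sim$y <- 1 + 2 * sim$x1 - sim$x2 + rnorm(30)

fit <- lm(y ~ ., data = sim)

rss <- sum(residuals(fit)^2)           # residual sum of squares
tss <- sum((sim$y - mean(sim$y))^2)    # total sum of squares about the mean
r2_by_hand <- 1 - rss / tss

# Should agree with the value stored in the summary, up to rounding error
all.equal(r2_by_hand, summary(fit)$r.squared)
```

The same two lines for `rss` and `tss` could be applied to `fit1` and `data1$press` to reproduce the 0.6924 reported above.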


NTHU STAT 5410, 2021 Solution to Homework 2

made by 花聖展, 陳昱瑋, 黃照元, 王浚驊 (teaching assistants)


b. Which observation has the largest (positive) residual? Give the case number.

The which.max() function returns the position at which the largest residual occurs:

max_res_index <- which.max(fit1$residuals) %>% as.numeric()
max_res_index

## [1] 9

So the largest residual belongs to the ninth observation. Plotting the residuals against the case index, the position of the largest residual is marked in the figure below.

plot(fit1$residuals, xlab = "index", ylab = "residual")
points(max_res_index, fit1$residuals[max_res_index], col = 2, pch = 19, cex = 1.5)
text(12.55, 1.55, labels = "largest residual")

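[Figure: residuals plotted against case index (x-axis: index, 0 to 30; y-axis: residual, -1.0 to 1.5), with the largest residual highlighted and labeled "largest residual".]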

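c. Compute the mean and median of the residuals.

The mean of the residuals is

mean(fit1$residuals)

## [1] 1.212292e-16

In theory the residual vector is always orthogonal to the intercept column, that is,

$\mathbf{1}^T \hat{\epsilon} = 0$,

so the mean is exactly 0. The computed value is a tiny nonzero number only because of numerical (floating-point) error.

The median is

median(fit1$residuals)

## [1] -0.08531249

If the model assumptions hold, the residual vector follows a normal distribution, namely

$\hat{\epsilon} \sim N(0, (I - H)\sigma^2)$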


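Therefore the median of the residuals, $\text{median}(\hat{\epsilon})$, satisfies $E(\text{median}(\hat{\epsilon})) = 0$.

The value produced by the code, however, is computed from a single realization of the data, so it need not equal 0.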

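d. Compute the correlation of the residuals with the fitted values.

The correlation between the residuals and the fitted values is

cor(fit1$residuals, fit1$fitted.values)

## [1] 1.38365e-16

Geometrically, the fitted values $\hat{y}$ lie in the space $\Omega$ spanned by the columns of $X$, while the residuals $\hat{\epsilon}$ lie in the orthogonal complement $\Omega^{\perp}$, so their inner product is exactly 0. Because the residuals also have mean 0 (they are orthogonal to the intercept column), the centered inner product, and hence the correlation, is 0 as well. The computed value is a tiny nonzero number only because of numerical error.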
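The geometric argument can also be written out algebraically. The lines below are a sketch that assumes $X$ has full column rank and includes the intercept column $\mathbf{1}$, with $H = X(X^T X)^{-1} X^T$ denoting the hat matrix.

```latex
\begin{align*}
X^T \hat{\epsilon} &= X^T (I - H)\, y
  = \bigl(X^T - X^T X (X^T X)^{-1} X^T\bigr)\, y = 0, \\
\hat{y}^T \hat{\epsilon} &= \hat{\beta}^T X^T \hat{\epsilon} = 0,
  \qquad \mathbf{1}^T \hat{\epsilon} = 0
  \quad (\mathbf{1} \text{ is a column of } X), \\
\sum_{i=1}^{n} (\hat{y}_i - \bar{\hat{y}})(\hat{\epsilon}_i - \bar{\hat{\epsilon}})
  &= \hat{y}^T \hat{\epsilon} - n\, \bar{\hat{y}}\, \bar{\hat{\epsilon}} = 0 .
\end{align*}
```

The first line applies equally to any single column of $X$, which is why the correlation between the residuals and an individual predictor is also 0.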

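e. Compute the correlation of the residuals with the formaldehyde concentration (i.e., HCHO).

The correlation between the residuals and HCHO is

cor(fit1$residuals, data1$HCHO)

## [1] 4.030718e-17

Geometrically, the predictor HCHO is one of the columns of the model matrix $X$, so it lies in the space $\Omega$ spanned by $X$ and is therefore orthogonal to $\hat{\epsilon}$, which lies in $\Omega^{\perp}$. Since the residuals also have mean 0, the correlation is 0. The nonzero value in the output is due to numerical error.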

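f. Suppose the temperature was increased by 10 while the other predictors were held constant. Predict the change in the press rating.

According to model (1), with HCHO, catalyst and time held fixed, the estimated change in press when temp increases by 10 is

$\Delta\widehat{\text{press}} = \widehat{\text{press}}(\text{temp} = \text{temp}_1 + 10) - \widehat{\text{press}}(\text{temp} = \text{temp}_1) = 0.011 \times 10 = 0.11$

(0.112 if the unrounded coefficient 0.011226 is used).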
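The same quantity can be obtained from predict() instead of reading the coefficient by eye. The sketch below uses simulated data, not wrinkle.txt; the data frame `sim`, the reference point `base` and the values 140 and 4 are arbitrary choices made only for illustration.

```r
# Sketch on simulated data: raising one predictor by 10 while holding the
# others fixed changes the prediction by 10 times that predictor's coefficient.
set.seed(2)
sim <- data.frame(temp = runif(30, 100, 180), time = runif(30, 1, 7))
sim$press <- 0.5 + 0.01 * sim$temp + 0.1 * sim$time + rnorm(30, sd = 0.5)

fit <- lm(press ~ temp + time, data = sim)

base    <- data.frame(temp = 140, time = 4)   # arbitrary reference point
shifted <- data.frame(temp = 150, time = 4)   # temp raised by 10, time unchanged

delta <- unname(predict(fit, shifted) - predict(fit, base))

# The difference does not depend on the reference point chosen
all.equal(delta, 10 * unname(coef(fit)["temp"]))
```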

g. Add the variable "HCHO-catalyst" to the model as a predictor. Show the regression output. Add the variable "HCHO/catalyst" to the (original) model as a predictor. Show the output. Why is there no real change in the fit for the former model but there is a change for the latter model?

Adding HCHO-catalyst to the linear model as a predictor gives the following summary:

fit1_2 <- lm(press ~ . + I(HCHO - catalyst), data = data1)
summary(fit1_2)

##
## Call:
## lm(formula = press ~ . + I(HCHO - catalyst), data = data1)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -1.07876 -0.63939 -0.08531  0.36236  1.65332
##
## Coefficients: (1 not defined because of singularities)
##                     Estimate Std. Error t value Pr(>|t|)
## (Intercept)        -0.912212   0.875484  -1.042   0.3074
## HCHO                0.160726   0.066166   2.429   0.0227 *
## catalyst            0.219783   0.034062   6.452 9.33e-07 ***
## temp                0.011226   0.004973   2.257   0.0330 *
## time                0.101974   0.058735   1.736   0.0948 .
## I(HCHO - catalyst)        NA         NA      NA       NA
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.8365 on 25 degrees of freedom
## Multiple R-squared:  0.6924, Adjusted R-squared:  0.6432
## F-statistic: 14.07 on 4 and 25 DF,  p-value: 3.845e-06

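The estimate for HCHO-catalyst is NA. The new variable is a linear combination of variables already in the model, so adding it to the model matrix $X$ makes the coefficients unidentifiable, that is, $X^T X$ is not invertible.

summary(lm(press ~ . + I(HCHO / catalyst), data = data1))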

##
## Call:
## lm(formula = press ~ . + I(HCHO/catalyst), data = data1)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -1.4903 -0.3585  0.1610  0.4381  1.4827
##
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -0.443779   0.805380  -0.551  0.58671
## HCHO              0.245017   0.067355   3.638  0.00131 **
## catalyst          0.120458   0.048344   2.492  0.02002 *
## temp              0.012678   0.004497   2.819  0.00949 **
## time              0.099551   0.052724   1.888  0.07115 .
## I(HCHO/catalyst) -0.237598   0.089585  -2.652  0.01395 *
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 0.7508 on 24 degrees of freedom
## Multiple R-squared:  0.7621, Adjusted R-squared:  0.7125
## F-statistic: 15.38 on 5 and 24 DF,  p-value: 8.217e-07

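• In the first summary, every estimate other than the added HCHO-catalyst term is identical to the original fit. Adding HCHO-catalyst does not change the space spanned by the columns of $X$, so the projection of the response press onto that space is unchanged, and so is the fitted model.

• HCHO/catalyst, in contrast, is linearly independent of the other variables (otherwise its $\hat{\beta}$ would be NA), so adding it changes the space spanned by $X$ and the second fit differs from the first. The summary also shows that HCHO/catalyst is not orthogonal to the other variables: if it were, their estimates would stay the same as in the first model.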
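The contrast between a redundant column and a genuinely new one can be reproduced on a small simulated example. The sketch below does not use the wrinkle data; the columns `a`, `b` and `y` are hypothetical stand-ins for two positive predictors and a response.

```r
# Sketch on simulated data: a - b is redundant given a and b, a / b is not.
set.seed(3)
sim <- data.frame(a = runif(30, 1, 10), b = runif(30, 1, 10))
sim$y <- 1 + 0.5 * sim$a + 0.3 * sim$b + rnorm(30)

fit_base  <- lm(y ~ a + b, data = sim)
fit_diff  <- lm(y ~ a + b + I(a - b), data = sim)   # a - b lies in span{1, a, b}
fit_ratio <- lm(y ~ a + b + I(a / b), data = sim)   # a / b is not a linear combination

# The redundant column is aliased (NA) and the fitted values are unchanged
is.na(coef(fit_diff)["I(a - b)"])
all.equal(fitted(fit_base), fitted(fit_diff))

# The ratio enlarges the column space, so the RSS cannot increase
c(base = deviance(fit_base), ratio = deviance(fit_ratio))
```

Here deviance() returns the residual sum of squares of an lm fit, which makes the comparison of the two column spaces explicit.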


2. The data were used to study the relationship between fertility and 5 socioeconomic indicators in 1888 Switzerland. The variables in the data are:

Fertility: a standardized fertility measure for each of 47 French-speaking sub-cantons of Switzerland around 1888,

Agriculture: percent of population involved in agriculture as an occupation,

Examination: percent of draftees receiving highest mark on army examination,

Education: percent of population whose education is beyond primary school,

Catholic: percent of population who are Catholic,

Mortality: percent of live births who live less than 1 year, i.e., infant mortality.

data2 <- read.table("http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/swiss.txt", header=TRUE)
summary(data2)

##   Agriculture     Examination      Education        Catholic
##  Min.   : 1.20   Min.   : 3.00   Min.   : 1.00   Min.   :  2.20
##  1st Qu.:35.90   1st Qu.:12.00   1st Qu.: 6.00   1st Qu.:  5.20
##  Median :54.10   Median :16.00   Median : 8.00   Median : 15.10
##  Mean   :50.66   Mean   :16.49   Mean   :10.98   Mean   : 41.14
##  3rd Qu.:67.65   3rd Qu.:22.00   3rd Qu.:12.00   3rd Qu.: 93.15
##  Max.   :89.70   Max.   :37.00   Max.   :53.00   Max.   :100.00
##    Mortality       Fertility
##  Min.   :10.80   Min.   :35.00
##  1st Qu.:18.15   1st Qu.:64.70
##  Median :20.00   Median :70.40
##  Mean   :19.94   Mean   :70.14
##  3rd Qu.:21.70   3rd Qu.:78.45
##  Max.   :26.60   Max.   :92.50

dim(data2)

## [1] 47 6

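Before analyzing any data set it is still worth a quick sanity check. All variables here are either standardized measures or percentages, so values in the interval [0, 100] are plausible. The data set has 47 observations (n = 47), and the model has 6 regression coefficients including the intercept (p = 6).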

a. Fit a regression model with Fertility as the response and all the other variables as predictors. Compute the estimated covariance matrix of the regression coefficients.

The model is:

$y_{Fertility} = \beta_0 + \beta_1 x_{Agriculture} + \beta_2 x_{Examination} + \beta_3 x_{Education} + \beta_4 x_{Catholic} + \beta_5 x_{Mortality} + \epsilon$

Fit it with lm():

lm2 <- lm(Fertility ~ 1 + Agriculture + Examination + Education + Catholic + Mortality, data = data2)
lm2.summary <- summary(lm2)

The summary of this model:


lm2.summary

##
## Call:
## lm(formula = Fertility ~ 1 + Agriculture + Examination + Education +
##     Catholic + Mortality, data = data2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.2723  -5.2643   0.5014   4.1177  15.3179
##
## Coefficients:
##             Estimate Std. Error t value Pr(>|t|)
## (Intercept) 66.91040   10.70518   6.250 1.91e-07 ***
## Agriculture -0.17210    0.07030  -2.448  0.01873 *
## Examination -0.25778    0.25387  -1.015  0.31587
## Education   -0.87095    0.18300  -4.759 2.42e-05 ***
## Catholic     0.10414    0.03525   2.954  0.00517 **
## Mortality    1.07699    0.38168   2.822  0.00733 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  0.7068, Adjusted R-squared:  0.671
## F-statistic: 19.77 on 5 and 41 DF,  p-value: 5.574e-10

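The fitted model:

$\hat{y}_{Fertility} = \hat{\beta}_0 + \hat{\beta}_1 x_{Agriculture} + \hat{\beta}_2 x_{Examination} + \hat{\beta}_3 x_{Education} + \hat{\beta}_4 x_{Catholic} + \hat{\beta}_5 x_{Mortality}$
$= 66.91040 - 0.17210\,x_{Agriculture} - 0.25778\,x_{Examination} - 0.87095\,x_{Education} + 0.10414\,x_{Catholic} + 1.07699\,x_{Mortality}$

Hence,

$\hat{\epsilon} = y_{Fertility} - (66.91040 - 0.17210\,x_{Agriculture} - 0.25778\,x_{Examination} - 0.87095\,x_{Education} + 0.10414\,x_{Catholic} + 1.07699\,x_{Mortality})$

With vcov(), $\widehat{Cov}(\hat{\beta})$ can be obtained directly:

vcov(lm2)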

##             (Intercept)   Agriculture   Examination    Education      Catholic
## (Intercept) 114.6008309 -0.4848505096 -1.2025717683 -0.281121045 -0.0222242006
## Agriculture  -0.4848505  0.0049414193  0.0043716216  0.004787172 -0.0005106843
## Examination  -1.2025718  0.0043716216  0.0644481217 -0.027302590  0.0051328937
## Education    -0.2811210  0.0047871724 -0.0273025899  0.033487979 -0.0029982134
## Catholic     -0.0222242 -0.0005106843  0.0051328937 -0.002998213  0.0012425131
## Mortality    -3.2651742  0.0065633502  0.0003487616  0.012260841 -0.0027453427
##                 Mortality
## (Intercept) -3.2651741723
## Agriculture  0.0065633502
## Examination  0.0003487616
## Education    0.0122608414
## Catholic    -0.0027453427
## Mortality    0.1456821811


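To compute it by hand, use the formula

$\widehat{Cov}(\hat{\beta}) = (X^T X)^{-1}\hat{\sigma}^2$,

where $X = [\mathbf{1}_n, x_{Agriculture}, x_{Examination}, x_{Education}, x_{Catholic}, x_{Mortality}]$ is the model matrix whose columns are the predictors, and $\hat{\sigma} = \sqrt{\frac{RSS}{n-p}}$.

X2 <- cbind(Intercept = 1, as.matrix(lm2$model)[, 2:6])
solve(t(X2) %*% X2) * (lm2.summary$sigma)^2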

##               Intercept   Agriculture   Examination    Education      Catholic
## Intercept   114.6008309 -0.4848505096 -1.2025717683 -0.281121045 -0.0222242006
## Agriculture  -0.4848505  0.0049414193  0.0043716216  0.004787172 -0.0005106843
## Examination  -1.2025718  0.0043716216  0.0644481217 -0.027302590  0.0051328937
## Education    -0.2811210  0.0047871724 -0.0273025899  0.033487979 -0.0029982134
## Catholic     -0.0222242 -0.0005106843  0.0051328937 -0.002998213  0.0012425131
## Mortality    -3.2651742  0.0065633502  0.0003487616  0.012260841 -0.0027453427
##                 Mortality
## Intercept   -3.2651741723
## Agriculture  0.0065633502
## Examination  0.0003487616
## Education    0.0122608414
## Catholic    -0.0027453427
## Mortality    0.1456821811
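The covariance formula used in the hand computation follows from a short calculation. The sketch below assumes the usual linear model conditions $E(\epsilon) = 0$ and $Cov(\epsilon) = \sigma^2 I_n$, with $X$ of full column rank; these assumptions are not restated in the solution at this point.

```latex
\begin{align*}
\hat{\beta} &= (X^T X)^{-1} X^T y, \\
Cov(\hat{\beta}) &= (X^T X)^{-1} X^T \, Cov(y) \, X (X^T X)^{-1}
  = \sigma^2 (X^T X)^{-1} X^T X (X^T X)^{-1}
  = \sigma^2 (X^T X)^{-1}, \\
\widehat{Cov}(\hat{\beta}) &= \hat{\sigma}^2 (X^T X)^{-1},
  \qquad \hat{\sigma}^2 = \frac{RSS}{n - p} = \frac{RSS}{47 - 6} = \frac{RSS}{41} .
\end{align*}
```

The denominator 41 matches the residual degrees of freedom shown in the regression summary.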

b. Use the residuals from the model in part a as the response in a new model with the same predictors. Compare the regression summary for this new model with the previous summary. Identify the similarities and differences and explain mathematically why this occurred.

The new model:

$\hat{\epsilon} = \beta_{0b} + \beta_{1b} x_{Agriculture} + \beta_{2b} x_{Examination} + \beta_{3b} x_{Education} + \beta_{4b} x_{Catholic} + \beta_{5b} x_{Mortality} + \epsilon_b$

y.residuals <- lm2.summary$residuals
data2 <- data2 %>% mutate(y.residuals)
lm2b <- lm(y.residuals ~ 1 + Agriculture + Examination + Education + Catholic + Mortality, data = data2)
lm2b.summary <- summary(lm2b)

Summary for part b:

lm2b.summary

##
## Call:
## lm(formula = y.residuals ~ 1 + Agriculture + Examination + Education +
##     Catholic + Mortality, data = data2)
##
## Residuals:
##      Min       1Q   Median       3Q      Max
## -15.2723  -5.2643   0.5014   4.1177  15.3179
##
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -3.334e-15  1.071e+01       0        1
## Agriculture -6.773e-18  7.030e-02       0        1
## Examination -3.625e-17  2.539e-01       0        1
## Education    4.377e-17  1.830e-01       0        1
## Catholic    -1.391e-17  3.525e-02       0        1
## Mortality    2.839e-16  3.817e-01       0        1
##
## Residual standard error: 7.165 on 41 degrees of freedom
## Multiple R-squared:  9.401e-32, Adjusted R-squared:  -0.122
## F-statistic: 7.709e-31 on 5 and 41 DF,  p-value: 1

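The estimates $\hat{\beta}$ are all extremely small. They are not exactly 0 only because of numerical error: since $\hat{\epsilon}$ is orthogonal to every column of $X$, they should be 0 in theory.

Fitted result for part b:

$\hat{\hat{\epsilon}} = \hat{\beta}_{0b} + \hat{\beta}_{1b} x_{Agriculture} + \hat{\beta}_{2b} x_{Examination} + \hat{\beta}_{3b} x_{Education} + \hat{\beta}_{4b} x_{Catholic} + \hat{\beta}_{5b} x_{Mortality}$ (2)
$= 0$ (3)

(Remark: the response of this model is $\hat{\epsilon}$, so its fitted value is written $\hat{\hat{\epsilon}}$.)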
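Why every quantity tied to the residuals stays the same can be seen in a few lines. The sketch below writes $\hat{\beta}_b$ for the coefficient vector of the part b model and $RSS_b$, $\hat{\sigma}_b$, $R_b^2$ for its summary quantities; these subscripted symbols are notation introduced here for illustration, and $X$ is assumed to have full column rank.

```latex
\begin{align*}
\hat{\beta}_b &= (X^T X)^{-1} X^T \hat{\epsilon} = (X^T X)^{-1}\, 0 = 0,
  \qquad \hat{\hat{\epsilon}} = X \hat{\beta}_b = 0, \\
RSS_b &= (\hat{\epsilon} - \hat{\hat{\epsilon}})^T (\hat{\epsilon} - \hat{\hat{\epsilon}})
  = \hat{\epsilon}^T \hat{\epsilon} = RSS,
  \qquad \hat{\sigma}_b = \sqrt{\frac{RSS_b}{n - p}} = \hat{\sigma}, \\
R_b^2 &= 1 - \frac{RSS_b}{\sum_i (\hat{\epsilon}_i - \bar{\hat{\epsilon}})^2}
  = 1 - \frac{\hat{\epsilon}^T \hat{\epsilon}}{\hat{\epsilon}^T \hat{\epsilon}} = 0
  \quad (\bar{\hat{\epsilon}} = 0) .
\end{align*}
```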

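Compared with the model in part a, the degrees of freedom used to estimate $\hat{\sigma}$ (= 41), the number of predictors (= 5, the numerator degrees of freedom of the F-statistic), RSS, $\hat{\sigma}$ and the standard errors s.e.($\hat{\beta}$) are all the same, as the two summaries show. What differs is $\hat{\beta}$ (and with it the t-values and p-values) and $R^2$.

These differences arise because the residuals (i.e., $\hat{\epsilon}$) lie in $S^{\perp}$, where $S = span\{\mathbf{1}_n, x_{Agriculture}, x_{Examination}, x_{Education}, x_{Catholic}, x_{Mortality}\}$ (the superscript $\perp$ denotes the orthogonal complement, read "perp", short for perpendicular). In other words, $R^n = S \oplus S^{\perp}$ (where $\oplus$ denotes the direct sum) and $\hat{\epsilon} \in S^{\perp}$. $R^2 = 0$ (the reported 9.401e-32 comes from the numerical error in $\hat{\beta}_b$) likewise shows that these predictors explain none of the variation in $\hat{\epsilon}$. Since RSS and the degrees of freedom for estimating $\hat{\sigma}$ are both unchanged, $\hat{\sigma}$ is unchanged too.

c. Now use the fitted values from the model in part a as the response in a new model with the same predictors. Compare the regression summary for this new model with the first summary. Identify the similarities and differences and explain mathematically why this occurred.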

From the previous part:

$\hat{y}_{Fertility} = 66.91040 - 0.17210\,x_{Agriculture} - 0.25778\,x_{Examination} - 0.87095\,x_{Education} + 0.10414\,x_{Catholic} + 1.07699\,x_{Mortality}$

So the model for part c is:

$\hat{y}_{Fertility} = \beta_{0c} + \beta_{1c} x_{Agriculture} + \beta_{2c} x_{Examination} + \beta_{3c} x_{Education} + \beta_{4c} x_{Catholic} + \beta_{5c} x_{Mortality} + \epsilon_c$

data2 <- data2 %>% mutate(fitted = fitted(lm2))
lm2c <- lm(fitted ~ 1 + Agriculture + Examination + Education + Catholic + Mortality, data = data2)
lm2c.summary <- summary(lm2c)

## Warning in summary.lm(lm2c): essentially perfect fit: summary may be unreliable

Summary of the part c model:

lm2c.summary

##
## Call:
## lm(formula = fitted ~ 1 + Agriculture + Examination + Education +
##     Catholic + Mortality, data = data2)
##
## Residuals:
##        Min         1Q     Median         3Q        Max
## -4.944e-14 -1.896e-15 -3.270e-16  4.276e-15  2.108e-14
##
## Coefficients:
##               Estimate Std. Error    t value Pr(>|t|)
## (Intercept)  6.691e+01  1.433e-14  4.669e+15   <2e-16 ***
## Agriculture -1.721e-01  9.409e-17 -1.829e+15   <2e-16 ***
## Examination -2.578e-01  3.398e-16 -7.586e+14   <2e-16 ***
## Education   -8.709e-01  2.450e-16 -3.556e+15   <2e-16 ***
## Catholic     1.041e-01  4.718e-17  2.207e+15   <2e-16 ***
## Mortality    1.077e+00  5.109e-16  2.108e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 9.59e-15 on 41 degrees of freedom
## Multiple R-squared:      1, Adjusted R-squared:      1
## F-statistic: 1.103e+31 on 5 and 41 DF,  p-value: < 2.2e-16

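The fitted result for part c is:

$\hat{\hat{y}}_{Fertility} = 66.9104 - 0.1721\,x_{Agriculture} - 0.2578\,x_{Examination} - 0.8710\,x_{Education} + 0.1041\,x_{Catholic} + 1.0770\,x_{Mortality} \; (= \hat{y}_{Fertility})$

(Remark: the response of this model is $\hat{y}$, so its fitted value is written $\hat{\hat{y}}$.)

Note that $\hat{\hat{y}}_{Fertility} = \hat{y}_{Fertility}$, so this is a perfect fit, i.e. $RSS = 0$.

Compared with the model in part a, $\hat{\beta}$, the degrees of freedom used to estimate $\hat{\sigma}$ (= 41) and the number of predictors (= 5) are the same. What differs is s.e.($\hat{\beta}$), $\hat{\sigma}$, RSS and $R^2$.

These differences arise because $\hat{y} \in S = span\{\mathbf{1}_n, x_{Agriculture}, x_{Examination}, x_{Education}, x_{Catholic}, x_{Mortality}\}$. $R^2 = 1$ likewise shows that these predictors explain all of the variation in $\hat{y}$, so $\hat{\beta}$ equals the estimates from part a and RSS = 0 (the reported residual standard error 9.59e-15 is nonzero only because of rounding in the numerical computation of $\hat{\beta}$).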
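Both results, for part b and part c, can be checked numerically on a small example. The sketch below uses simulated data rather than swiss.txt; the data frame `sim` and the model names `fit_a`, `fit_b`, `fit_c` are hypothetical and chosen only to mirror the three parts.

```r
# Sketch on simulated data: refit with the residuals, then with the fitted
# values, of a first regression as the response.
set.seed(4)
sim <- data.frame(x1 = rnorm(40), x2 = rnorm(40))
sim$y <- 3 + 1.5 * sim$x1 - 2 * sim$x2 + rnorm(40)

fit_a <- lm(y ~ x1 + x2, data = sim)
sim$res <- residuals(fit_a)
sim$fit <- fitted(fit_a)

fit_b <- lm(res ~ x1 + x2, data = sim)   # response = residuals of fit_a
fit_c <- lm(fit ~ x1 + x2, data = sim)   # response = fitted values of fit_a

# Part b pattern: coefficients are numerically zero, RSS equals that of fit_a
max(abs(coef(fit_b)))
all.equal(deviance(fit_b), deviance(fit_a))

# Part c pattern: coefficients match fit_a, RSS is numerically zero
all.equal(coef(fit_c), coef(fit_a))
deviance(fit_c)
```

Calling summary(fit_c) here would typically raise the same "essentially perfect fit" warning shown above, since the residual variance is zero up to rounding.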


3. The data set gives information on capital, labor and value added for each of three economic sectors: Food and kindred products (20), electrical and electronic machinery, equipment and supplies (36) and transportation equipment (37). For each sector:

On first inspection of the data set, the top header row has only 3 columns (the part marked in red in the figure below).

Figure 1: original dataset

The columns are therefore tidied before analysis (the following is one possible approach). The same variable measured in different sectors is still treated as a single variable (e.g., labor for sector 20 and labor for sector 36 both represent the labor variable), so columns representing the same variable are stacked into one column. Each original column has 15 observations and there are 3 sectors, so after stacking each variable has 45 observations, and a new variable economic_sectors records which sector each row belongs to. The head() output below shows the processed data.

data3 <- read.csv("http://www.stat.nthu.edu.tw/~swcheng/Teaching/stat5410/data/E2.9.txt", header = T)
data3 <- sapply(data3[-17,], function(x){strsplit(x, split = " ")})
data3 <- data3 %>% lapply(function(x){x[x != ""]})
x <- data3[-1] %>% lapply(as.numeric) %>% rbind.data.frame() %>% t() %>% `dimnames<-`(list(1:15, 1:10))
data3 <- cbind(rep(x[,1], 3), rbind(x[,3*1:3-1], x[,3*1:3], x[,3*1:3+1]))
economic_sectors <- c(rep(20, 15), rep(36, 15), rep(37, 15))
data3 <- cbind(data3, economic_sectors) %>% as.data.frame()
colnames(data3) <- c("year", "capital", "labor", "real_value", "economic_sectors")
head(data3)

##    year capital  labor real_value economic_sectors
## X1   72  243462 708014    6496.96               20
## X2   73  252402 699470    5587.34               20
## X3   74  246243 697628    5521.32               20
## X4   75  263639 674830    5890.64               20
## X5   76  276938 685836    6548.57               20
## X6   77  290910 678440    6744.80               20


a. Consider the model $V_t = \alpha K_t^{\beta_1} L_t^{\beta_2} \epsilon_t$, where the subscript t indicates year, $V_t$ is value added, $K_t$ is capital, $L_t$ is labor, and $\epsilon_t$ is an error term with $E(\log(\epsilon_t)) = 0$ and $Var(\log(\epsilon_t))$ a constant. Assuming that the errors are independent, and taking logs of both sides of the above model, estimate $\beta_1$ and $\beta_2$.

$\log(V_t) = \log(\alpha K_t^{\beta_1} L_t^{\beta_2} \epsilon_t) = \log(\alpha) + \log(K_t)\beta_1 + \log(L_t)\beta_2 + \log(\epsilon_t)$

From this equation, $\log(K_t)$ and $\log(L_t)$ serve as the new predictors, $\log(V_t)$ as the response, $\log(\alpha)$ as the intercept, and $\log(\epsilon_t)$ as an error term satisfying the linear model assumptions.

Estimate the parameters of the linear model with lm() and print the coefficients:

fit11 <- lm(data = data3, log(real_value) ~ log(capital) + log(labor), subset = (economic_sectors == 20))
fit12 <- lm(data = data3, log(real_value) ~ log(capital) + log(labor), subset = (economic_sectors == 36))
fit13 <- lm(data = data3, log(real_value) ~ log(capital) + log(labor), subset = (economic_sectors == 37))
fit11$coefficients

##  (Intercept) log(capital)   log(labor)
##   25.4928845    0.2268538   -1.4584782

fit12$coefficients

##  (Intercept) log(capital)   log(labor)
##   -1.2332115    0.5260689    0.2543206

fit13$coefficients

##  (Intercept) log(capital)   log(labor)
##   -9.6259339    0.5056509    0.8454644

Economic Sectors   $\hat{\beta}_1$   $\hat{\beta}_2$
(20)               0.2268538         -1.4584782
(36)               0.5260689          0.2543206
(37)               0.5056509          0.8454644

b. The model given in part a above is said to be of the Cobb-Douglas form. It is easier to interpret if $\beta_1 + \beta_2 = 1$. Estimate $\beta_1$ and $\beta_2$ under this constraint.

First, use the constraint to replace $\beta_2$ with $1 - \beta_1$:

$\log(V_t) = \log(\alpha) + \beta_1 \log(K_t) + (1 - \beta_1)\log(L_t) + \log(\epsilon_t)$
$= \log(\alpha) + \beta_1 (\log(K_t) - \log(L_t)) + \log(L_t) + \log(\epsilon_t)$
$= \log(\alpha) + \beta_1 \log\left(\frac{K_t}{L_t}\right) + \log(L_t) + \log(\epsilon_t)$

So $\log\left(\frac{K_t}{L_t}\right)$ can be used as the predictor and $\log(V_t)$ as the response, with $\log(L_t)$ as an offset (an offset term enters the model with its coefficient fixed at 1).

fit21 <- lm(data = data3, log(real_value) ~ log(capital/labor), subset = (economic_sectors == 20), offset = log(labor))
fit22 <- lm(data = data3, log(real_value) ~ log(capital/labor), subset = (economic_sectors == 36), offset = log(labor))
fit23 <- lm(data = data3, log(real_value) ~ log(capital/labor), subset = (economic_sectors == 37), offset = log(labor))
c(fit21$coefficients[2], 1 - fit21$coefficients[2])

## log(capital/labor) log(capital/labor)
##          1.2896953         -0.2896953


c(fit22$coefficients[2], 1 - fit22$coefficients[2])

## log(capital/labor) log(capital/labor)
##         0.90008876         0.09991124

c(fit23$coefficients[2], 1 - fit23$coefficients[2])

## log(capital/labor) log(capital/labor)
##        0.009608932        0.990391068

Economic Sectors   $\hat{\beta}_1$   $\hat{\beta}_2$
(20)               1.2896953         -0.2896953
(36)               0.90008876         0.09991124
(37)               0.009608932        0.990391068
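The offset formulation is equivalent to moving $\log(L_t)$ to the left-hand side and regressing $\log(V_t / L_t)$ on $\log(K_t / L_t)$. The sketch below illustrates this on simulated data, not on E2.9.txt; the data frame `sim`, its columns `capital`, `labor`, `value`, and the generating values 0.4 and 0.6 are assumptions made only for the example.

```r
# Sketch on simulated data: constrained Cobb-Douglas fit (beta1 + beta2 = 1)
# written two equivalent ways.
set.seed(5)
n <- 15
sim <- data.frame(capital = exp(rnorm(n, 12, 0.3)),
                  labor   = exp(rnorm(n, 13, 0.2)))
sim$value <- exp(0.5 + 0.4 * log(sim$capital) + 0.6 * log(sim$labor) +
                 rnorm(n, sd = 0.05))

# Version 1: log(labor) enters as an offset, i.e. with coefficient fixed at 1
fit_offset <- lm(log(value) ~ log(capital / labor), data = sim,
                 offset = log(labor))

# Version 2: subtract log(labor) from the response instead
fit_moved <- lm(log(value / labor) ~ log(capital / labor), data = sim)

beta1 <- unname(coef(fit_offset)[2])
c(beta1 = beta1, beta2 = 1 - beta1)   # the two estimates sum to 1 by construction

all.equal(unname(coef(fit_offset)), unname(coef(fit_moved)))
```

Naming the two elements explicitly, as in the `c(beta1 = ..., beta2 = ...)` line, also avoids the duplicated `log(capital/labor)` labels seen in the printed output above.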

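c. Sometimes the model $V_t = \alpha \gamma^t K_t^{\beta_1} L_t^{\beta_2} \epsilon_t$ is considered, where $\gamma^t$ is assumed to account for technological development. Estimate $\beta_1$ and $\beta_2$ for this model.

From the problem statement, the model under consideration is

$\log(V_t) = \log(\alpha) + \log(\gamma)\,t + \log(K_t)\beta_1 + \log(L_t)\beta_2 + \log(\epsilon_t)$.

The newly added term $\log(\gamma)\,t$ enters the linear model as a predictor (the year $t$, with coefficient $\log(\gamma)$). Its interpretation is that technology changes over time and thereby affects real value added.

Finally, the estimates are read off the coefficients of the models fitted with lm().

fit31 <- lm(data = data3, log(real_value) ~ year + log(capital) + log(labor), subset = (economic_sectors == 20))
fit32 <- lm(data = data3, log(real_value) ~ year + log(capital) + log(labor), subset = (economic_sectors == 36))
fit33 <- lm(data = data3, log(real_value) ~ year + log(capital) + log(labor), subset = (economic_sectors == 37))
fit31$coefficients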

##  (Intercept)         year log(capital)   log(labor)
##  19.55432670   0.01095197   0.04436007  -0.90823598

fit32$coefficients

##  (Intercept)         year log(capital)   log(labor)
## -15.41454402   0.02496758   0.82098254   0.88248951

fit33$coefficients

##   (Intercept)          year  log(capital)    log(labor)
## -10.027158583   0.004579341   0.158555457   1.195294252

Economic Sectors   $\hat{\beta}_1$   $\hat{\beta}_2$
(20)               0.04436007        -0.90823598
(36)               0.82098254         0.88248951
(37)               0.158555457        1.195294252

d. Estimate $\beta_1$ and $\beta_2$ in the model in part c, under the constraint $\beta_1 + \beta_2 = 1$.

First, substitute the constraint into the model of part c and simplify, that is:


$\log(V_t) = \log(\alpha) + \log(\gamma)\,t + \log(K_t)\beta_1 + \log(L_t)(1 - \beta_1) + \log(\epsilon_t)$
$= \log(\alpha) + \log(\gamma)\,t + (\log(K_t) - \log(L_t))\beta_1 + \log(L_t) + \log(\epsilon_t)$
$= \log(\alpha) + \log(\gamma)\,t + \log\left(\frac{K_t}{L_t}\right)\beta_1 + \log(L_t) + \log(\epsilon_t)$

After simplification, the model uses $t$ and $\log\left(\frac{K_t}{L_t}\right)$ as predictors and $\log(V_t)$ as the response, with the coefficient of $\log(L_t)$ fixed at 1 (again specified as an offset in lm()).

fit41 <- lm(data = data3, log(real_value) ~ year + log(capital/labor), subset = (economic_sectors == 20), offset = log(labor))
fit42 <- lm(data = data3, log(real_value) ~ year + log(capital/labor), subset = (economic_sectors == 36), offset = log(labor))
fit43 <- lm(data = data3, log(real_value) ~ year + log(capital/labor), subset = (economic_sectors == 37), offset = log(labor))
c(fit41$coefficients[3], 1 - fit41$coefficients[3])

## log(capital/labor) log(capital/labor)
##         -0.4947025          1.4947025

c(fit42$coefficients[3], 1 - fit42$coefficients[3])

## log(capital/labor) log(capital/labor)
##         0.03450154         0.96549846

c(fit43$coefficients[3], 1 - fit43$coefficients[3])

## log(capital/labor) log(capital/labor)
##         -0.3168157          1.3168157

Economic Sectors   $\hat{\beta}_1$   $\hat{\beta}_2$
(20)               -0.4947025        1.4947025
(36)                0.03450154       0.96549846
(37)               -0.3168157        1.3168157

 
