Analysis of the Boston Housing Data from the 1970 census:
Diverse Tests and Model Selection Processes regarding the
Variables in Boston Housing Data
Shuai Yuan
December 8, 2016
Abstract
In this project, we study the Boston Housing data introduced by Harrison and Rubinfeld (1978). The data contain variables related to Boston housing for 506 census tracts from the 1970 census and are included in the R package mlbench. Using the data and R, we first examine the scatterplot matrix and the correlations between variables to get a first view of their relationships. We then test several null hypotheses to examine the properties of different models. Finally, we perform model selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria to find and analyze the best-fitting model for our data. At the same time, we also compute the sum of squared prediction errors (SSPE) on a subset of the data.
Contents
1 Introduction
2 Analysis
  2.1 Analysis of the linearity between variables
    2.1.1 Scatterplot matrix for variables
    2.1.2 Explanation of correlation between two variables
  2.2 Statistical tests for null hypotheses about the fitted model
  2.3 Model selection by using the forward algorithm
  2.4 Model selection by using the backward algorithm
  2.5 Model selection by using the AIC and BIC criteria
  2.6 Analysis of the related statistics
    2.6.1 Fitting the model using a subset of the data
    2.6.2 Computing and analyzing the SSPE for a subset of the data
3 Conclusion
4 Appendix
1 Introduction
The Boston Housing data from the 1970 census are used in this project. The dataset contains 14 variables with 506 observations and is included in the R package mlbench.
In this project, we used several tools to analyze the Boston Housing data; the most frequently used method is linear regression. We also used hypothesis testing (t-tests and F-tests) and model selection to analyze the properties of the data. Using the data and R, we first examined the scatterplot matrix and the correlations between variables to get a first view of their relationships. Then, we tested several null hypotheses to examine the properties of different models. Finally, we performed model selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria to find and analyze the best-fitting model for our data. At the same time, we also computed the SSPE on a subset of the data.
The outline for the remainder of the paper is as follows. Section 2 provides the main results and analysis. Section 3 concludes. The Appendix contains our R code and related output, together with the references used in this project. Specifically, Section 2.1.1 addresses question 1, Section 2.1.2 question 2, Section 2.2 question 3, Section 2.3 question 4, Section 2.4 question 5, Section 2.5 question 6, and Section 2.6 question 7.
2 Analysis
To get a brief understanding of the relationships between the variables, we begin with a scatterplot matrix of several variables and find nonlinearity between some of them; the correlation coefficient may therefore not be appropriate for describing these relationships. We then compute test statistics and test several hypotheses about the general model. Moreover, we perform variable selection using the forward algorithm, the backward algorithm, and the AIC and BIC criteria. We find that both criteria select the same model, and we explain why the selected model is the one we need. Finally, we fit the selected model on a subset of the data and compute and compare the SSPE of the selected models.
2.1 Analysis of the linearity between variables
2.1.1 Scatterplot matrix for variables
First, according to the documentation of the R package mlbench, the four variables in the scatterplot matrix below have the following meanings:
nox: nitric oxides concentration (parts per 10 million).
indus: proportion of non-retail business acres per town.
dis: weighted distances to five Boston employment centres.
tax: full-value property-tax rate per USD 10,000.
Plot 1: Scatterplot matrix for the variables nox, indus, dis, tax
According to the scatterplot matrix, these four variables are all related in some pattern. For instance, generally speaking, the variable nox is negatively related to the variable dis, and the variable indus is also negatively related to dis. On the other hand, the relationships between the other variable pairs tend to be positive at low values, while they become vague and unrelated at high values.
We can also find possible explanations from the meanings of these variables. The variable nox, the nitric oxides concentration, represents the degree of air pollution in an area. The variable dis, the weighted distance to five Boston employment centres, represents how far an area is from downtown. The variable indus, the proportion of non-retail business acres, represents how industrial an area is. Areas far from downtown tend to have cleaner air, since industry and traffic are concentrated around the employment centres, so it is reasonable to see a negative relationship between nox and dis. Likewise, non-retail business is concentrated near the employment centres, so the proportion of industrial land falls with distance from downtown, which explains the negative relationship between indus and dis.
2.1.2 Explanation of Correlation between two variables
We know that the correlation coefficient between two variables X and Y is defined as

ρ_XY = Cov(X, Y) / ( √D(X) · √D(Y) ),

where D(·) denotes the variance. From the R output, the correlation between the variable nox and the variable dis is about -0.7692301, which suggests that these two variables are negatively correlated.
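The definition above can be sketched directly. The following is a minimal Python illustration with made-up data (the paper's own analysis is in R); it also shows why a correlation can be near zero even when two variables have a strong nonlinear relationship, which is exactly the caveat for nox and dis discussed next.

```python
import math

def pearson_corr(x, y):
    """Sample Pearson correlation: cov(X, Y) / (sd(X) * sd(Y))."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y)) / (n - 1)
    sx = math.sqrt(sum((a - mx) ** 2 for a in x) / (n - 1))
    sy = math.sqrt(sum((b - my) ** 2 for b in y) / (n - 1))
    return cov / (sx * sy)

# A perfectly linear relationship gives |r| = 1 ...
x = [1.0, 2.0, 3.0, 4.0, 5.0]
print(round(pearson_corr(x, [2 * v for v in x]), 6))

# ... but a strong nonlinear (here quadratic, symmetric) relationship
# can give r near 0, so r alone is a poor summary of dependence.
x2 = [-2.0, -1.0, 0.0, 1.0, 2.0]
print(round(pearson_corr(x2, [v ** 2 for v in x2]), 6))
```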
However, we should not forget that the correlation coefficient only measures the strength of a linear relationship between two variables. The scatterplot shows that the relationship between nox and dis looks closer to an exponential decay than to a straight line, so the correlation coefficient is not a reasonable way to quantify the relationship between them.
We can also check this by fitting a model between the two variables. Assuming a nonlinear relation between them, the fitted model has a highly significant p-value. Therefore, according to the discussion above, we can safely conclude that we cannot use the correlation between nox and dis to quantify the strength of their relationship.
2.2 Statistical tests for null hypotheses about the fitted model
For this question, the given full model contains an intercept and five terms:

nox = β0 + β1·dis + β2·log(dis) + β3·dis² + β4·indus + β5·tax + ε.

Here β0 is the intercept, β1 measures the change in nox per unit increase in dis, β2 the change per unit increase in log(dis), β3 the change per unit increase in dis², β4 the change per unit increase in indus, and β5 the change per unit increase in tax. Since three of these variables are already in the data set, we only need to construct the remaining two, log(dis) and dis². In R we create new variables named logdis and dissquare to represent them.
For this section, since we want to test whether specified parameters are equal to 0 or to each other, we use an F-test for all three sub-questions. The F-statistic for comparing a reduced model R against the full model F is

F = [ (SSE_R − SSE_F) / (df_R − df_F) ] / ( SSE_F / df_F ),

where SSE denotes the residual sum of squares of a model and df its residual degrees of freedom.
The F-statistics and corresponding p-values for the three sub-questions are summarized below:

question     F-statistic   p-value
(a)          5.911         0.0154
(b)          6.0524        0.002528
(c)          42.80353      < 0.0001

Table 1 The F-statistics and p-values for questions (a), (b), and (c). We use these values to evaluate each question.
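The nested-model F-statistic above can be sketched as follows. This is a minimal Python illustration on synthetic data (the paper's own computations use R's anova; the variable names and coefficients here are made up, with log(x) genuinely in the truth so the test should reject).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(1.0, 10.0, n)
# Synthetic truth in which log(x) genuinely matters
y = 0.5 - 0.5 * np.log(x) + rng.normal(0.0, 0.05, n)

def sse(X, y):
    """Residual sum of squares of an OLS fit."""
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    return float(r @ r)

# Full model: intercept + x + log(x); reduced model drops log(x)
# (H0: the coefficient of log(x) equals 0)
X_full = np.column_stack([np.ones(n), x, np.log(x)])
X_red = np.column_stack([np.ones(n), x])
sse_full, sse_red = sse(X_full, y), sse(X_red, y)
df_full, df_red = n - X_full.shape[1], n - X_red.shape[1]

# F = [(SSE_R - SSE_F) / (df_R - df_F)] / (SSE_F / df_F)
F = ((sse_red - sse_full) / (df_red - df_full)) / (sse_full / df_full)
print(round(F, 2))
```

A large F (relative to the F(df_R − df_F, df_F) critical value) leads to rejecting the reduced model, which is the logic applied in questions (a) and (b).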
Question a:
The null hypothesis is that the coefficient of the variable log(dis) equals 0, that is, H0: β2 = 0. Since log(dis) is the only term of interest here, we simply fit a new regression model that omits log(dis) and compare it with the original full model using an F-test. From the R output, the F-statistic is 5.911 and the corresponding p-value is 0.0154. Whether we reject the null hypothesis depends on the significance level α. Setting α = 0.05, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that β2 is not equal to 0 at the 5% significance level. However, if we require the 1% significance level, α = 0.01, the p-value is larger than 0.01, so we cannot reject the null hypothesis at that level.
Question b:
For part (b), we test whether the coefficients of dis and dis² are both equal to 0. Since the question concerns only these two terms, we can proceed as in part (a): we fit another regression model that keeps the intercept and the remaining three terms but drops dis and dis², and compare it with the original full model using an F-test. The null hypothesis is H0: β1 = β3 = 0, and the alternative is that at least one of them is not 0. The F-statistic is 6.0524 with corresponding p-value 0.002528. Setting α = 0.05 as before, the p-value is smaller than 0.05, so we reject the null hypothesis and conclude that at least one of β1, β3 is not equal to 0.
Question c:
The situation in part (c) is different: we test H0: β2 = β3 = 0 and β4 = β5 simultaneously. Instead of fitting a reduced model, we express the restrictions in matrix form as A·β = 0. We split the first restriction (β2 = β3 = 0) into β2 = 0 and β3 = 0: the first row of A has a 1 in the position of β2 and 0 elsewhere, and the second row has a 1 in the position of β3 and 0 elsewhere, so the first two entries of A·β are β2 and β3. To handle β4 = β5, the third row of A has a 1 in the position of β4 and a −1 in the position of β5, so the third entry of A·β is β4 − β5. We then test whether A·β = 0 with an F-test. The F-statistic is 42.80353 and the corresponding p-value is less than 0.0001. Setting α = 0.05, the p-value is clearly smaller than α, so we reject the null hypothesis and conclude that at least one of β2, β3 is not equal to 0 or that β4 ≠ β5.
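The matrix construction above can be sketched numerically. This is a Python illustration on synthetic data (the paper's version is the R code in the Appendix): the true coefficients are chosen so that H0 actually holds, and the statistic F = (Aβ̂)ᵀ (A·V̂·Aᵀ)⁻¹ (Aβ̂) / q, with V̂ the estimated covariance of β̂ and q the number of restrictions, should then be small.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 300, 6  # intercept + 5 slopes, mirroring the shape of the full model
X = np.column_stack([np.ones(n), rng.normal(size=(n, p - 1))])
beta_true = np.array([1.0, 0.3, 0.0, 0.0, 2.0, 2.0])  # b2 = b3 = 0 and b4 = b5 hold
y = X @ beta_true + rng.normal(0.0, 0.1, n)

beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat
sigma2 = float(resid @ resid) / (n - p)
V = sigma2 * np.linalg.inv(X.T @ X)  # estimated covariance matrix of beta-hat

# Rows of A encode the restrictions b2 = 0, b3 = 0, b4 - b5 = 0
A = np.array([[0, 0, 1, 0, 0, 0],
              [0, 0, 0, 1, 0, 0],
              [0, 0, 0, 0, 1, -1]], dtype=float)
q = A.shape[0]
Ab = A @ beta_hat
F = float(Ab @ np.linalg.solve(A @ V @ A.T, Ab)) / q
print(round(F, 3))
```

Under H0 this F is approximately F(q, n − p) distributed, so values far above its critical value, as in question (c), indicate that the restrictions do not hold.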
2.3 Model selection by using the forward algorithm
In this section, we use the forward algorithm to analyze the relationship between the response variable and the potential explanatory variables below. According to the question's requirements, the original variables are transformed as follows.
Response variable:
log(medv), the natural logarithm of the median value of owner-occupied homes in USD 1000's.
Potential explanatory variables:
rm², the square of the average number of rooms per dwelling.
log(dis), the natural logarithm of the weighted distance to five Boston employment centres.
age, the proportion of owner-occupied units built prior to 1940.
We performed variable selection using the forward algorithm at a 5% significance level. We first regressed the response on each candidate term separately, in models named "forward11" to "forward14" (details in the Appendix). The results are summarized below:
name model variable t - value Pr(>|t|)
forward11 log(medv) ~ 1 intercept 167 <2e-16
forward12 log(medv) ~ rm^2 - 1 rm^2 130 <2e-16
forward13 log(medv) ~ age - 1 age 44.84 <2e-16
forward14 log(medv) ~ log(dis) - 1 log(dis) 54.66 <2e-16
Table 2 The summary of different models from forward11 to forward14
We observe from the table that while all the terms are significant, the intercept has the largest t-value, so we add the intercept to the model first. Next, we regressed the response on the intercept together with each of the three remaining variables, in the models "forward21" to "forward23". The summarized results are shown in the table below:
name model variable t - value Pr(>|t|)
forward21 log(medv) ~ rm^2 rm^2 18.8 <2e-16
forward22 log(medv) ~ age age 11.42 <2e-16
forward23 log(medv) ~ log(dis) log(dis) 9.965 <2e-16
Table 3 The summary of different models from forward21 to forward23
As shown in the table, all the variables are significant; however, rm^2 has the largest t-value, so we added rm^2 to the model. Then we tested adding log(dis) and age separately to the model with rm^2 and the intercept, in the models named "forward31" and "forward32". We got the following table:
name model variable t - value Pr(>|t|)
forward31 log(medv) ~ rm^2 + log(dis) log(dis) 8.269 1.21e-15
forward32 log(medv) ~ rm^2 + age age -10.23 <2e-16
Table 4 The summary of different models from forward31 to forward32
From the results above, both variables are significant, but age has a smaller p-value (larger |t|) than log(dis). Therefore, we added the variable age to the model. Finally, we regressed the response variable log(medv) on all three variables in the model "forward41".
name model variable t - value Pr(>|t|)
forward41 log(medv) ~ rm^2 + age + log(dis) log(dis) 1.068 0.286
Table 5 The summary of the model forward41
Based on the table above, the variable log(dis) is not significant in the model, so we removed it. Therefore, after the forward selection, our final model is

log(medv) = β0 + β1·rm² + β2·age + ε,

where ε is the error term.
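The forward procedure can be sketched schematically. The following is a simplified Python version on synthetic data (the paper's actual fits are the R models forward11–forward41): for simplicity the intercept is kept from the start, the candidate names x1–x3 are hypothetical, and the 5% level is approximated by |t| > 1.96; x3 is pure noise, so the selected set should contain the two real signals.

```python
import numpy as np

def t_values(X, y):
    """OLS t-statistics for the columns of the design matrix X."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    sigma2 = float(resid @ resid) / (n - p)
    se = np.sqrt(sigma2 * np.diag(np.linalg.inv(X.T @ X)))
    return beta / se

rng = np.random.default_rng(2)
n = 400
cols = {"x1": rng.normal(size=n), "x2": rng.normal(size=n), "x3": rng.normal(size=n)}
y = 1.0 + 2.0 * cols["x1"] - 1.0 * cols["x2"] + rng.normal(size=n)  # x3 is noise

selected = ["intercept"]
design = [np.ones(n)]
remaining = list(cols)
while remaining:
    # |t| of each candidate when it is added to the current model
    scores = {c: abs(t_values(np.column_stack(design + [cols[c]]), y)[-1])
              for c in remaining}
    best = max(scores, key=scores.get)
    if scores[best] < 1.96:  # nothing left is significant at ~5%: stop
        break
    selected.append(best)
    design.append(cols[best])
    remaining.remove(best)

print(selected)
```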
2.4 Model selection by using the backward algorithm
In this section, we use the backward algorithm to analyze the relationship between the response variable and the potential explanatory variables, using the transformed variables defined in the previous section. We performed variable selection using the backward algorithm at a 5% significance level. For the backward algorithm, we first regressed the response on all the variables, in the model named "backward11" (details in the Appendix). The results are summarized below:
name model variable t - value Pr(>|t|)
backward11 log(medv) ~ rm^2 + age + log(dis) intercept 21.224 <2e-16
rm^2 17.676 <2e-16
age -5.758 1.48e-08
log(dis) 1.068 0.286
Table 6 The summary of the model backward11
Based on the results, all the explanatory variables are significant except log(dis), whose t-value is 1.068 and p-value is 0.286. Thus, we removed log(dis) and built a new model with the remaining variables, called "backward21". Here are the results:
name model variable t - value Pr(>|t|)
backward21 log(medv) ~ rm^2 + age intercept 32.12 <2e-16
rm^2 17.85 <2e-16
age -10.23 <2e-16
Table 7 The summary of the model backward21
After deleting log(dis) from the model, the remaining variables are all significant, so we stop with the model "backward21". This is the same model as the one selected by the forward algorithm:

log(medv) = β0 + β1·rm² + β2·age + ε,

where ε is the error term.
2.5 Model selection by using the AIC and BIC criteria
First, we can do a preliminary analysis of the full model of interest. In the full linear regression model, the t-values and p-values determine whether each variable is significant. Setting α = 0.05, we see that rm², age, and the intercept all have p-values below 0.05 and are therefore significant, while log(dis) has a p-value of 0.286 and is not significant.
In this section, we perform variable selection using the AIC and BIC criteria. AIC measures the relative quality of statistical models for a given set of data: given a collection of candidate models, AIC estimates the quality of each model relative to the others, and hence provides a means for model selection. BIC is likewise a criterion for selection among a finite set of models, and the model with the lowest BIC score is preferred. The formulas for AIC and BIC are

AIC(M) = n · log(SSE_M / n) + 2 · p_M
BIC(M) = n · log(SSE_M / n) + log(n) · p_M

where M is a regression model, n is the sample size, SSE_M is the residual sum of squares of M, and p_M denotes the number of variables in the model M. In this project the sample size is 506, and all we need to do is fit every candidate regression model in R and compute the corresponding AIC and BIC scores. The candidate models are summarized below:
Candidate Models AIC Score BIC Score
log(medv) ~ 1 -904.371 -900.145
log(medv) ~ rm^2 - 1 -659.289 -655.063
log(medv) ~ age - 1 321.927 326.154
log(medv) ~ log(dis) - 1 155.969 160.195
log(medv) ~ rm^2 + log(dis) - 1 -750.189 -741.736
log(medv) ~ log(dis) + age - 1 -533.471 -525.018
log(medv) ~ rm^2 + age - 1 -702.556 -694.102
log(medv) ~ age -1018.83 -1010.378
log(medv) ~ rm^2 -1171.36 -1162.907
log(medv) ~ log(dis) -993.379 -984.926
log(medv) ~ rm^2 + log(dis) -1233.86 -1221.175
log(medv) ~ rm^2 + age -1265.07 -1252.394
log(medv) ~ log(dis) + age -1021.36 -1008.683
log(medv) ~ rm^2 + log(dis) + age - 1 -940.149 -929.47
log(medv) ~ rm^2 + log(dis) + age -1264.22 -1247.315
log(medv) ~ -1 1132.453 1132.453
Table 8 The AIC and BIC scores of all candidate models
From the table above, the regression model with the smallest AIC score contains the variables rm² and age as well as the intercept, and the model with the smallest BIC score is the same one. Checking this model, and setting α = 0.05, all of its variables are significant. So we select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.
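The scoring in the formulas above can be sketched as follows. This is a minimal Python illustration on synthetic data (only the sample size 506 is borrowed from the paper; the variables x1 and x2 are made up, with x2 irrelevant by construction, so both criteria should avoid the intercept-only model and BIC should penalize the extra term more heavily).

```python
import math
import numpy as np

def aic_bic(X, y):
    """AIC = n*log(SSE/n) + 2*p,  BIC = n*log(SSE/n) + log(n)*p."""
    n, p = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    r = y - X @ beta
    base = n * math.log(float(r @ r) / n)
    return base + 2 * p, base + math.log(n) * p

rng = np.random.default_rng(3)
n = 506  # same sample size as the Boston data; the data here are synthetic
x1, x2 = rng.normal(size=n), rng.normal(size=n)
y = 3.0 + 0.5 * x1 + rng.normal(0.0, 0.3, n)  # x2 is irrelevant by construction

candidates = {
    "1": np.ones((n, 1)),                         # intercept only
    "x1": np.column_stack([np.ones(n), x1]),      # intercept + x1
    "x1 + x2": np.column_stack([np.ones(n), x1, x2]),
}
scores = {name: aic_bic(X, y) for name, X in candidates.items()}
best_aic = min(scores, key=lambda m: scores[m][0])
best_bic = min(scores, key=lambda m: scores[m][1])
print(best_aic, best_bic)
```

Because log(506) ≈ 6.2 > 2, BIC charges more per parameter than AIC and therefore tends to prefer smaller models.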
2.6 Analysis of the related statistics
2.6.1 Fitting the model using a subset of the data
According to the results above, we choose the model "m12", which has the minimum BIC, as our fitted model. From question 6, the model can be written as

log(medv) = β0 + β1·rm² + β2·age + ε,

where ε is the error term. We now use the data from Group1 to fit this model. From the R output, the fitted model is

log(medv) = 2.3360 + 0.0256·rm² − 0.0048·age,

and the p-values of all the explanatory variables are significant at any conventional level.
2.6.2 Computing and analyzing the SSPE for a subset of the data
We can also apply another method, cross-validation, to further examine the model selection. The procedure is as follows. First, we split the data into two subsets according to a user-defined criterion, Group1 and Group2, also called the training data and the validation data. Second, we fit the model using the data in Group1. Third, using the data in Group2, we predict the response variable log(medv); denote the predicted values by log(medv)-hat. Finally, we compute the SSPE, the sum of squared prediction errors.
Therefore, according to the question, we first divided the original BostonHousing data set into two groups, Group1 and Group2. We then computed the SSPE on Group2 according to its definition:

SSPE = Σ ( log(medv_i) − log(medv_i)-hat )²,

where the sum runs over the observations i in Group2, log(medv_i) denotes the observed response in Group2, and log(medv_i)-hat the predicted value computed with R's prediction function. The resulting SSPE on Group2 is 0.02835043.
At the same time, the model we obtained in Section 2.4 (question 5) is

log(medv) = β0 + β1·rm² + β2·age + ε,

where ε is the error term, and it is the same model as the one obtained in Section 2.5 (question 6). Therefore, the two selection procedures give the same result for the same model.
3 Conclusion
In this project, we first examined the scatterplot matrix of four variables: nox, indus, dis, and tax. According to the scatterplot matrix, these four variables are all related in some pattern. Generally speaking, nox is negatively related to dis, and indus is also negatively related to dis; the relationships between the other variable pairs are positive at low values and become vague at high values. Possible explanations follow from the meanings of these variables. We also found that the relationship between nox and dis is nonlinear; therefore, we cannot use the correlation between these two variables to quantify the strength of their relationship.
Second, we tested several null hypotheses about the fitted model. Using F-tests and the corresponding p-values, we found that the p-values for the null hypotheses β2 = 0, β1 = β3 = 0, and (β2 = β3 = 0 and β4 = β5) are all smaller than 0.05, so we reject all the null hypotheses at the 5% level.
Third, we used the forward algorithm to find the best model for the regression problem: we added variables one at a time, using p-values to test whether each candidate variable was significant in the model, and found that the final model includes the variable rm², the variable age, and the intercept. At the same time, we also used the backward algorithm for model selection: starting from the model with all the variables, we removed insignificant variables one by one based on their p-values. Finally, we found that the model obtained by the backward algorithm is the same as the one found by the forward algorithm.
At the same time, we also used both the AIC and the BIC criterion for model selection. We found that the regression model with the smallest AIC score contains rm², age, and the intercept, and the model with the smallest BIC score is the same one. Setting α = 0.05, all the variables in this model are significant, so we select the model containing rm², age, and the intercept under both the AIC and the BIC criterion.
Finally, we also applied cross-validation to further examine the model selection, and we computed the sum of squared prediction errors, SSPE, on Group2. The model obtained in Section 2.4 (question 5) is the same as the one obtained in Section 2.5 (question 6), so the two procedures give the same result for the same model.
4 Appendix
The following materials are the R codes used for this project, together with the corresponding output. Lines beginning with ">" are the commands entered; the remaining lines are R output.

# Question 1:
> nox <- BostonHousing$nox
> indus <- BostonHousing$indus
> dis <- BostonHousing$dis
> tax <- BostonHousing$tax
> pairs(~nox+indus+dis+tax, main="Scatterplot for nox,indus,dis,tax")

# Question 2:
> cor(nox, dis)
[1] -0.7692301
> model <- lm(nox ~ I(1/dis))  # I() is needed so 1/dis is arithmetic, not formula syntax
> summary(model)

# Question 3:
(a)
> library("mlbench", lib.loc="~/Library/R/3.3/library")
> data("BostonHousing")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, dissquare = dis*dis)
> u1 <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> u2 <- lm(nox ~ dis + dissquare + indus + tax, BostonHousing)
> anova(u1, u2)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ dis + dissquare + indus + tax
  Res.Df    RSS Df Sum of Sq     F Pr(>F)
1    500 1.6897
2    501 1.7097 -1 -0.019976 5.911 0.0154 *
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(b)
> u3 <- lm(nox ~ logdis + indus + tax, BostonHousing)
> anova(u1, u3)
Analysis of Variance Table
Model 1: nox ~ dis + logdis + dissquare + indus + tax
Model 2: nox ~ logdis + indus + tax
  Res.Df    RSS Df Sum of Sq
1    500 1.6897
2    502 1.7306 -2 -0.040907 6.0524 0.002528 **
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(c)
> A = matrix(c(0,0,1,0,0,0, 0,0,0,1,0,0, 0,0,0,0,1,-1), nrow=3, byrow=TRUE)
> Model <- lm(nox ~ dis + logdis + dissquare + indus + tax, BostonHousing)
> variance <- (A %*% vcov(Model) %*% t(A))
> E <- eigen(variance, TRUE)
> Evalues <- E$values
> Evectors <- E$vectors
> sqrtvariance <- Evectors %*% diag(1/sqrt(Evalues)) %*% t(Evectors)
> Z <- sqrtvariance %*% A %*% coef(Model)
> F <- sum(Z^2)/3
> F
[1] 42.80353

# Question 4:
> Medv <- log(medv)
> Rm <- (rm)^2
> Dis <- log(dis)
> forward11 <- lm(Medv ~ 1)
> summary(forward11)
Call:
lm(formula = Medv ~ 1)
Residuals:
     Min       1Q   Median       3Q      Max
-1.42507 -0.19983  0.01949  0.18436  0.87751
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.03451    0.01817     167   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.4088 on 505 degrees of freedom

> forward12 <- lm(Medv ~ Rm - 1)
> summary(forward12)
Call:
lm(formula = Medv ~ Rm - 1)
Residuals:
    Min      1Q  Median      3Q     Max
-2.5860 -0.1694  0.1560  0.4042  2.3811
Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Rm 0.0735845  0.0005646   130.3   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.5208 on 505 degrees of freedom
Multiple R-squared: 0.9711, Adjusted R-squared: 0.9711
F-statistic: 1.699e+04 on 1 and 505 DF, p-value: < 2.2e-16

> forward13 <- lm(Medv ~ age - 1)
> summary(forward13)
Call:
lm(formula = Medv ~ age - 1)
Residuals:
    Min      1Q  Median      3Q     Max
-2.0839 -0.5927  0.3357  1.5142  3.4463
Coefficients:
     Estimate Std. Error t value Pr(>|t|)
age 0.0369330  0.0008236   44.84   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.373 on 505 degrees of freedom
Multiple R-squared: 0.7993, Adjusted R-squared: 0.7989
F-statistic: 2011 on 1 and 505 DF, p-value: < 2.2e-16

> forward14 <- lm(Medv ~ Dis - 1)
> summary(forward14)
Call:
lm(formula = Medv ~ Dis - 1)
Residuals:
    Min      1Q  Median      3Q     Max
-2.2240 -0.4238  0.6628  1.2085  3.6475
Coefficients:
    Estimate Std. Error t value Pr(>|t|)
Dis  2.17068    0.03972   54.66   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 1.165 on 505 degrees of freedom
Multiple R-squared: 0.8554, Adjusted R-squared: 0.8551
F-statistic: 2987 on 1 and 505 DF, p-value: < 2.2e-16

> forward21 <- lm(Medv ~ Rm)
> summary(forward21)
Call:
lm(formula = Medv ~ Rm)
Residuals:
     Min       1Q   Median       3Q      Max
-1.20269 -0.10530  0.06992  0.17255  1.31948
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.878478   0.063036    29.8   <2e-16 ***
Rm          0.028909   0.001537    18.8   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3137 on 504 degrees of freedom
Multiple R-squared: 0.4123, Adjusted R-squared: 0.4112
F-statistic: 353.6 on 1 and 504 DF, p-value: < 2.2e-16

> forward22 <- lm(Medv ~ age)
> summary(forward22)
Call:
lm(formula = Medv ~ age)
Residuals:
     Min       1Q   Median       3Q      Max
-1.21816 -0.20280 -0.01733  0.16722  1.08442
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  3.4860274  0.0427295   81.58   <2e-16 ***
age         -0.0065843  0.0005765  -11.42   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.3647 on 504 degrees of freedom
Multiple R-squared: 0.2056, Adjusted R-squared: 0.204
F-statistic: 130.4 on 1 and 504 DF, p-value: < 2.2e-16

> forward23 <- lm(Medv ~ Dis)
> summary(forward23)
Call:
lm(formula = Medv ~ Dis)
Residuals:
     Min       1Q   Median       3Q      Max
-1.18240 -0.21227 -0.02365  0.16558  1.20522
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.66935    0.04024  66.338   <2e-16 ***
Dis          0.30737    0.03084   9.965   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.374 on 504 degrees of freedom
Multiple R-squared: 0.1646, Adjusted R-squared: 0.163
F-statistic: 99.31 on 1 and 504 DF, p-value: < 2.2e-16

> forward31 <- lm(Medv ~ Rm + Dis)
> summary(forward31)
Call:
lm(formula = Medv ~ Rm + Dis)
Residuals:
     Min       1Q   Median       3Q      Max
-1.05461 -0.12689  0.03383  0.16131  1.46235
Coefficients:
            Estimate Std. Error t value Pr(>|t|)
(Intercept) 1.746011   0.061332  28.468  < 2e-16 ***
Rm          0.026088   0.001484  17.585  < 2e-16 ***
Dis         0.206437   0.024965   8.269 1.21e-15 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2946 on 503 degrees of freedom
Multiple R-squared: 0.4827, Adjusted R-squared: 0.4806
F-statistic: 234.6 on 2 and 503 DF, p-value: < 2.2e-16
> forward32 <- lm(Medv ~ Rm + age)
> summary(forward32)
Call:
lm(formula = Medv ~ Rm + age)
Residuals:
    Min      1Q  Median      3Q     Max
-1.0789 -0.1094  0.0335  0.1300  1.4183
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared: 0.5136, Adjusted R-squared: 0.5117
F-statistic: 265.6 on 2 and 503 DF, p-value: < 2.2e-16

> forward41 <- lm(Medv ~ Rm + Dis + age)
> summary(forward41)
Call:
lm(formula = Medv ~ Rm + Dis + age)
Residuals:
     Min       1Q   Median       3Q      Max
-1.06502 -0.11534  0.02519  0.13058  1.43388
Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
Dis          0.0402145  0.0376701   1.068    0.286
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared: 0.5147, Adjusted R-squared: 0.5118
F-statistic: 177.5 on 3 and 502 DF, p-value: < 2.2e-16
> forward<-lm(Medv~Rm+age)
> summary(forward)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0789 -0.1094  0.0335  0.1300  1.4183 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,	Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 5:
> backward11<-lm(Medv~Rm+age+Dis)
> summary(backward11)

Call:
lm(formula = Medv ~ Rm + age + Dis)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.06502 -0.11534  0.02519  0.13058  1.43388 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.2520854  0.1061091  21.224  < 2e-16 ***
Rm           0.0254895  0.0014420  17.676  < 2e-16 ***
age         -0.0041510  0.0007209  -5.758 1.48e-08 ***
Dis          0.0402145  0.0376701   1.068    0.286    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 502 degrees of freedom
Multiple R-squared:  0.5147,	Adjusted R-squared:  0.5118
F-statistic: 177.5 on 3 and 502 DF,  p-value: < 2.2e-16
> backward21<-lm(Medv~Rm+age)
> summary(backward21)

Call:
lm(formula = Medv ~ Rm + age)

Residuals:
    Min      1Q  Median      3Q     Max 
-1.0789 -0.1094  0.0335  0.1300  1.4183 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.3346303  0.0726764   32.12   <2e-16 ***
Rm           0.0256312  0.0014361   17.85   <2e-16 ***
age         -0.0047407  0.0004632  -10.23   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2856 on 503 degrees of freedom
Multiple R-squared:  0.5136,	Adjusted R-squared:  0.5117
F-statistic: 265.6 on 2 and 503 DF,  p-value: < 2.2e-16

# Question 6:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
> n <- 506
> m1<-lm(logmedv~1)
> m2<-lm(logmedv~rmsq-1)
> m3<-lm(logmedv~age-1)
> m4<-lm(logmedv~logdis-1)
> m5<-lm(logmedv~rmsq+logdis-1)
> m6<-lm(logmedv~logdis+age-1)
> m7<-lm(logmedv~rmsq+age-1)
> m8<-lm(logmedv~age)
> m9<-lm(logmedv~rmsq)
> m10<-lm(logmedv~logdis)
> m11<-lm(logmedv~rmsq+logdis)
> m12<-lm(logmedv~rmsq+age)
> m13<-lm(logmedv~logdis+age)
> m14<-lm(logmedv~rmsq+logdis+age-1)
> m15<-lm(logmedv~rmsq+logdis+age)
> m16<-lm(logmedv~-1)
>
> AIC1=n*log(sum(m1$residuals^2)/n)+2*1
> AIC2=n*log(sum(m2$residuals^2)/n)+2*1
> AIC3=n*log(sum(m3$residuals^2)/n)+2*1
> AIC4=n*log(sum(m4$residuals^2)/n)+2*1
> AIC5=n*log(sum(m5$residuals^2)/n)+2*2
> AIC6=n*log(sum(m6$residuals^2)/n)+2*2
> AIC7=n*log(sum(m7$residuals^2)/n)+2*2
> AIC8=n*log(sum(m8$residuals^2)/n)+2*2
> AIC9=n*log(sum(m9$residuals^2)/n)+2*2
> AIC10=n*log(sum(m10$residuals^2)/n)+2*2
> AIC11=n*log(sum(m11$residuals^2)/n)+2*3
> AIC12=n*log(sum(m12$residuals^2)/n)+2*3
> AIC13=n*log(sum(m13$residuals^2)/n)+2*3
> AIC14=n*log(sum(m14$residuals^2)/n)+2*3
> AIC15=n*log(sum(m15$residuals^2)/n)+2*4
> AIC16=n*log(sum(m16$residuals^2)/n)+2*0
>
> BIC1=n*log(sum(m1$residuals^2)/n)+log(n)*1
> BIC2=n*log(sum(m2$residuals^2)/n)+log(n)*1
> BIC3=n*log(sum(m3$residuals^2)/n)+log(n)*1
> BIC4=n*log(sum(m4$residuals^2)/n)+log(n)*1
> BIC5=n*log(sum(m5$residuals^2)/n)+log(n)*2
> BIC6=n*log(sum(m6$residuals^2)/n)+log(n)*2
> BIC7=n*log(sum(m7$residuals^2)/n)+log(n)*2
> BIC8=n*log(sum(m8$residuals^2)/n)+log(n)*2
> BIC9=n*log(sum(m9$residuals^2)/n)+log(n)*2
> BIC10=n*log(sum(m10$residuals^2)/n)+log(n)*2
> BIC11=n*log(sum(m11$residuals^2)/n)+log(n)*3
> BIC12=n*log(sum(m12$residuals^2)/n)+log(n)*3
> BIC13=n*log(sum(m13$residuals^2)/n)+log(n)*3
> BIC14=n*log(sum(m14$residuals^2)/n)+log(n)*3
> BIC15=n*log(sum(m15$residuals^2)/n)+log(n)*4
> BIC16=n*log(sum(m16$residuals^2)/n)+log(n)*0
> min(AIC1,AIC2,AIC3,AIC4,AIC5,AIC6,AIC7,AIC8,AIC9,AIC10,AIC11,AIC12,AIC13,AIC14,AIC15,AIC16)
[1] -1265.073
> AIC12
[1] -1265.073
> min(BIC1,BIC2,BIC3,BIC4,BIC5,BIC6,BIC7,BIC8,BIC9,BIC10,BIC11,BIC12,BIC13,BIC14,BIC15,BIC16)
[1] -1252.394
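The criteria above are computed by hand from each model's residual sum of squares, using AIC = n·log(RSS/n) + 2k and BIC = n·log(RSS/n) + k·log(n), where k counts the estimated coefficients. As a language-agnostic sketch of this ranking step, the following Python snippet applies the same two formulas; the RSS values and model names here are made up for illustration, not the ones from the Boston analysis:

```python
import math

def aic(rss, n, k):
    # AIC on the scale used above: n*log(RSS/n) + 2k
    return n * math.log(rss / n) + 2 * k

def bic(rss, n, k):
    # BIC replaces the 2k penalty with k*log(n)
    return n * math.log(rss / n) + k * math.log(n)

# Hypothetical (RSS, k) pairs for three candidate models; n matches the data size.
n = 506
models = {"m9": (42.0, 2), "m12": (38.0, 3), "m15": (37.95, 4)}

aics = {name: aic(rss, n, k) for name, (rss, k) in models.items()}
bics = {name: bic(rss, n, k) for name, (rss, k) in models.items()}
best_aic = min(aics, key=aics.get)
best_bic = min(bics, key=bics.get)
```

With these illustrative numbers, m15's tiny RSS improvement over m12 does not offset its extra-parameter penalty, so both criteria pick m12 — the same kind of trade-off that leads the analysis above to select model 12.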
> BIC12
[1] -1252.394

# Question 7:
> data("BostonHousing", package="mlbench")
> BostonHousing <- transform(BostonHousing, logdis = log(dis))
> BostonHousing <- transform(BostonHousing, logmedv = log(medv))
> BostonHousing <- transform(BostonHousing, rmsq = rm*rm)
> attach(BostonHousing)
>
> Group1 <- subset(BostonHousing, BostonHousing$zn!=55.0)
> Group2 <- subset(BostonHousing, BostonHousing$zn==55.0)
> fitmodel <- lm(logmedv~rmsq+age, data = Group1)
> summary(fitmodel)

Call:
lm(formula = logmedv ~ rmsq + age, data = Group1)

Residuals:
     Min       1Q   Median       3Q      Max 
-1.07887 -0.10964  0.03389  0.13020  1.41838 

Coefficients:
              Estimate Std. Error t value Pr(>|t|)    
(Intercept)  2.3360096  0.0729281   32.03   <2e-16 ***
rmsq         0.0256286  0.0014406   17.79   <2e-16 ***
age         -0.0047542  0.0004661  -10.20   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 0.2864 on 500 degrees of freedom
Multiple R-squared:  0.5129,	Adjusted R-squared:  0.5109
F-statistic: 263.2 on 2 and 500 DF,  p-value: < 2.2e-16

> p <- predict(fitmodel, newdata=Group2)
> SSPE <- sum((Group2$logmedv-p)^2)
> SSPE
[1] 0.02835043
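The SSPE above sums the squared differences between the held-out observations (the tracts with zn = 55.0) and the predictions from the model fitted on the remaining tracts. A minimal Python sketch of that computation, using hypothetical log(medv) values and predictions rather than the actual Boston figures:

```python
def sspe(observed, predicted):
    # Sum of squared prediction errors over a held-out subset
    return sum((y - yhat) ** 2 for y, yhat in zip(observed, predicted))

# Hypothetical held-out log(medv) values and the corresponding
# fitted-model predictions (illustrative numbers only).
observed  = [3.10, 3.25, 2.95]
predicted = [3.05, 3.20, 3.00]

print(round(sspe(observed, predicted), 6))  # 0.0075, i.e. three errors of 0.05 each
```

A small SSPE like the 0.02835 reported above indicates the model fitted on Group1 predicts the held-out Group2 tracts closely.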