Georgia Institute of Technology
ISyE 6414 Spring 2014
April 25
REGRESSION ANALYSIS ON HEALTH
INSURANCE COVERAGE RATE A Study on Influential Factors of Uninsured Rate in Georgia
Group 8
Xueying Linghu 903004963
Chaoyi Wu 903001682
Summary
The proportion of people without health insurance coverage in the United States is
affected by demographic and geographic factors such as income, unemployment rate
and gender. In this paper, we use multiple linear regression to build a model that
estimates the uninsured rate of counties in Georgia from several predictors. We found
that the proportion of people without health insurance coverage is closely related to
the age distribution, median income, poverty level, employment situation, gender
distribution and citizenship status of each county. A large population between 18 and
64 years, a large native-born population, a large population with income well above
the poverty level, or a prosperous job market indicates a low uninsured rate. With our
model, predictions of the future uninsured rate can be obtained from chosen predictors.
The project is divided into four parts. First we chose our topic and set up the problem
statement. In the data preparation stage, we cleaned the data from the data source and
applied Principal Component Analysis to correlated variables. In the model fitting
stage, we used several model selection methods, such as stepwise regression, LASSO
and the F test, to select the best model, followed by inference analysis and
interpretation. The last part is a discussion of future study.
Table of Contents
Summary
Background and Problem Statement
Data for Modeling
  1. Data Source
  2. Data exploration
  3. Principal Component Analysis for Employment Related Variables
Fitting the Multiple Linear Regression Model
  1. Full Model
  2. Exploration on Transformation
  3. Variable Selection
  4. Exploration on Interaction
  5. Model Comparison
  6. Interpretation
  7. Diagnostics
Evolution
Appendix
  Appendix A. Code
  Appendix B. ANOVA Table for Full Model
  Appendix C. Exhaustive Search Result
  Appendix D. Stepwise Process
Background and Problem Statement
The number of people without health insurance coverage in the United States is one of
the primary concerns raised by advocates of health care reform. A person without
health insurance is commonly termed uninsured. According to the United States
Census Bureau, the percentage of the non-elderly population who are uninsured has
been generally increasing since the year 2000.
The causes of this uninsured rate remain a matter of political debate.
Americans who are uninsured may be so because their job does not offer insurance;
because they are unemployed and cannot pay for insurance; or because they are
financially able to buy insurance but consider the cost prohibitive. Other factors that
may influence the health insurance coverage rate include age, education level, race,
sex and so on.
To better understand the relevant factors influencing the health insurance coverage
rate and to make predictions, we decide to set up a multiple linear regression model
that can be used to predict the uninsured population given the demographic and
geographic information of Georgia.
It can be expected that income should be a factor that influences people's decision on
whether or not to purchase health insurance. The age distribution may also explain the
uninsured rate to some extent: young people who believe they are healthy and that
money should not be spent on healthcare are less likely to have health insurance.
Besides these, education, race, gender and even macroeconomic indicators such as the
unemployment rate may all have a significant relation with the uninsured rate.
Data for Modeling
1. Data Source
The data is from American FactFinder1. County-level data for the state of Georgia is
extracted from three Health Insurance Coverage Status tables (2010 ACS 3-year
estimates, 2011 ACS 3-year estimates, 2012 ACS 5-year estimates2) and three Income
in the Past 12 Months tables (2010 ACS 3-year estimates, 2011 ACS 3-year estimates,
2012 ACS 5-year estimates). In the health insurance tables, population data is
collected and categorized into subjects such as Age, Race, Sex, etc.
We extract data from the tables and choose the following variables:
Pop: total civilian noninstitutionalized population3 (for simplicity, we refer to it as the
total population in what follows). All data collected is for the civilian
noninstitutionalized population.
1 American FactFinder is a web site used to distribute data collected by the United States Census Bureau.
2 The description is available at http://www.census.gov/acs/www/guidance_for_data_users/estimates/
3 People 16 years of age and older residing in the 50 States and the District of Columbia who are not
inmates of institutions (penal, mental facilities, homes for the aged), and who are not on active duty in
the Armed Forces.
Uninsure: uninsured population/total population. It will be the response in our model.
Age_18_64 (%): population aged 18 to 64 years/total population. We do not consider
the population under 18 because their insurance is usually covered by their families,
nor the population over 65 because they qualify for government healthcare programs.
It should be pointed out that a better variable might be the population aged 19 to 25
years, because this group usually does not have high income and is more likely to
ignore the importance of health insurance. However, data for this group is incomplete,
so we use this variable instead.
HgSch_below (%): proportion of the population 25 years and older with less than a
high school education.
Unem (%): unemployed population/total labor force.
LaFrc (%): population in the labor force/population 18 years and older.
FT (%): population that worked full time in the past 12 months/population 18 years
and older.
NonFT (%): population that worked less than full time in the past 12 months/population
18 years and older.
NoWork (%): population that did not work/population 18 years and older.
NtBrn (%): native born population/total population
Female (%): female population/total population
Black (%): black or African American alone population/total population
Inc2Pov_High (%): population with Ratio of Income to Poverty Level at 2.00 and
over in the past 12 months/total population for whom poverty status is determined. A
higher ratio indicates a better financial condition.
Income (dollars): median household income.
Yr: categorical variable used to capture differences in the data due to the three
different surveys. 1 stands for the 2012 ACS 5-year estimates, 2 for the 2011 ACS
3-year estimates, and 3 for the 2010 ACS 3-year estimates.
Data is available for all 159 counties in Georgia in the 2012 ACS 5-year estimates
table, but the other two tables cover only 92 counties. After removing some
observations with missing data, we have 309 observations.
2. Data exploration
There are five points that should be mentioned about the data.
a) Correlated variables. The variables Unem, LaFrc, FT, NonFT and NoWork are all
related to employment; in particular, FT, NonFT and NoWork sum to the whole
population 18 years and older. It is interesting to see that the correlations among FT,
NonFT and NoWork are not as strong as the correlation of each of the three with
LaFrc. TABLE 1 shows their correlation coefficients. For now we do not simply
choose one of the five. Instead we will use Principal Component Analysis (PCA) later
to deal with them. Correlations also exist among other variables, so we need to be
cautious in modeling and variable selection.
Correlation Unem LaFrc FT NonFT NoWork
Unem 1.000 -0.124 -0.435 0.084 0.285
LaFrc -0.124 1.000 0.831 0.549 -0.961
FT -0.435 0.831 1.000 0.074 -0.812
NonFT 0.084 0.549 0.074 1.000 -0.641
NoWork 0.285 -0.961 -0.812 -0.641 1.000
TABLE 1. Correlation between Employment Variables
b) Outlier detection. Observation 50 (highlighted in red in FIGURE 1) is considered
an outlier. One reason is that, judging from both plots in FIGURE 1, it lies far away
from the other points in Uninsure while it is not unusual in the other variables, which
means it is abnormal only in the response. The other reason is that the observation is
for Echols County, which has served as a place of banishment for many of Georgia's
criminals. Though no confirmed causality links this fact to the abnormally high
uninsured proportion, we prefer to remove the observation from the data before
modeling. This leaves 308 observations for modeling.
FIGURE 1. Scatter Plot with Outlier in Red
c) The total population is highly skewed (see FIGURE 2) and is already accounted for
in the other data: for most percentage variables, the total population is the
denominator. We therefore prefer not to add it to the model as a predictor.
FIGURE 2. Population Distribution
d) The data is not suitable for Binomial or Poisson regression because no
cross-tabulated count data is available. For example, no specific population size is
reported for females aged 18 to 64 who worked full time in the last 12 months. Given
the restricted data availability, multiple linear regression is the practical choice.
e) From the scatter plots on the right of FIGURE 1, we can tell that, except for
HgSch_below and Inc2Pov_High, no plot shows a clear pattern between a predictor
and the response. HgSch_below shows a curved positive trend with Uninsure, and
Inc2Pov_High has a nonlinear negative trend with Uninsure. The plots indicate that
we should try some data transformations. After several tries, it appears that a log
transformation of both Uninsure and HgSch_below improves the linearity between
them. The ladder plot (FIGURE 3) below verifies this conclusion: the index on top of
the plot is the power applied to HgSch_below (0 means log transformation) and the
index on the right is for Uninsure. In the plot, log(Uninsure) vs log(HgSch_below) is
the most linear. We will try a regression with transformed data in the next section. No
effective transformations were found for the other variables.
FIGURE 3. Summary of Transformations with Different Powers
3. Principal Component Analysis for Employment Related Variables
Principal component analysis (PCA) is a statistical procedure that uses orthogonal
transformation to convert a set of observations of possibly correlated variables into a
set of values of linearly uncorrelated variables called principal components. This
transformation is defined in such a way that the first principal component has the
largest possible variance (that is, accounts for as much of the variability in the data as
possible), and each succeeding component in turn has the highest variance possible
under the constraint that it is orthogonal to (i.e., uncorrelated with) the preceding
components. Principal components are guaranteed to be independent if the data set is
jointly normally distributed.
In the data we collected, five predictors describe the employment status and working
experience of people in each county of Georgia: Unem, LaFrc, FT, NonFT and
NoWork. FIGURE 4 shows the relationships among these predictors. We find linear
relationships between them, which indicates they are highly correlated. To avoid
including too many correlated predictors and to simplify our model, we apply PCA to
convert the observations of these five variables into a set of uncorrelated variables
called principal components, and use the influential principal components as our new
predictors in the regression analysis.
After applying the PCA function in R, we find that the first two principal components
explain 93.8% of the variability, as can be seen in FIGURE 5. Therefore we decide to
use them as new predictors, named work1 and work2, since they represent factors
concerning employment and working status. We also obtain the linear combinations
of the five original predictors (which are standardized) with their corresponding
weights (loadings) for work1 and work2:
work1 = -0.613 LaFrc - 0.422 FT - 0.208 NonFT + 0.630 NoWork
work2 = 0.355 Unem - 0.584 FT + 0.715 NonFT - 0.128 NoWork
FIGURE 4. Scatter Plot for Unem, LaFrc, FT, NonFT and NoWork
FIGURE 5. Proportion of Variance
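Why such correlated predictors collapse onto one or two components can be seen in a two-variable special case: for two standardized variables with correlation r, the correlation matrix [[1, r], [r, 1]] has eigenvalues 1 + |r| and 1 - |r|. A simplified two-variable Python sketch using the LaFrc/NoWork correlation of -0.961 from TABLE 1 (not the full five-variable PCA):

```python
# Eigenvalues of the 2x2 correlation matrix [[1, r], [r, 1]] are 1 + |r|
# and 1 - |r|; each principal component's share of the variance is its
# eigenvalue divided by the trace (= 2 for two standardized variables).
r = -0.961                        # LaFrc vs NoWork from TABLE 1
lam1, lam2 = 1 + abs(r), 1 - abs(r)
explained = lam1 / (lam1 + lam2)  # first component explains ~98% here
```

The same mechanism, applied to all five standardized employment variables, is why the first two components already explain 93.8% of the variability.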
The data is now ready for modeling. We bring the variables Age_18_64, HgSch_below,
NtBrn, Black, Inc2Pov_High, Yr, Female, work1, work2, Income and the response
variable Uninsure into the next section for modeling.
Fitting the Multiple Linear Regression Model
1. Full Model
First we include all variables in the model; TABLE 2 shows the estimated coefficients.
According to the p values, Age_18_64, NtBrn, Inc2Pov_High, Female, work1 and
Income have a significant effect on the response at a significance level of 0.001, while
the parameters of HgSch_below, Black, Yr and work2 are not stable. That means
either education level is not an important factor or its effect is absorbed by other
correlated predictors; Black and work2 may fail to be effective predictors for similar
reasons. Yr is used as an indicator of the possible impact on the regression of the
different approaches used to collect the data. Its insignificance suggests that we can
drop the indicator and do not need to worry about this hard-to-control factor.
Coefficients Estimate Std. Error p value
Intercept 1.265e+02 1.015e+01 < 2e-16
Age_18_64 -2.693e-01 6.565e-02 5.32e-05
HgSch_below 3.432e-04 3.861e-02 0.992913
NtBrn -2.915e-01 3.998e-02 2.82e-12
Black -6.016e-03 1.022e-02 0.556413
Inc2Pov_High -1.829e-01 4.693e-02 0.000121
as.factor(Yr)2 -1.234e-01 3.299e-01 0.708713
as.factor(Yr)3 -2.908e-01 3.514e-01 0.408615
Female -8.332e-01 1.193e-01 1.91e-11
work1 -1.864e-01 2.506e-02 1.13e-12
work2 -3.794e-03 3.372e-02 0.910496
Income -2.166e-06 3.806e-07 3.03e-08
TABLE 2. Summary of Full Model
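The significance pattern described above can be recovered from TABLE 2 itself, since each t statistic is the estimate divided by its standard error. A quick Python check (values copied from the table; |t| > 3.3 is used here as a rough proxy for the 0.001 level):

```python
# Coefficient estimates and standard errors copied from TABLE 2.
full_model = {
    "Age_18_64":    (-2.693e-01, 6.565e-02),
    "HgSch_below":  ( 3.432e-04, 3.861e-02),
    "NtBrn":        (-2.915e-01, 3.998e-02),
    "Black":        (-6.016e-03, 1.022e-02),
    "Inc2Pov_High": (-1.829e-01, 4.693e-02),
    "Yr2":          (-1.234e-01, 3.299e-01),
    "Yr3":          (-2.908e-01, 3.514e-01),
    "Female":       (-8.332e-01, 1.193e-01),
    "work1":        (-1.864e-01, 2.506e-02),
    "work2":        (-3.794e-03, 3.372e-02),
    "Income":       (-2.166e-06, 3.806e-07),
}

# t statistic = estimate / standard error; |t| > 3.3 roughly matches
# the 0.001 level with ~296 residual degrees of freedom.
significant = {name for name, (est, se) in full_model.items()
               if abs(est / se) > 3.3}
```

This reproduces the six significant predictors named in the text: Age_18_64, NtBrn, Inc2Pov_High, Female, work1 and Income.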
The ANOVA table in Appendix B shows that, with all other predictors in the model,
HgSch_below and Black can still explain some information in the response, even
though their individual p values in the summary are large.
2. Exploration on Transformation
In the data exploration we saw that a log transformation of Uninsure and
HgSch_below might increase linearity, so we apply the log transformation to both and
run the regression again. Since the residual variance is not constant (see FIGURE 6),
it is not wise to adopt this transformation. It is worth mentioning that other
transformations have been tried, but none of them improves the model fit and some
destroy the constant variance assumption, so we give up on data transformation.
FIGURE 6. Residuals in model with log transformation for Uninsure and HgSch_below
3. Variable Selection
We select variables with exhaustive search using Mallow's Cp, stepwise regression
using AIC, and LASSO using Mallow's Cp.
The three approaches all remove the same variables from the model, namely
HgSch_below, Black, Yr and work2, which happen to be the four with insignificant
parameters in the full model. From FIGURE 7, we can see that Cp reaches its
minimum at the step where Age_18_64, NtBrn, Inc2Pov_High, Female, work1 and
Income are in the model (see TABLE 3 for the sequence of moves). Outputs for the
exhaustive search and stepwise regression are in Appendix C and D.
FIGURE 7. LASSO Step and corresponding Cp
Sequence of LASSO moves (variable entered at each step):
Step  1: Inc2Pov_High (Var 5)
Step  2: HgSch_below (Var 2)
Step  3: NtBrn (Var 3)
Step  4: Female (Var 10)
Step  5: Income (Var 9)
Step  6: work1 (Var 7)
Step  7: Age_18_64 (Var 1)
Step  8: Black (Var 4)
Step  9: Yr (Var 6)
Step 10: work2 (Var 8)
TABLE 3. Sequence of LASSO moves
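Mallow's Cp, used by both the exhaustive search and the LASSO plot, scores a p-parameter submodel as Cp = RSS_p / sigma^2 - n + 2p, where sigma^2 is estimated by the full model's mean squared error; good submodels have Cp close to p. A small sketch of the formula with hypothetical numbers (not the project's actual RSS values):

```python
def mallows_cp(rss_p, sigma2_full, n, p):
    # Cp = RSS_p / sigma^2 - n + 2p, with sigma^2 taken from the full
    # model's mean squared error. A good submodel has Cp close to p.
    return rss_p / sigma2_full - n + 2 * p

# Sanity check with hypothetical numbers: for the full model itself,
# RSS = 100 with n = 50 observations and p = 10 parameters gives
# sigma^2 = 100 / 40, and Cp reduces to exactly p.
sigma2 = 100 / (50 - 10)
cp_full = mallows_cp(100, sigma2, 50, 10)
```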
After model selection, we decide to include Age_18_64, NtBrn, Inc2Pov_High,
Female, work1 and Income into the model (Model 1).
4. Exploration on Interaction
Notice that HgSch_below is dropped, even though in the data exploration we found it
has an obvious relation with the response Uninsure. We do not want to run a stepwise
regression with the variable forced to stay in the model, since that would come at the
cost of model quality. Instead, we wonder whether there are interaction effects
between HgSch_below and the other variables that should be considered. So in the
next step we include interactions between HgSch_below and all other variables and
obtain a stepwise regression model (Model 2) with residual sum of squares 1470.15
on 294 degrees of freedom.
Coefficients Estimate Std. Error p value
Intercept 2.054e+02 2.963e+01 2.65e-11
Age_18_64 -7.309e-01 2.190e-01 0.000953
NtBrn -6.005e-01 1.181e-01 6.57e-07
Female -1.430e+00 3.678e-01 0.000125
work1 -3.169e-02 8.249e-02 0.701121
Income -4.527e-06 1.175e-06 0.000142
Inc2Pov_High 1.571e-01 1.672e-01 0.348378
HgSch_below -4.060e+00 1.388e+00 0.003718
Age_18_64:HgSch_below 2.645e-02 1.201e-02 0.028454
NtBrn:HgSch_below 1.468e-02 5.922e-03 0.013743
Female:HgSch_below 2.828e-02 1.706e-02 0.098480
work1:HgSch_below -6.987e-03 3.852e-03 0.070756
Income:HgSch_below 1.292e-07 5.923e-08 0.029967
Inc2Pov_High:HgSch_below -1.667e-02 7.642e-03 0.029991
Adjusted R-squared: 0.6731
TABLE 4. Summary of Model 2
Looking at the model performance in TABLE 4, the standard errors for
Female:HgSch_below and work1:HgSch_below are quite large and their p values also
indicate that the estimates are not stable. So we consider removing these two
predictors from the model to obtain a reduced model with residual sum of squares
1504.34 on 296 degrees of freedom.
We compare the stepwise regression model with interactions (Model 2) against its
reduced model with a hypothesis test:
H0: the reduced model loses no information vs. H1: the reduced model loses information.
Under the null hypothesis, the statistic
F = [(RSS_reduced - RSS_Model2) / (df_reduced - df_Model2)] / (RSS_Model2 / df_Model2)
follows an F(2, 294) distribution. Here
F = [(1504.34 - 1470.15) / (296 - 294)] / (1470.15 / 294) = 3.419,
with p value = 0.03, so we can reject the null hypothesis and choose Model 2.
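The partial F statistic can be reproduced directly from the two residual sums of squares reported above (a minimal Python check):

```python
def partial_f(rss_reduced, df_reduced, rss_full, df_full):
    # F = [(RSS_red - RSS_full) / (df_red - df_full)] / (RSS_full / df_full)
    num = (rss_reduced - rss_full) / (df_reduced - df_full)
    den = rss_full / df_full
    return num / den

# Model 2 (full): RSS = 1470.15 on 294 df; reduced: RSS = 1504.34 on 296 df.
f_value = partial_f(1504.34, 296, 1470.15, 294)  # about 3.419
```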
5. Model Comparison
So far we have found two models (Model 1 vs. Model 2) that are potentially good
candidates for our final model. Model 1 is the reduced model obtained by model
selection without any interaction; Model 2 is the stepwise model with interaction
terms. TABLE 4 and TABLE 5 list their coefficients and statistics. Comparing the
two, the adjusted R-squared for Model 2 (0.6731) is slightly larger than for Model 1
(0.6626), but by checking the residual plots, Model 2 does not show an improvement
in the constant variance or normality assumptions.
For the purpose of prediction, we prefer a smaller model to avoid overfitting, so we
choose Model 1 as our final model (see TABLE 5 for the model summary).
Uninsure = 1.288e+02 - 2.826e-01*Age_18_64 - 2.904e-01*NtBrn - 1.820e-01*work1
- 8.731e-01*Female - 2.137e-06*Income - 1.810e-01*Inc2Pov_High
Coefficients Estimate Std. Error t value p value
Intercept 1.288e+02 7.531e+00 17.098 < 2e-16
Age_18_64 -2.826e-01 5.531e-02 -5.110 5.73e-07
NtBrn -2.904e-01 3.950e-02 -7.351 1.87e-12
Inc2Pov_High -1.810e-01 4.234e-02 -4.275 2.57e-05
Female -8.731e-01 9.337e-02 -9.351 < 2e-16
work1 -1.820e-01 2.426e-02 -7.501 7.17e-13
Income -2.137e-06 3.513e-07 -6.084 3.56e-09
Adjusted R-squared: 0.6626
TABLE 5. Summary of Final Model (Model 1)
6. Interpretation
The standard errors in our final model are small, which indicates that the estimates are
reliable. Checking the p values, all the coefficients in our model are significant, since
they have very small p values. In other words, those predictors have significant
influences on our estimate of the uninsured rate.
Since all the coefficients are negative, the uninsured rate decreases as the proportion
of people aged 18 to 64 increases, as the proportion of native-born people increases,
as the proportion of the population with Ratio of Income to Poverty Level at 2.00 and
over increases, as the proportion of females increases, as the principal component of
the employment factors increases, and as the county's median income increases. The
variable work1 equals -0.613 LaFrc - 0.422 FT - 0.208 NonFT + 0.630 NoWork.
That means work1 increases when LaFrc, FT and NonFT increase or when NoWork
decreases; in other words, work1 is an indicator positively correlated with the job
market. In general, the model tells us that a better financial condition may help
decrease the uninsured rate.
We use Fulton County's data in the 2012 5-year estimates to see how the model
performs in prediction. The predicted value is 17.67 with a 95% prediction interval of
[13.15, 22.20]; the interval is quite wide. If we want to predict the uninsured
population for the whole state of Georgia, we can obtain the individual predicted
uninsured rates simultaneously and sum the implied uninsured populations. To get a
simultaneous confidence band, we can use the Bonferroni method to control the
significance level.
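The Fulton point prediction can be checked by plugging its values into the final-model equation; with the coefficients rounded as printed in TABLE 5, the result lands near the reported 17.67 (a Python sketch, values copied from the report):

```python
# Final model coefficients as printed (rounded) in TABLE 5.
coef = {
    "Intercept":     1.288e+02,
    "Age_18_64":    -2.826e-01,
    "NtBrn":        -2.904e-01,
    "Inc2Pov_High": -1.810e-01,
    "Female":       -8.731e-01,
    "work1":        -1.820e-01,
    "Income":       -2.137e-06,
}

# Fulton County's 2012 5-year-estimate values as used in the report.
fulton = {
    "Age_18_64": 67.04666, "NtBrn": 87.0677, "Inc2Pov_High": 66.8271,
    "Female": 51.46529, "work1": -13.71593, "Income": 5766400,
}

pred = coef["Intercept"] + sum(coef[k] * v for k, v in fulton.items())
# About 17.7; the small gap to the reported 17.67 is coefficient rounding.
```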
7. Diagnostics
We use the following residual plots to check the assumptions of the multiple linear
regression model. First, we assume constant variance. In the Residuals vs Fitted plot
and the Standardized Residual plot in FIGURE 8 we find no clear pattern in the
distribution of residuals, which shows that the constant variance assumption is not
violated. The pattern-free residual plot also supports linearity. However, we suspect
slight dependence may exist in the data, though there is no efficient way to check it.
Another assumption is that the errors are normally distributed. From the Q-Q plot we
can see that the points fall almost on the line, except for a few values in the tails,
which is acceptable in our case; therefore we conclude that the normality assumption
holds. Finally, the Cook's Distance plot looks good, with all distances close to zero,
so we conclude that there are no remaining outliers in the model.
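Cook's Distance combines each observation's standardized residual with its leverage: D_i = r_i^2 * h_ii / (p * (1 - h_ii)), where p is the number of model parameters (7 in the final model, counting the intercept). A small Python sketch of the formula with hypothetical residual and leverage values (the actual ones come from the fitted model):

```python
def cooks_distance(std_resid, leverage, p):
    # D_i = r_i^2 * h_ii / (p * (1 - h_ii)), where r_i is the standardized
    # residual, h_ii the leverage, and p the number of model parameters.
    return std_resid ** 2 * leverage / (p * (1.0 - leverage))

# A typical point: small residual at low leverage -> distance near zero.
d_typical = cooks_distance(0.5, 0.05, 7)
# A hypothetical influential point: large residual at high leverage.
d_influential = cooks_distance(3.0, 0.5, 7)
```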
FIGURE 8. Residual Analysis of Final Model
Evolution
The data used for modeling contains only demographic and geographic information;
however, other factors, such as the prices of insurance products, are also very
important. Such data should be included in the model in future study.
In this project we use a linear regression model to make predictions. However, to
obtain a smaller prediction interval, Poisson or Binomial regression would be a better
approach if cross-categorized count data were available.
The results of PCA depend on the scaling of the variables. If we want to make a
prediction using the PCA variable work1 in our model, we first have to scale the
variables Unem, LaFrc, FT, NonFT and NoWork before obtaining the value of work1.
This step causes a problem because the mean and standard deviation used to scale the
training data do not take the new data into account, while updated mean and standard
deviation would change the coefficients of the predictors. How to use PCA in
prediction deserves more study.
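The scaling point can be made concrete: a new county's values must be standardized with the mean and standard deviation saved from the training data before applying the PCA loadings, not with statistics recomputed after adding the new observation. A minimal Python sketch with hypothetical values for one employment variable:

```python
from statistics import mean, stdev

# Training-time values of one employment variable (hypothetical numbers).
train = [55.0, 48.0, 60.0, 40.0, 52.0]
mu, sd = mean(train), stdev(train)  # frozen at training time

def standardize_new(x, mu, sd):
    # Reuse the stored training statistics; recomputing them with the new
    # point included would silently change every PCA score.
    return (x - mu) / sd

z = standardize_new(58.0, mu, sd)
```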
Appendix
Appendix A. Code
############## Raw data is Data_Group8.csv ##
test=read.csv("Data_Group8.csv",header=TRUE)
############## First check of data ################
plot(test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure), "red", "black")
,ylab="Uninsure")
par(mfrow=c(4,3))
plot(test$Age_18_64,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Age_18_64", ylab="Uninsure")
plot(test$HgSch_below,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="HgSch_below", ylab="Uninsure")
plot(test$Unem,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Unem", ylab="Uninsure")
plot(test$LaFrc,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="LaFrc", ylab="Uninsure")
plot(test$FT,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="FT", ylab="Uninsure")
plot(test$NonFT,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="NonFT", ylab="Uninsure")
plot(test$NoWork,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="NoWork", ylab="Uninsure")
plot(test$NtBrn,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="NtBrn", ylab="Uninsure")
plot(test$Female,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Female", ylab="Uninsure")
plot(test$Black,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Black", ylab="Uninsure")
plot(test$Inc2Pov_High,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Inc2Pov_High", ylab="Uninsure")
plot(test$Income,test$Uninsure, col=ifelse(test$Uninsure==max(test$Uninsure),
"red", "black"), xlab="Income", ylab="Uninsure")
############## Remove the outlier Obs 50 from data #######
test[which(test$Uninsure==max(test$Uninsure)),]
data=data.frame(test[-50,-1])
attach(data)
############## PCA ###########################
############## Check the colinearity #########
testpca=cbind(Unem,LaFrc,FT,NonFT,NoWork)
round(cor(testpca),3)
testpcam= as.data.frame(testpca)
plot(testpcam)
############## Start PCA #####################
first<-princomp(testpcam,cor=TRUE) # use the correlation matrix, i.e. standardized variables
summary(first)
plot(first)
first$loadings
first$scores
first$scores[,1]
first$scores[,2]
############# Add scores in the table ########
data$work1 <- first$scores[,1]
data$work2 <- first$scores[,2]
############# Reset data frame. data1 is the dataset used for modeling #####
data1=data[,-4:-8]
attach(data1)
############# Regression ###########
###Regression with PCA score
out=lm(Uninsure~Age_18_64+HgSch_below+NtBrn+
Black+Inc2Pov_High+as.factor(Yr)+Female+work1+work2+Income)
summary(out)
par(mfrow=c(2,2))
plot(out,which=c(1,2,3,4))
############## Data transformation ##
library(HH)
ladder(Uninsure~HgSch_below , data=data1)
ladder(Uninsure~Inc2Pov_High , data=data1)
Uni<-log(Uninsure)
HS<-log(HgSch_below)
out2=lm(Uni~Age_18_64+HS+NtBrn+Black+Inc2Pov_High+work1+work2+Income+Female+as.factor(Yr))
plot(out2,which=c(1,2,3,4))
summary(out2)
############## Model selection ######
############## Exhaustive Search #####
library(leaps)
x=cbind(Age_18_64,HgSch_below,NtBrn,Black,Inc2Pov_High,as.factor(Yr),work1,work2,Income,Female)
outmc=leaps(x,Uninsure,method="Cp",nbest=2)
outmc
which(outmc$Cp==min(outmc$Cp))
# Outcome: exclude HgSch_below, Black, Yr, work2
############# Stepwise ##############
outstep=step(out)
# outstep is the Model 1 in report
summary(outstep)
plot(outstep,which=c(1,2,3,4))
# Outcome: exclude HgSch_below, Black, Yr, work2
############# LASSO #################
library(lars)
predictor=scale(x)
uniscale=scale(Uninsure)
outla=lars(x=predictor,y=uniscale)
outla
par(mfrow=c(1,1))
plot(outla)
outla$Cp
plot.lars(outla,xvar="df",plottype="Cp")
# Outcome: exclude HgSch_below, Black, Yr, work2
############## Model with interactions ##################
outi=lm(Uninsure~(Age_18_64+NtBrn+Black+Female+work1+work2+Income
+Inc2Pov_High+as.factor(Yr))*HgSch_below)
par(mfrow=c(2,2))
plot(outi,which=c(1,2,3,4))
summary(outi)
outistep=step(outi)
plot(outistep,which=c(1,2,3,4))
summary(outistep)
# outistep is the Model 2 in report
############# Reduced model with interactions ############
outire=lm(Uninsure~Age_18_64+NtBrn+Female+work1+Income+Inc2Pov_High+HgSch_below
+Age_18_64:HgSch_below+NtBrn:HgSch_below+Income:HgSch_below+Inc2Pov_High:HgSch_below)
par(mfrow=c(2,2))
plot(outire,which=c(1,2,3,4))
anova(outire, outistep, test = "F") # partial F test comparing reduced model with Model 2
############ Prediction with Final model #########
############ Final model is outstep ##############
#######Use Fulton as an example
newdata = data.frame(Age_18_64=67.04666, NtBrn=87.0677, Inc2Pov_High=66.8271,
Female=51.46529, work1=-13.71593, Income=5766400)
predict(outstep, newdata, interval="prediction")
Appendix B. ANOVA Table for Full Model
Appendix C. Exhaustive Search Result
Appendix D. Stepwise Process