6101-project report


Upload: love-tyagi

Post on 13-Apr-2017


Page 1: 6101-Project Report

DATS 6101 FINAL PROJECT REPORT

--- Bikeshare Count Prediction Data Analysis Project

Qing Chen, Love Kumar Tyagi, Ye Yang

2016 Fall

1. Abstract

The objective of this project is to use Capital Bikeshare system data to predict how many bikes are demanded in a particular hour of a particular date. To find the best prediction model, we applied three families of methods (decision trees, random forest, and linear regression) for feature selection and prediction. The Root Mean Squared Error (RMSE) and the Root Mean Squared Logarithmic Error (RMSLE) were used to evaluate each model's predictions on a held-out test set. After comparison, a member of the linear regression family, a GAM with interaction terms, was found to be the best model in this project.
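For reference, the two error metrics named above are conventionally defined as follows. The symbols are introduced here for illustration and do not appear in the report: $y_i$ is the observed hourly count, $\hat{y}_i$ the predicted count, and $n$ the number of test observations. These definitions match the computations in the code later in the report.

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{y}_i - y_i\right)^2},
\qquad
\mathrm{RMSLE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\log(\hat{y}_i + 1) - \log(y_i + 1)\right)^2}
```

RMSE penalizes absolute errors, so busy hours dominate it; RMSLE penalizes relative errors, so misses on quiet hours count as much as proportional misses on busy ones.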

2. Background research and question development

a. Motivation and Background research

Many people now use bikeshare programs to commute to school, work, or other places. Bikeshare programs can also help solve the "last mile" transportation problem. As more bike docks are installed and more people use the bikes on a daily basis, we became curious about how these bikes are used and wanted to explore the data science questions this raises.

After some initial research, we found the Capital Bikeshare program, which operates in the Washington, D.C. area. Its website [i] makes system data open to the public for anyone who would like to explore bike usage. We also found a bikeshare prediction data set on the Kaggle website [ii] and in the UCI Machine Learning Repository [iii]. After exploring the data set, we decided to pursue further analysis on this topic.

b. Question Development

The question we developed is to predict how many bikes are demanded in a particular hour of a particular date, given the hourly weather conditions. We consider this a SMART question for the following reasons:

Specific: The question is specific because the prediction targets a particular hour on a particular date. We were able to narrow our prediction to hourly bike counts.

Measurable: Our data set has 10886 observations and 12 variables, which include information that bears on whether or not people would prefer to use a bike, such as temperature, wind speed, and whether the day is a working day or a holiday. With this data set, we believe the question can be measured and answered.

Agreed upon/Answerable: Neither the question nor the variables are complicated. Since the question is very specific, we believe it can be answered without much uncertainty.

Realistic: By predicting hourly bike counts, the program's operations team can better understand the peak hours, days, and months of bike usage, which helps them balance inventory and manage bike storage and maintenance schedules. Our prediction model is therefore realistic for program management.


Time-Based: Our data set spans 2011 and 2012. Considering that weather, working days, and holidays do not change much from year to year, we consider the model time-based and usable for predicting current bike counts.

3. Data Preprocessing and Visualization

a. Data Preprocessing

The data set has 10886 observations and 12 variables. Our response (dependent) variable is ‘count’, the total number of bikes rented per hour. There are two other dependent variables, ‘registered’ and ‘casual’, which record how many bikes were rented by regular registered users and by one-time casual users. The sum of ‘registered’ and ‘casual’ equals ‘count’.

The preprocessing steps were: 1) review the variables, which include date and time, weather, temperature, holiday or working day, etc.; 2) recode ‘datetime’ into four new variables, ‘Year’, ‘Month’, ‘Day’, and ‘Hour’; 3) convert the categorical variables to factors; 4) set up the ‘train’ and ‘test’ sets.
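As an illustration of step 2, the sketch below derives the time variables by parsing the timestamp rather than by cutting substrings. It is an alternative shown under assumptions, not the code used in the report: the two-row `sample_rows` data frame is hypothetical, and the UTC time zone is chosen only to keep the example deterministic.

```r
# Hypothetical two-row sample in the same format as the 'datetime' column
sample_rows <- data.frame(
  datetime = c("2011-01-01 05:00:00", "2012-07-04 17:00:00"),
  stringsAsFactors = FALSE
)

# Parse the text timestamps once, then format out each component
ts <- as.POSIXct(sample_rows$datetime, format = "%Y-%m-%d %H:%M:%S", tz = "UTC")
sample_rows$year  <- factor(format(ts, "%Y"))
sample_rows$month <- factor(format(ts, "%m"))
sample_rows$hour  <- factor(as.integer(format(ts, "%H")))
sample_rows$day   <- factor(weekdays(ts))  # day names depend on the session locale

str(sample_rows)
```

Parsing first means a malformed timestamp shows up as NA instead of silently producing a wrong substring.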

#Data Preprocessing
#set working directory
setwd("D:/GWU DATA SCI/6101/Project/BIKESHARE")

#read in train/test
train <- read.csv("train.csv")

#Data Preprocessing-----------------------------------------------------------------------------------------
summary(is.na(train))

##   datetime         season          holiday         workingday
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:10886     FALSE:10886     FALSE:10886     FALSE:10886
##  NA's :0         NA's :0         NA's :0         NA's :0
##   weather          temp            atemp           humidity
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:10886     FALSE:10886     FALSE:10886     FALSE:10886
##  NA's :0         NA's :0         NA's :0         NA's :0
##  windspeed        casual          registered       count
##  Mode :logical   Mode :logical   Mode :logical   Mode :logical
##  FALSE:10886     FALSE:10886     FALSE:10886     FALSE:10886
##  NA's :0         NA's :0         NA's :0         NA's :0

head(train)

##              datetime season holiday workingday weather temp  atemp
## 1 2011-01-01 00:00:00      1       0          0       1 9.84 14.395
## 2 2011-01-01 01:00:00      1       0          0       1 9.02 13.635
## 3 2011-01-01 02:00:00      1       0          0       1 9.02 13.635
## 4 2011-01-01 03:00:00      1       0          0       1 9.84 14.395
## 5 2011-01-01 04:00:00      1       0          0       1 9.84 14.395
## 6 2011-01-01 05:00:00      1       0          0       2 9.84 12.880
##   humidity windspeed casual registered count
## 1       81    0.0000      3         13    16
## 2       80    0.0000      8         32    40
## 3       80    0.0000      5         27    32
## 4       75    0.0000      3         10    13
## 5       75    0.0000      0          1     1
## 6       75    6.0032      0          1     1

#factorize training set
train_factor <- train
train_factor$weather <- factor(train$weather)
train_factor$holiday <- factor(train$holiday)
train_factor$workingday <- factor(train$workingday)
train_factor$season <- factor(train$season)
str(train)

## 'data.frame':    10886 obs. of  12 variables:
##  $ datetime  : Factor w/ 10886 levels "2011-01-01 00:00:00",..: 1 2 3 4 5 6 7 8 9 10 ...
##  $ season    : int  1 1 1 1 1 1 1 1 1 1 ...
##  $ holiday   : int  0 0 0 0 0 0 0 0 0 0 ...
##  $ workingday: int  0 0 0 0 0 0 0 0 0 0 ...
##  $ weather   : int  1 1 1 1 1 2 1 1 1 1 ...
##  $ temp      : num  9.84 9.02 9.02 9.84 9.84 ...
##  $ atemp     : num  14.4 13.6 13.6 14.4 14.4 ...
##  $ humidity  : int  81 80 80 75 75 75 80 86 75 76 ...
##  $ windspeed : num  0 0 0 0 0 ...
##  $ casual    : int  3 8 5 3 0 0 2 1 1 8 ...
##  $ registered: int  13 32 27 10 1 1 0 2 7 6 ...
##  $ count     : int  16 40 32 13 1 1 2 3 8 14 ...

#create time column by stripping out timestamp
train_factor$time <- substring(train$datetime,12,20)
train_factor$month <- substring(train$datetime,6,7)
train_factor$year <- substring(train$datetime,1,4)

#factorize new timestamp column
train_factor$time <- factor(train_factor$time)
train_factor$month <- factor(train_factor$month)
train_factor$year <- factor(train_factor$year)

#create day of week column
train_factor$day <- weekdays(as.Date(train_factor$datetime))
train_factor$day <- as.factor(train_factor$day)

#convert time and create $hour as integer to evaluate
train_factor$hour <- as.numeric(substr(train_factor$time,1,2))

#convert hour back to factor
train_factor$hour <- as.factor(train_factor$hour)

#exclude "registered" and "casual"
train_factor$registered <- NULL
train_factor$casual <- NULL

#Assign training data and test data
smp_size <- floor(0.7 * nrow(train_factor))
set.seed(123)
train_index <- sample(seq_len(nrow(train_factor)), size = smp_size)
test_data = train_factor[-train_index,]
train_data = train_factor[train_index,]
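The 70/30 split above fixes the sizes of the two sets. The short arithmetic check below reproduces them from the reported row count; the names `n_total`, `n_train`, and `n_test` are introduced here for illustration only.

```r
n_total <- 10886                  # observations reported for the data set
n_train <- floor(0.7 * n_total)   # same rule as smp_size in the report
n_test  <- n_total - n_train      # the rows left over for testing

c(train = n_train, test = n_test)
# train 7620, test 3266
```

These values agree with the `n = 7620` line in the GAM summaries and the `dim=c(3266,1)` test array used later in the report.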

b. Data Visualization

Figure 3.b.1 Counts vs. Hours and Seasons Plot. Fall and summer have the highest bike usage, and 8:00 and 17:00 are the peak hours of the day.

Figure 3.b.2 Counts vs. Hours and Weekday Plot. Tuesday, Wednesday, and Thursday have the highest bike usage during the 8:00 and 17:00 peak hours; Saturday and Sunday have a different peak, from 13:00 to 15:00.

Figure 3.b.3 Counts vs. Hours and Temperature Plot. The peak hours are consistent with the previous charts, at 8:00 and 17:00. It also appears that the higher the temperature, the more bikes were rented.

4. Feature Selection


Feature selection is an important step before any model analysis. We applied three methods: Boruta, Random Forest, and StepAIC. The results for each method are as follows:

a. BORUTA

Figure 4.a.1. Boruta Feature Selection. Features in decreasing order of importance: hour, year, humidity, workingday, temp, month, atemp, weather, day, season, windspeed, holiday

b. Random Forest

Figure 4.b.1 Random Forest Feature Selection. Features in decreasing order of importance: hour, year, workingday, month, humidity, atemp, temp, season, windspeed, holiday

c. StepAIC

Figure 4.c.1. StepAIC Feature Selection. Features in decreasing order of importance: hour, year, month, weather, day, humidity, atemp, windspeed, temp
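The report does not show the feature-selection code. As a rough sketch of the third method, assuming "StepAIC" refers to `stepAIC()` from the MASS package, stepwise selection by AIC could look like the following. The simulated `toy` data frame and its columns (`x1`, `x2`, `noise`, `y`) are hypothetical stand-ins, not the bikeshare data.

```r
library(MASS)  # provides stepAIC()

# Simulated data: y depends on x1 and x2; 'noise' is unrelated to y
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), noise = rnorm(n))
toy$y <- 3 * toy$x1 - 2 * toy$x2 + rnorm(n, sd = 0.5)

# Start from the full model and let AIC decide which terms to keep
full <- lm(y ~ x1 + x2 + noise, data = toy)
selected <- stepAIC(full, direction = "both", trace = FALSE)

formula(selected)
kept <- attr(terms(selected), "term.labels")
kept
```

A term is dropped only when removing it lowers the AIC, so strong predictors such as `x1` and `x2` are retained while weak ones may or may not be.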

5. Model Analysis

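a. Simple Decision Tree

#formula1 is defined in section 5.b below
library(rpart)
fit.random_tree <- rpart(formula1, data=train_data)
predict.rtree <- predict(fit.random_tree,test_data)

a <- as.data.frame(predict.rtree)

#note: despite the MSE_ prefix, this value is the root mean squared error (RMSE)
MSE_rtree <- sqrt(colSums((a - test_data$count)^2)/length(test_data$count))
MSE_rtree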

## predict.rtree 
##      92.52884

RMSLE_rtree <- sqrt(colSums((log(a+1) - log(test_data$count+1))^2)/length(test_data$count))
RMSLE_rtree

## predict.rtree 
##     0.8302603
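The same two formulas are repeated for every model in this report, stored in variables with an `MSE_` or `RMSLE_` prefix (the `MSE_` values are in fact RMSE, since the square root is taken). A minimal sketch of how they could be wrapped in helper functions follows; `rmse()` and `rmsle()` are hypothetical names introduced here, not functions used in the report.

```r
# Root mean squared error between predictions and observed values
rmse <- function(pred, actual) {
  sqrt(mean((pred - actual)^2))
}

# Root mean squared logarithmic error; log1p(x) computes log(1 + x)
rmsle <- function(pred, actual) {
  sqrt(mean((log1p(pred) - log1p(actual))^2))
}

# Tiny made-up example: only the third prediction is off (5 instead of 3)
actual <- c(1, 2, 3)
pred   <- c(1, 2, 5)
rmse(pred, actual)   # sqrt(4 / 3)
rmsle(pred, actual)  # sqrt((log(6) - log(4))^2 / 3)
```

With such helpers, each model's evaluation would reduce to two calls, for example `rmse(predict.rtree, test_data$count)`.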

plot(test_data$count, predict.rtree)

library(ggplot2)
a1 <- cbind(actual=test_data$count,predit=predict.rtree)
b1 <- as.data.frame(a1)
ggplot(data=b1,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("rtree_prediction")

b. Conditional Inference Tree

#Model Selection------------------------------------------------------------------------------------
################################Conditional Inference Tree###################################################
library('party')

## Warning: package 'party' was built under R version 3.3.2

## Loading required package: grid

## Loading required package: mvtnorm

## Loading required package: modeltools

## Loading required package: stats4

## Loading required package: strucchange

## Warning: package 'strucchange' was built under R version 3.3.2

## Loading required package: zoo

## Warning: package 'zoo' was built under R version 3.3.2

## 
## Attaching package: 'zoo'

## The following objects are masked from 'package:base':
## 
##     as.Date, as.Date.numeric

## Loading required package: sandwich

## Warning: package 'sandwich' was built under R version 3.3.2

#create our formula
#note: 'hour' is listed twice; R keeps a single copy of a repeated term
formula1 <- count ~ season + holiday + workingday + weather + temp + atemp + humidity + hour + windspeed + day + year + hour
colnames(train_data)

##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "count"     
## [11] "time"       "month"      "year"       "day"        "hour"

#build our model
fit.ctree <- ctree(formula1, data=train_data)

#run model against test data set
head(test_data)

##               datetime season holiday workingday weather  temp  atemp
## 2  2011-01-01 01:00:00      1       0          0       1  9.02 13.635
## 15 2011-01-01 14:00:00      1       0          0       2 18.86 22.725
## 18 2011-01-01 17:00:00      1       0          0       2 18.04 21.970
## 19 2011-01-01 18:00:00      1       0          0       3 17.22 21.210
## 21 2011-01-01 20:00:00      1       0          0       2 16.40 20.455
## 23 2011-01-01 22:00:00      1       0          0       2 16.40 20.455
##    humidity windspeed count     time month year      day hour
## 2        80    0.0000    40 01:00:00    01 2011 Saturday    1
## 15       72   19.0012   106 14:00:00    01 2011 Saturday   14
## 18       82   19.0012    67 17:00:00    01 2011 Saturday   17
## 19       88   16.9979    35 18:00:00    01 2011 Saturday   18
## 21       87   16.9979    36 20:00:00    01 2011 Saturday   20
## 23       94   15.0013    28 22:00:00    01 2011 Saturday   22

predict.ctree <- predict(fit.ctree, test_data)
MSE_ctree <- sqrt(colSums((predict.ctree - test_data$count)^2)/length(test_data$count))
MSE_ctree

##    count 
## 65.18647

RMSLE_ctree <- sqrt(colSums((log(predict.ctree+1) - log(test_data$count+1))^2)/length(test_data$count))
RMSLE_ctree

##     count 
## 0.4516775

plot(test_data$count,predict.ctree)

a <- cbind(actual=test_data$count,predit=predict.ctree)
b <- as.data.frame(a)
library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

ggplot(data=b,aes(x=actual,y=count))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("ctree_prediction")

c. Random Forest

## Lets put multiple trees there :
library(caret)

## Loading required package: lattice

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

## 
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
## 
##     margin

rf<- randomForest(formula1,data=train_data,ntree=500,importance=T)

predic_rf_tree<-predict(rf,test_data)


c<-as.data.frame(predic_rf_tree)

MSE_rf <- sqrt(colSums((c - test_data$count)^2)/length(test_data$count))
MSE_rf

## predic_rf_tree 
##       55.75266

RMSLE_rf <- sqrt(colSums((log(c+1) - log(test_data$count+1))^2)/length(test_data$count))
RMSLE_rf

## predic_rf_tree 
##      0.6640107

plot(test_data$count,predic_rf_tree)

a2 <- cbind(actual=test_data$count,predit=predic_rf_tree)
b2 <- as.data.frame(a2)
ggplot(data=b2,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("rf_prediction")

d. Linear Model

The linear model families we fitted are a linear model with some transformations, LASSO (Least Absolute Shrinkage and Selection Operator), GLS (Generalized Least Squares), and GAM (Generalized Additive Model). Before interaction terms were added, the best model among the linear model families was the GAM.

There are clearly interactions between the variables: after testing all the variables, 30 of the 36 interaction terms had significantly small p-values. After adding these interaction terms to the GAM, we found that the GAM with interaction terms is the best model in this project, producing the smallest RMSE.
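The report does not show how the interaction terms were tested. One common way is to compare nested models with and without the interaction using an F-test, sketched below on simulated data. Everything here is an illustrative assumption: the `toy` data frame, the `peak` effect, and the use of `lm()` in place of `gam()` to keep the example short.

```r
set.seed(42)
n <- 500
toy <- data.frame(
  hour    = factor(sample(c("8", "13", "17"), n, replace = TRUE)),
  workday = factor(sample(c("no", "yes"), n, replace = TRUE))
)

# Simulated effect: the 8:00 and 17:00 peaks exist only on working days,
# so the effect of 'hour' depends on 'workday' (an interaction)
peak <- toy$hour %in% c("8", "17") & toy$workday == "yes"
toy$y <- 2 + 1.5 * peak + rnorm(n, sd = 0.3)

additive <- lm(y ~ hour + workday, data = toy)
with_int <- lm(y ~ hour + workday + hour:workday, data = toy)

# F-test of the nested models; a small p-value favors keeping hour:workday
cmp <- anova(additive, with_int)
cmp
```

The same comparison applies pairwise to the bikeshare predictors, for example `hour:day`, which Figure 3.b.2 suggests should matter because weekday and weekend peaks differ.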

#################################Linear Model######################################
#GAM
library(mgcv)

## Warning: package 'mgcv' was built under R version 3.3.2

## Loading required package: nlme

## This is mgcv 1.8-16. For overview type 'help("mgcv-package")'.

train_gam1 <- gam(sign(count)*log(1+abs(count)) ~ hour + atemp + year + month + weather +
                    s(humidity) + day + s(windspeed) + s(temp),
                  data=train_data)
summary(train_gam1)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sign(count) * log(1 + abs(count)) ~ hour + atemp + year + month + 
##     weather + s(humidity) + day + s(windspeed) + s(temp)
## 
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept)   3.087535   0.121694  25.371  < 2e-16 ***
## hour1        -0.648452   0.045479 -14.258  < 2e-16 ***
## hour2        -1.111347   0.045509 -24.421  < 2e-16 ***
## hour3        -1.572124   0.046460 -33.839  < 2e-16 ***
## hour4        -1.852451   0.045615 -40.611  < 2e-16 ***
## hour5        -0.892272   0.045420 -19.645  < 2e-16 ***
## hour6         0.278044   0.045221   6.149 8.22e-10 ***
## hour7         1.213440   0.045456  26.695  < 2e-16 ***
## hour8         1.846469   0.045098  40.944  < 2e-16 ***
## hour9         1.530683   0.044975  34.034  < 2e-16 ***
## hour10        1.230857   0.045794  26.878  < 2e-16 ***
## hour11        1.313468   0.045819  28.666  < 2e-16 ***
## hour12        1.529177   0.045699  33.462  < 2e-16 ***
## hour13        1.519830   0.046315  32.815  < 2e-16 ***
## hour14        1.432522   0.046937  30.520  < 2e-16 ***
## hour15        1.478896   0.047065  31.422  < 2e-16 ***
## hour16        1.731892   0.046766  37.034  < 2e-16 ***
## hour17        2.156625   0.046373  46.506  < 2e-16 ***
## hour18        2.057871   0.046746  44.023  < 2e-16 ***
## hour19        1.745562   0.045588  38.290  < 2e-16 ***
## hour20        1.452439   0.045191  32.140  < 2e-16 ***
## hour21        1.204259   0.044597  27.003  < 2e-16 ***
## hour22        0.956836   0.045153  21.191  < 2e-16 ***
## hour23        0.566113   0.045122  12.546  < 2e-16 ***
## atemp         0.006809   0.004682   1.454 0.145953
## year2012      0.460834   0.013479  34.189  < 2e-16 ***
## month02       0.146098   0.033323   4.384 1.18e-05 ***
## month03       0.206472   0.036141   5.713 1.15e-08 ***
## month04       0.358553   0.038551   9.301  < 2e-16 ***
## month05       0.546492   0.042975  12.717  < 2e-16 ***
## month06       0.506235   0.048497  10.438  < 2e-16 ***
## month07       0.491903   0.054903   8.960  < 2e-16 ***
## month08       0.476436   0.053332   8.933  < 2e-16 ***
## month09       0.541083   0.047280  11.444  < 2e-16 ***
## month10       0.646250   0.041044  15.745  < 2e-16 ***
## month11       0.644651   0.036188  17.814  < 2e-16 ***
## month12       0.657331   0.035429  18.554  < 2e-16 ***
## weather2     -0.041873   0.016305  -2.568 0.010247 *
## weather3     -0.426053   0.029246 -14.568  < 2e-16 ***
## weather4     -0.026281   0.571711  -0.046 0.963336
## dayMonday    -0.148667   0.024890  -5.973 2.44e-09 ***
## daySaturday   0.010959   0.024522   0.447 0.654968
## daySunday    -0.100481   0.024424  -4.114 3.93e-05 ***
## dayThursday  -0.082825   0.024651  -3.360 0.000784 ***
## dayTuesday   -0.166900   0.024861  -6.713 2.04e-11 ***
## dayWednesday -0.141635   0.024816  -5.707 1.19e-08 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df      F  p-value
## s(humidity)  8.736  8.968 21.692  < 2e-16 ***
## s(windspeed) 4.178  5.100  4.587 0.000333 ***
## s(temp)      5.538  6.627 17.030  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## R-sq.(adj) =  0.837   Deviance explained = 83.8%
## GCV = 0.32687  Scale est. = 0.3241    n = 7620

pred_gam1 <- as.numeric(predict(train_gam1,newdata= test_data))
#back-transform to the count scale; the exact inverse of log(1+count) is exp(x)-1,
#so the predictions below, computed with exp(x), are shifted up by 1
pred_gam1 <- exp(pred_gam1)
array_test <- array(test_data$count,dim=c(3266,1))

MSE_gam1 <- sqrt(colSums((pred_gam1- array_test)^2)/length(array_test))
MSE_gam1

## [1] 93.09239

RMSLE_gam1 <- sqrt(colSums((log((pred_gam1+1)) - log(array_test+1))^2)/length(array_test))
RMSLE_gam1

## [1] 0.5776358
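The GAM above models `log(1 + count)` (for non-negative counts, `sign(count)*log(1+abs(count))` reduces to that), so predictions have to be transformed back to the count scale. The sketch below, on a few made-up counts, compares the exact inverse `expm1()` with the plain `exp()`; the variable names are introduced here for illustration.

```r
count <- c(0, 1, 16, 40, 977)   # made-up hourly counts

z <- log1p(count)               # forward transform: log(1 + count)

back_exact <- expm1(z)          # exact inverse: exp(z) - 1
back_exp   <- exp(z)            # plain exp(), without subtracting 1

all.equal(back_exact, count)    # TRUE: the round trip recovers the counts
back_exp - count                # every element is off by exactly 1
```

The constant offset of 1 is small next to busy-hour counts but proportionally large for quiet hours, which is where RMSLE is most sensitive.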

plot(test_data$count,pred_gam1)

a3 <- cbind(actual=test_data$count,predit=pred_gam1)
b3 <- as.data.frame(a3)
ggplot(data=b3,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("gam_prediction")

#GAM with full interaction terms

train_gam3 <- gam(sign(count)*log(1+abs(count)) ~ hour + day + month + year + atemp + weather +
                    s(humidity) + s(windspeed) + s(temp) +
                    hour:year + hour:day + hour:month + day:year + month:year +
                    hour:atemp + hour:humidity + hour:windspeed + hour:temp +
                    day:atemp + day:humidity + day:windspeed + day:temp +
                    month:atemp + month:humidity + month:windspeed + month:temp +
                    year:atemp + year:humidity + year:temp +
                    atemp:weather + atemp:humidity + atemp:windspeed + atemp:temp +
                    weather:humidity + weather:temp + weather:windspeed +
                    humidity:windspeed + humidity:temp + windspeed:temp,
                  data=train_data)

summary(train_gam3)

## 
## Family: gaussian 
## Link function: identity 
## 
## Formula:
## sign(count) * log(1 + abs(count)) ~ hour + day + month + year + 
##     atemp + weather + s(humidity) + s(windspeed) + s(temp) + 
##     hour:year + hour:day + hour:month + day:year + month:year + 
##     hour:atemp + hour:humidity + hour:windspeed + hour:temp + 
##     day:atemp + day:humidity + day:windspeed + day:temp + month:atemp + 
##     month:humidity + month:windspeed + month:temp + year:atemp + 
##     year:humidity + year:temp + atemp:weather + atemp:humidity + 
##     atemp:windspeed + atemp:temp + weather:humidity + weather:temp + 
##     weather:windspeed + humidity:windspeed + humidity:temp + 
##     windspeed:temp
## 
## Parametric coefficients:
##                    Estimate Std. Error t value Pr(>|t|)
## (Intercept)      -6.917e-03  3.374e-02  -0.205 0.837565
## hour1            -6.701e-01  1.692e-01  -3.961 7.55e-05 ***
## hour2            -1.127e+00  1.763e-01  -6.393 1.73e-10 ***
## hour3            -1.803e+00  1.791e-01 -10.063  < 2e-16 ***
## hour4            -1.914e+00  1.765e-01 -10.843  < 2e-16 ***
## hour5            -9.062e-01  1.755e-01  -5.162 2.51e-07 ***
## hour6             3.588e-01  1.878e-01   1.911 0.056077 .
## hour7             1.511e+00  1.689e-01   8.944  < 2e-16 ***
## hour8             2.256e+00  1.672e-01  13.490  < 2e-16 ***
## hour9             1.938e+00  1.695e-01  11.436  < 2e-16 ***
## hour10            1.366e+00  1.705e-01   8.010 1.34e-15 ***
## hour11            1.443e+00  1.736e-01   8.314  < 2e-16 ***
## hour12            1.722e+00  1.678e-01  10.261  < 2e-16 ***
## hour13            1.672e+00  1.761e-01   9.492  < 2e-16 ***
## hour14            1.669e+00  1.734e-01   9.621  < 2e-16 ***
## hour15            1.878e+00  1.759e-01  10.673  < 2e-16 ***
## hour16            2.053e+00  1.790e-01  11.474  < 2e-16 ***
## hour17            2.563e+00  1.729e-01  14.821  < 2e-16 ***
## hour18            2.338e+00  1.714e-01  13.636  < 2e-16 ***
## hour19            1.924e+00  1.669e-01  11.524  < 2e-16 ***
## hour20            1.448e+00  1.692e-01   8.555  < 2e-16 ***
## hour21            9.849e-01  1.650e-01   5.970 2.49e-09 ***
## hour22            1.021e+00  1.709e-01   5.973 2.45e-09 ***
## hour23            6.644e-01  1.688e-01   3.937 8.32e-05 ***
## dayMonday        -6.055e-01  1.029e-01  -5.883 4.22e-09 ***
## daySaturday       3.731e-01  1.006e-01   3.708 0.000211 ***
## daySunday         4.133e-01  1.020e-01   4.053 5.10e-05 ***
## dayThursday      -3.308e-01  1.026e-01  -3.225 0.001267 **
## dayTuesday       -6.425e-01  1.022e-01  -6.289 3.38e-10 ***
## dayWednesday     -2.919e-01  1.050e-01  -2.780 0.005455 **
## month02          -1.113e-01  1.370e-01  -0.813 0.416300
## month03           4.190e-02  1.406e-01   0.298 0.765671
## month04           2.126e-01  1.734e-01   1.226 0.220152
## month05           4.821e-01  2.369e-01   2.035 0.041922 *
## month06           1.265e-02  3.123e-01   0.040 0.967699
## month07           5.085e-01  4.373e-01   1.163 0.244951
## month08          -7.012e-01  4.280e-01  -1.638 0.101391
## month09           3.591e-01  2.805e-01   1.281 0.200405
## month10           8.774e-01  2.197e-01   3.993 6.58e-05 ***
## month11           6.011e-01  1.562e-01   3.848 0.000120 ***
## month12           7.803e-01  1.435e-01   5.438 5.59e-08 ***
## year2012          6.162e-01  5.966e-02  10.328  < 2e-16 ***
## atemp             4.550e-02  3.250e-02   1.400 0.161562
## weather2          1.168e-01  5.473e-02   2.135 0.032785 *
## weather3          2.659e-02  1.268e-01   0.210 0.833905
## weather4         -4.147e-05  4.158e-05  -0.997 0.318661
## hour1:year2012   -8.871e-03  4.984e-02  -0.178 0.858752
## hour2:year2012   -1.712e-01  4.944e-02  -3.462 0.000539 ***
## hour3:year2012   -1.257e-01  5.044e-02  -2.493 0.012683 *
## hour4:year2012   -1.236e-01  4.977e-02  -2.482 0.013072 *
## hour5:year2012    1.703e-01  4.972e-02   3.426 0.000616 ***
## hour6:year2012    1.276e-01  4.931e-02   2.587 0.009707 **
## hour7:year2012    2.066e-01  4.945e-02   4.178 2.98e-05 ***
## hour8:year2012    1.599e-01  4.899e-02   3.263 0.001108 **
## hour9:year2012    1.179e-01  4.876e-02   2.418 0.015644 *
## hour10:year2012   1.021e-01  4.946e-02   2.064 0.039015 *
## hour11:year2012   9.192e-02  4.962e-02   1.852 0.064029 .
## hour12:year2012   8.333e-02  4.942e-02   1.686 0.091809 .
## hour13:year2012   8.128e-02  5.006e-02   1.624 0.104521
## hour14:year2012   7.087e-02  5.104e-02   1.388 0.165069
## hour15:year2012   9.774e-02  5.090e-02   1.920 0.054867 .
## hour16:year2012   7.577e-02  5.077e-02   1.492 0.135622
## hour17:year2012   2.632e-02  5.022e-02   0.524 0.600245
## hour18:year2012   6.278e-02  5.099e-02   1.231 0.218280
## hour19:year2012   6.382e-02  4.995e-02   1.278 0.201350
## hour20:year2012   3.992e-02  4.922e-02   0.811 0.417352
## hour21:year2012   5.207e-02  4.853e-02   1.073 0.283276
## hour22:year2012   2.773e-02  4.914e-02   0.564 0.572548
## hour23:year2012   7.591e-03  4.923e-02   0.154 0.877459
## hour1:dayMonday  -5.560e-02  9.281e-02  -0.599 0.549092
## hour2:dayMonday   4.737e-02  9.309e-02   0.509 0.610875
## hour3:dayMonday   2.540e-01  9.348e-02   2.717 0.006595 **
## hour4:dayMonday   5.904e-01  8.992e-02   6.566 5.53e-11 ***
## hour5:dayMonday   2.663e-01  8.959e-02   2.973 0.002964 **
## hour6:dayMonday   3.383e-01  8.865e-02   3.816 0.000137 ***
## hour7:dayMonday   3.813e-01  8.897e-02   4.286 1.85e-05 ***
## hour8:dayMonday   2.748e-01  9.159e-02   3.000 0.002710 **
## hour9:dayMonday   3.329e-01  9.310e-02   3.576 0.000352 ***
## hour10:dayMonday  3.580e-01  9.111e-02   3.930 8.59e-05 ***
## hour11:dayMonday  3.844e-01  9.065e-02   4.241 2.26e-05 ***
## hour12:dayMonday  4.185e-01  8.882e-02   4.712 2.50e-06 ***
## hour13:dayMonday  3.675e-01  9.289e-02   3.957 7.67e-05 ***
## hour14:dayMonday  4.094e-01  9.335e-02   4.385 1.17e-05 ***
## hour15:dayMonday  3.748e-01  9.350e-02   4.008 6.18e-05 ***
## hour16:dayMonday  4.120e-01  9.202e-02   4.477 7.69e-06 ***
## hour17:dayMonday  4.999e-01  9.247e-02   5.406 6.65e-08 ***
## hour18:dayMonday  6.412e-01  9.078e-02   7.063 1.78e-12 ***
## hour19:dayMonday  5.618e-01  9.142e-02   6.145 8.43e-10 ***
## hour20:dayMonday  6.488e-01  8.983e-02   7.222 5.65e-13 ***
## hour21:dayMonday  4.991e-01  8.917e-02   5.597 2.27e-08 ***
## hour22:dayMonday  1.327e-01  9.073e-02   1.462 0.143706
## ……
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Approximate significance of smooth terms:
##                edf Ref.df      F  p-value
## s(humidity)  8.473  8.582 44.022  < 2e-16 ***
## s(windspeed) 4.827  5.784  4.338 0.000297 ***
## s(temp)      5.184  6.485  6.500 3.71e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Rank: 681/688
## R-sq.(adj) =  0.955   Deviance explained = 95.9%
## GCV = 0.097787  Scale est. = 0.089137  n = 7620

pred_gam3 <- as.numeric(predict(train_gam3,newdata= test_data))
pred_gam3 <- exp(pred_gam3)
array_test <- array(test_data$count,dim=c(3266,1))
RMSLE_gam3 <- sqrt(colSums((log((pred_gam3+1)) - log(array_test+1))^2)/length(array_test))
RMSLE_gam3
#[1] 0.3215348

## [1] 0.3215348

MSE2 <- sqrt(colSums((pred_gam3 - array_test)^2)/length(array_test))
MSE2 #[1] 48.92687

## [1] 48.92687

plot(test_data$count,pred_gam3)

a4 <- cbind(actual=test_data$count,predit=pred_gam3)
b4 <- as.data.frame(a4)
ggplot(data=b4,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("gam_int_prediction")

e. Another Approach - Registered users and Casual users

Since we believe that the characteristics of registered users differ from those of casual users, we tried another approach: predict the registered count and the casual count separately, and then add the two predicted values together as the final ‘count’.
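The idea can be sketched on simulated data before looking at the real models: fit one model per user type, then sum the two predictions and score the sum against the total. The `toy` data frame, its coefficients, and the use of `lm()` are illustrative assumptions, not the report's data or models.

```r
set.seed(7)
n <- 300
toy <- data.frame(temp = runif(n, 0, 35))

# Simulated counts: the two user groups respond differently to temperature
toy$registered <- pmax(0, round(50 + 3 * toy$temp + rnorm(n, sd = 5)))
toy$casual     <- pmax(0, round(5 + 1 * toy$temp + rnorm(n, sd = 3)))
toy$count      <- toy$registered + toy$casual

train_toy <- toy[1:200, ]
test_toy  <- toy[201:300, ]

# One model per user type
fit_reg <- lm(registered ~ temp, data = train_toy)
fit_cas <- lm(casual ~ temp, data = train_toy)

# The predicted total is the sum of the two separate predictions
pred_total <- predict(fit_reg, test_toy) + predict(fit_cas, test_toy)
rmse_total <- sqrt(mean((pred_total - test_toy$count)^2))
rmse_total
```

Splitting the target only pays off when the two groups follow different patterns, which is why the combined error is compared with the single-model error for each method below.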

#Model Selection------------------------------------------------------------------------------------
################################Conditional Inference Tree###################################################
library('party')

library(ggplot2)

## Warning: package 'ggplot2' was built under R version 3.3.2

#create our formula
formula1 <- registered ~ season + holiday + workingday + weather + temp + atemp + humidity + hour + windspeed + day + year + hour
formula2 <- casual ~ season + holiday + workingday + weather + temp + atemp + humidity + hour + windspeed + day + year + hour
colnames(train_data)

##  [1] "datetime"   "season"     "holiday"    "workingday" "weather"   
##  [6] "temp"       "atemp"      "humidity"   "windspeed"  "casual"    
## [11] "registered" "count"      "time"       "month"      "year"      
## [16] "day"        "hour"

#build our model, 1 for registered, 2 for casual
fit.ctree1 <- ctree(formula1, data=train_data)
fit.ctree2 <- ctree(formula2, data=train_data)

predict.ctree1 <- predict(fit.ctree1, test_data)
predict.ctree2 <- predict(fit.ctree2, test_data)
MSE1_ctree <- sqrt(colSums((predict.ctree1 - test_data$registered)^2)/length(test_data$registered))
MSE1_ctree

## registered 
##   48.67681

MSE2_ctree <- sqrt(colSums((predict.ctree2 - test_data$casual)^2)/length(test_data$casual))
MSE2_ctree

##   casual 
## 19.96793

total_MSE_ctree <- sqrt(colSums((predict.ctree1 + predict.ctree2 - test_data$count)^2)/length(test_data$count))
total_MSE_ctree

## registered 
##   56.18037

RMSLE1_ctree <- sqrt(colSums((log(predict.ctree1+1) - log(test_data$registered+1))^2)/length(test_data$registered))
RMSLE1_ctree

## registered 
##  0.4215073

RMSLE2_ctree <- sqrt(colSums((log(predict.ctree2+1) - log(test_data$casual+1))^2)/length(test_data$casual))
RMSLE2_ctree

##    casual 
## 0.6089925

total_RMSLE_ctree <- sqrt(colSums((log(predict.ctree1 + predict.ctree2 +1) - log(test_data$count+1))^2)/length(test_data$count))
total_RMSLE_ctree

## registered 
##  0.4139103

mean(test_data$count)

## [1] 188.5585

plot(test_data$count,predict.ctree1 + predict.ctree2)

a <- cbind(actual=test_data$count,predit=predict.ctree1 + predict.ctree2)
b <- as.data.frame(a)
colnames(b) <- c("actual","predit")
ggplot(data=b,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("ctree2_prediction")

#############################Simple Decision Tree (rpart)##########################################

library(rpart)
fit.random_tree1 <- rpart(formula1, data=train_data)
fit.random_tree2 <- rpart(formula2, data=train_data)
predict.rtree1 <- predict(fit.random_tree1,test_data)
predict.rtree2 <- predict(fit.random_tree2,test_data)
a1 <- as.data.frame(predict.rtree1)
a2 <- as.data.frame(predict.rtree2)
MSE_rtree1 <- sqrt(colSums((a1 - test_data$registered)^2)/length(test_data$registered))
MSE_rtree1

## predict.rtree1 
##       74.36998

MSE_rtree2 <- sqrt(colSums((a2 - test_data$casual)^2)/length(test_data$casual))
MSE_rtree2

## predict.rtree2 
##       24.61291

total_MSE_rtree <- sqrt(colSums((a1+a2 - test_data$count)^2)/length(test_data$count))
total_MSE_rtree

## predict.rtree1 
##       84.30949

RMSLE_rtree1 <- sqrt(colSums((log(a1+1) - log(test_data$registered+1))^2)/length(test_data$registered))
RMSLE_rtree1

## predict.rtree1 
##      0.9382638

RMSLE_rtree2 <- sqrt(colSums((log(a2+1) - log(test_data$casual+1))^2)/length(test_data$casual))
RMSLE_rtree2

## predict.rtree2 
##      0.9901855

total_RMSLE_rtree <- sqrt(colSums((log(a1+a2+1) - log(test_data$count+1))^2)/length(test_data$count))
total_RMSLE_rtree

## predict.rtree1 
##      0.9295162

plot(test_data$count,predict.rtree1 + predict.rtree2)

a1 <- cbind(actual=test_data$count,predit=predict.rtree1 + predict.rtree2)
b1 <- as.data.frame(a1)
colnames(b1) <- c("actual","predit")
ggplot(data=b1,aes(x=actual,y=predit))+
  theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5)+
  geom_point(color='red')+
  geom_abline(color='blue',size=.8)+
  xlab("actual") +
  ylab("rtree2_prediction")

## Lets put multiple trees there :
library(caret)

## Loading required package: lattice

library(randomForest)

## randomForest 4.6-12

## Type rfNews() to see new features/changes/bug fixes.

##
## Attaching package: 'randomForest'

## The following object is masked from 'package:ggplot2':
##
##     margin

rf1 <- randomForest(formula1, data=train_data, ntree=500, importance=TRUE)
rf2 <- randomForest(formula2, data=train_data, ntree=500, importance=TRUE)


predic_rf_tree1<-predict(rf1,test_data)

predic_rf_tree2<-predict(rf2,test_data)

c1 <- as.data.frame(predic_rf_tree1)
c2 <- as.data.frame(predic_rf_tree2)

MSE_rf1 <- sqrt(colSums((c1 - test_data$registered)^2)/length(test_data$registered))
MSE_rf1

## predic_rf_tree1
##         43.6543

MSE_rf2 <- sqrt(colSums((c2 - test_data$casual)^2)/length(test_data$casual))
MSE_rf2

## predic_rf_tree2
##         15.6124

total_MSE_rf <- sqrt(colSums((c1+c2 - test_data$count)^2)/length(test_data$count))
total_MSE_rf

## predic_rf_tree1
##        50.69103

RMSLE_rf1 <- sqrt(colSums((log(c1+1) - log(test_data$registered+1))^2)/length(test_data$registered))
RMSLE_rf1

## predic_rf_tree1
##       0.6794298

RMSLE_rf2 <- sqrt(colSums((log(c2+1) - log(test_data$casual+1))^2)/length(test_data$casual))
RMSLE_rf2

## predic_rf_tree2
##       0.6692723

total_RMSLE_rf <- sqrt(colSums((log(c1+c2+1) - log(test_data$count+1))^2)/length(test_data$count))
total_RMSLE_rf

## predic_rf_tree1
##       0.6706211

plot(test_data$count,predic_rf_tree1 + predic_rf_tree2)


a2 <- cbind(actual=test_data$count, predit=predic_rf_tree1 + predic_rf_tree2)
b2 <- as.data.frame(a2)
colnames(b2) <- c("actual","predit")
ggplot(data=b2, aes(x=actual, y=predit)) + theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5) + geom_point(color='red') +
  geom_abline(color='blue', size=.8) + xlab("actual") + ylab("rf2_prediction")


################################# Linear Model #######################################
#GAM
library(mgcv)

## Warning: package 'mgcv' was built under R version 3.3.2

## Loading required package: nlme

## This is mgcv 1.8-16. For overview type 'help("mgcv-package")'.

train_gam1_1 <- gam(sign(registered)*log(1+abs(registered)) ~ hour + atemp + year + month + weather +
  s(humidity) + day + s(windspeed) + s(temp), data=train_data)
train_gam1_2 <- gam(sign(casual)*log(1+abs(casual)) ~ hour + atemp + year + month + weather +
  s(humidity) + day + s(windspeed) + s(temp), data=train_data)

summary(train_gam1_1)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## sign(registered) * log(1 + abs(registered)) ~ hour + atemp +
##     year + month + weather + s(humidity) + day + s(windspeed) +
##     s(temp)
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 2.939528 0.123869 23.731 < 2e-16 ***
## hour1 -0.658965 0.046332 -14.223 < 2e-16 ***
## hour2 -1.157278 0.046361 -24.962 < 2e-16 ***
## hour3 -1.565177 0.047330 -33.069 < 2e-16 ***
## hour4 -1.844296 0.046470 -39.688 < 2e-16 ***
## hour5 -0.797460 0.046270 -17.235 < 2e-16 ***
## hour6 0.363031 0.046069 7.880 3.72e-15 ***
## hour7 1.302448 0.046306 28.127 < 2e-16 ***
## hour8 1.932096 0.045944 42.054 < 2e-16 ***
## hour9 1.568307 0.045817 34.229 < 2e-16 ***
## hour10 1.170001 0.046650 25.081 < 2e-16 ***
## hour11 1.239613 0.046672 26.560 < 2e-16 ***
## hour12 1.480574 0.046549 31.807 < 2e-16 ***
## hour13 1.458701 0.047178 30.919 < 2e-16 ***
## hour14 1.342673 0.047812 28.082 < 2e-16 ***
## hour15 1.403235 0.047937 29.272 < 2e-16 ***
## hour16 1.716951 0.047634 36.045 < 2e-16 ***
## hour17 2.192092 0.047229 46.415 < 2e-16 ***
## hour18 2.110938 0.047611 44.337 < 2e-16 ***
## hour19 1.800422 0.046432 38.775 < 2e-16 ***
## hour20 1.498993 0.046030 32.565 < 2e-16 ***
## hour21 1.247215 0.045431 27.453 < 2e-16 ***
## hour22 0.997193 0.046000 21.678 < 2e-16 ***
## hour23 0.588747 0.045969 12.807 < 2e-16 ***
## atemp 0.005485 0.004763 1.151 0.2496
## year2012 0.494020 0.013719 36.011 < 2e-16 ***
## month02 0.145554 0.033885 4.296 1.76e-05 ***
## month03 0.156233 0.036720 4.255 2.12e-05 ***
## month04 0.272416 0.039183 6.952 3.89e-12 ***
## month05 0.496453 0.043682 11.365 < 2e-16 ***
## month06 0.506213 0.049299 10.268 < 2e-16 ***
## month07 0.474774 0.055834 8.503 < 2e-16 ***
## month08 0.486382 0.054189 8.976 < 2e-16 ***
## month09 0.529848 0.048071 11.022 < 2e-16 ***
## month10 0.625787 0.041736 14.994 < 2e-16 ***
## month11 0.629557 0.036657 17.174 < 2e-16 ***
## month12 0.682515 0.035943 18.989 < 2e-16 ***
## weather2 -0.029532 0.016606 -1.778 0.0754 .
## weather3 -0.397028 0.029780 -13.332 < 2e-16 ***
## weather4 -0.057407 0.582425 -0.099 0.9215
## dayMonday -0.171355 0.025347 -6.760 1.48e-11 ***
## daySaturday -0.123989 0.024978 -4.964 7.06e-07 ***
## daySunday -0.232855 0.024877 -9.360 < 2e-16 ***
## dayThursday -0.051312 0.025111 -2.043 0.0410 *
## dayTuesday -0.148742 0.025313 -5.876 4.38e-09 ***
## dayWednesday -0.108678 0.025271 -4.301 1.73e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##               edf Ref.df F p-value
## s(humidity) 8.731 8.966 17.946 < 2e-16 ***
## s(windspeed) 3.717 4.592 3.373 0.00746 **
## s(temp) 4.487 5.546 15.247 7.88e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.826 Deviance explained = 82.7%
## GCV = 0.33921 Scale est. = 0.33641 n = 7620

summary(train_gam1_2)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## sign(casual) * log(1 + abs(casual)) ~ hour + atemp + year + month +
##     weather + s(humidity) + day + s(windspeed) + s(temp)
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 0.960477 0.127698 7.521 6.04e-14 ***
## hour1 -0.459840 0.047664 -9.648 < 2e-16 ***
## hour2 -0.709839 0.047700 -14.881 < 2e-16 ***
## hour3 -1.099662 0.048694 -22.583 < 2e-16 ***
## hour4 -1.281651 0.047810 -26.807 < 2e-16 ***
## hour5 -1.201221 0.047605 -25.233 < 2e-16 ***
## hour6 -0.473736 0.047397 -9.995 < 2e-16 ***
## hour7 0.262566 0.047642 5.511 3.68e-08 ***
## hour8 0.916556 0.047264 19.392 < 2e-16 ***
## hour9 1.107027 0.047142 23.483 < 2e-16 ***
## hour10 1.330910 0.048006 27.724 < 2e-16 ***
## hour11 1.455484 0.048045 30.294 < 2e-16 ***
## hour12 1.576585 0.047921 32.900 < 2e-16 ***
## hour13 1.597308 0.048561 32.893 < 2e-16 ***
## hour14 1.583151 0.049207 32.174 < 2e-16 ***
## hour15 1.598562 0.049361 32.385 < 2e-16 ***
## hour16 1.610095 0.049033 32.837 < 2e-16 ***
## hour17 1.700495 0.048632 34.966 < 2e-16 ***
## hour18 1.475395 0.049018 30.099 < 2e-16 ***
## hour19 1.224267 0.047807 25.609 < 2e-16 ***
## hour20 1.006365 0.047381 21.240 < 2e-16 ***
## hour21 0.835439 0.046749 17.871 < 2e-16 ***
## hour22 0.654950 0.047328 13.838 < 2e-16 ***
## hour23 0.397364 0.047289 8.403 < 2e-16 ***
## atemp 0.018833 0.004921 3.827 0.000131 ***
## year2012 0.272740 0.014159 19.263 < 2e-16 ***
## month02 0.128040 0.035041 3.654 0.000260 ***
## month03 0.546315 0.038018 14.370 < 2e-16 ***
## month04 0.792060 0.040552 19.532 < 2e-16 ***
## month05 0.860593 0.045251 19.018 < 2e-16 ***
## month06 0.619461 0.051015 12.143 < 2e-16 ***
## month07 0.612443 0.057687 10.617 < 2e-16 ***
## month08 0.543939 0.056099 9.696 < 2e-16 ***
## month09 0.645374 0.049722 12.980 < 2e-16 ***
## month10 0.807677 0.043169 18.710 < 2e-16 ***
## month11 0.751861 0.038267 19.648 < 2e-16 ***
## month12 0.506644 0.037343 13.567 < 2e-16 ***
## weather2 -0.072939 0.017099 -4.266 2.02e-05 ***
## weather3 -0.490015 0.030671 -15.977 < 2e-16 ***
## weather4 -0.133736 0.599215 -0.223 0.823397
## dayMonday -0.077915 0.026102 -2.985 0.002844 **
## daySaturday 0.566937 0.025705 22.056 < 2e-16 ***
## daySunday 0.472445 0.025603 18.452 < 2e-16 ***
## dayThursday -0.257300 0.025838 -9.958 < 2e-16 ***
## dayTuesday -0.283599 0.026078 -10.875 < 2e-16 ***
## dayWednesday -0.301828 0.026062 -11.581 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##               edf Ref.df F p-value
## s(humidity) 8.507 8.896 32.837 < 2e-16 ***
## s(windspeed) 4.551 5.499 9.437 1.74e-09 ***
## s(temp) 8.462 8.895 24.825 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## R-sq.(adj) = 0.84 Deviance explained = 84.1%
## GCV = 0.35913 Scale est. = 0.35595 n = 7620

pred_gam1_1 <- as.numeric(predict(train_gam1_1, newdata=test_data))
pred_gam1_1 <- exp(pred_gam1_1)
pred_gam1_2 <- as.numeric(predict(train_gam1_2, newdata=test_data))
pred_gam1_2 <- exp(pred_gam1_2)
array_test1 <- array(test_data$registered, dim=c(3266,1))
array_test2 <- array(test_data$casual, dim=c(3266,1))
array_test <- array(test_data$count, dim=c(3266,1))
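The GAMs are fitted on `sign(y) * log(1 + abs(y))`, while the listing maps predictions back to the count scale with `exp()` alone. For non-negative counts the exact inverse of that transform is `exp(z) - 1`, so `exp(z)` comes out one unit above it. The sketch below illustrates the difference on made-up values; `signed_log` and `signed_log_inverse` are hypothetical helper names that are not part of the original code, and the error figures reported in this listing were produced with the code as written.

```r
# Hypothetical helpers (not part of the original listing).

# Forward transform used as the GAM response.
signed_log <- function(y) sign(y) * log1p(abs(y))

# Exact inverse of the forward transform.
signed_log_inverse <- function(z) sign(z) * expm1(abs(z))

# Made-up counts for illustration only.
counts <- c(0, 1, 35, 106)
z <- signed_log(counts)

round_trip <- signed_log_inverse(z)  # recovers the original counts
exp_only   <- exp(z)                 # what exp() alone returns: counts + 1
```

The offset is a constant one bike per prediction, so it matters most for hours with very low counts.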

MSE_gam1_1 <- sqrt(colSums((pred_gam1_1 - array_test1)^2)/length(array_test1))
MSE_gam1_1

## [1] 80.41881

MSE_gam1_2 <- sqrt(colSums((pred_gam1_2 - array_test2)^2)/length(array_test2))
MSE_gam1_2

## [1] 23.19053

total_MSE_gam1 <- sqrt(colSums((pred_gam1_1+pred_gam1_2 - array_test)^2)/length(array_test))
total_MSE_gam1

## [1] 90.37608

RMSLE_gam1_1 <- sqrt(colSums((log((pred_gam1_1+1)) - log(array_test1+1))^2)/length(array_test1))
RMSLE_gam1_1

## [1] 0.5876982

RMSLE_gam1_2 <- sqrt(colSums((log((pred_gam1_2+1)) - log(array_test2+1))^2)/length(array_test2))
RMSLE_gam1_2

## [1] 0.6250519

total_RMSLE_gam1 <- sqrt(colSums((log((pred_gam1_1+pred_gam1_2+1)) - log(array_test+1))^2)/length(array_test))
total_RMSLE_gam1

## [1] 0.567598

plot(test_data$count,pred_gam1_1+pred_gam1_2)


a3 <- cbind(actual=test_data$count, predit=pred_gam1_1 + pred_gam1_2)
b3 <- as.data.frame(a3)
colnames(b3) <- c("actual","predit")
ggplot(data=b3, aes(x=actual, y=predit)) + theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5) + geom_point(color='red') +
  geom_abline(color='blue', size=.8) + xlab("actual") + ylab("gam2_prediction")


#[1] 0.5776358

#GAM with full interaction terms

train_gam3_1 <- gam(sign(registered)*log(1+abs(registered)) ~ hour+day+month+year+atemp+weather+
  s(humidity)+s(windspeed)+s(temp)+hour:year+hour:day+hour:month+day:year+month:year+
  hour:atemp+hour:humidity+hour:windspeed+hour:temp+
  day:atemp+day:humidity+day:windspeed+day:temp+
  month:atemp+month:humidity+month:windspeed+month:temp+
  year:atemp+year:humidity+year:temp+
  atemp:weather+atemp:humidity+atemp:windspeed+atemp:temp+
  weather:humidity+weather:temp+weather:windspeed+
  humidity:windspeed+humidity:temp+windspeed:temp, data=train_data)
train_gam3_2 <- gam(sign(casual)*log(1+abs(casual)) ~ hour+day+month+year+atemp+weather+
  s(humidity)+s(windspeed)+s(temp)+hour:year+hour:day+hour:month+day:year+month:year+
  hour:atemp+hour:humidity+hour:windspeed+hour:temp+
  day:atemp+day:humidity+day:windspeed+day:temp+
  month:atemp+month:humidity+month:windspeed+month:temp+
  year:atemp+year:humidity+year:temp+
  atemp:weather+atemp:humidity+atemp:windspeed+atemp:temp+
  weather:humidity+weather:temp+weather:windspeed+
  humidity:windspeed+humidity:temp+windspeed:temp, data=train_data)
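A term such as `hour:day` between two factors expands into one coefficient for every combination of non-reference levels, which is why the interaction models carry several hundred parametric coefficients (the summaries report a rank of 681/688). As a rough sketch, base R's `model.matrix` shows the expansion on a made-up grid of factor levels. The data frame `grid` is hypothetical; only the level counts (24 hours, 7 days) mirror the report.

```r
# Hypothetical grid with one row per hour/day combination (illustration only).
grid <- expand.grid(
  hour = factor(0:23),
  day  = factor(c("Friday", "Monday", "Saturday", "Sunday",
                  "Thursday", "Tuesday", "Wednesday"))
)

# Design matrix for the two main effects plus their interaction.
X <- model.matrix(~ hour + day + hour:day, data = grid)

# Intercept + 23 hour dummies + 6 day dummies.
n_main <- 1 + 23 + 6
# One column per pair of non-reference levels.
n_interaction <- sum(grepl(":", colnames(X), fixed = TRUE))

total_columns <- ncol(X)
```

With treatment contrasts, the single `hour:day` term alone contributes 23 x 6 = 138 columns, named in the same `hour1:dayMonday` style that appears in the model summaries.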

summary(train_gam3_1)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## sign(registered) * log(1 + abs(registered)) ~ hour + day + month +
##     year + atemp + weather + s(humidity) + s(windspeed) + s(temp) +
##     hour:year + hour:day + hour:month + day:year + month:year +
##     hour:atemp + hour:humidity + hour:windspeed + hour:temp +
##     day:atemp + day:humidity + day:windspeed + day:temp + month:atemp +
##     month:humidity + month:windspeed + month:temp + year:atemp +
##     year:humidity + year:temp + atemp:weather + atemp:humidity +
##     atemp:windspeed + atemp:temp + weather:humidity + weather:temp +
##     weather:windspeed + humidity:windspeed + humidity:temp +
##     windspeed:temp
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) -1.725e-03 3.258e-02 -0.053 0.957773
## hour1 -6.969e-01 1.669e-01 -4.175 3.02e-05 ***
## hour2 -1.176e+00 1.740e-01 -6.760 1.49e-11 ***
## hour3 -1.888e+00 1.767e-01 -10.682 < 2e-16 ***
## hour4 -1.731e+00 1.741e-01 -9.942 < 2e-16 ***
## hour5 -8.860e-01 1.732e-01 -5.116 3.20e-07 ***
## hour6 4.362e-01 1.852e-01 2.355 0.018554 *
## hour7 1.590e+00 1.666e-01 9.542 < 2e-16 ***
## hour8 2.357e+00 1.649e-01 14.287 < 2e-16 ***
## hour9 1.969e+00 1.672e-01 11.778 < 2e-16 ***
## hour10 1.335e+00 1.682e-01 7.934 2.45e-15 ***
## hour11 1.319e+00 1.712e-01 7.703 1.51e-14 ***
## hour12 1.647e+00 1.655e-01 9.953 < 2e-16 ***
## hour13 1.604e+00 1.737e-01 9.233 < 2e-16 ***
## hour14 1.560e+00 1.711e-01 9.120 < 2e-16 ***
## hour15 1.817e+00 1.736e-01 10.467 < 2e-16 ***
## hour16 2.060e+00 1.765e-01 11.670 < 2e-16 ***
## hour17 2.600e+00 1.706e-01 15.241 < 2e-16 ***
## hour18 2.344e+00 1.691e-01 13.858 < 2e-16 ***
## hour19 1.966e+00 1.647e-01 11.941 < 2e-16 ***
## hour20 1.478e+00 1.670e-01 8.850 < 2e-16 ***
## hour21 9.785e-01 1.628e-01 6.012 1.93e-09 ***
## hour22 1.022e+00 1.686e-01 6.059 1.44e-09 ***
## hour23 6.885e-01 1.665e-01 4.136 3.58e-05 ***
## dayMonday -6.559e-01 1.015e-01 -6.460 1.12e-10 ***
## daySaturday 4.572e-01 9.927e-02 4.606 4.18e-06 ***
## daySunday 5.436e-01 1.005e-01 5.407 6.63e-08 ***
## dayThursday -3.079e-01 1.012e-01 -3.042 0.002356 **
## dayTuesday -6.673e-01 1.008e-01 -6.622 3.81e-11 ***
## dayWednesday -2.438e-01 1.035e-01 -2.354 0.018579 *
## month02 -1.861e-01 1.350e-01 -1.378 0.168270
## month03 -2.948e-02 1.386e-01 -0.213 0.831610
## month04 5.174e-02 1.710e-01 0.303 0.762217
## month05 3.417e-01 2.335e-01 1.464 0.143336
## month06 1.418e-02 3.074e-01 0.046 0.963225
## month07 3.641e-01 4.302e-01 0.846 0.397304
## month08 -6.178e-01 4.210e-01 -1.467 0.142301
## month09 5.703e-01 2.762e-01 2.064 0.039011 *
## month10 8.871e-01 2.165e-01 4.097 4.24e-05 ***
## month11 6.444e-01 1.541e-01 4.183 2.91e-05 ***
## month12 7.671e-01 1.415e-01 5.421 6.12e-08 ***
## year2012 5.961e-01 5.883e-02 10.133 < 2e-16 ***
## atemp 2.637e-02 3.188e-02 0.827 0.408184
## weather2 1.243e-01 5.396e-02 2.304 0.021258 *
## weather3 5.595e-02 1.251e-01 0.447 0.654632
## weather4 -4.457e-05 4.103e-05 -1.086 0.277320
## hour1:year2012 -1.792e-02 4.918e-02 -0.364 0.715512
## hour2:year2012 -1.536e-01 4.878e-02 -3.150 0.001642 **
## hour3:year2012 -1.001e-01 4.976e-02 -2.012 0.044250 *
## hour4:year2012 -1.296e-01 4.910e-02 -2.639 0.008344 **
## hour5:year2012 1.365e-01 4.905e-02 2.783 0.005398 **
## hour6:year2012 9.814e-02 4.865e-02 2.017 0.043704 *
## hour7:year2012 1.620e-01 4.878e-02 3.321 0.000902 ***
## hour8:year2012 1.270e-01 4.833e-02 2.627 0.008638 **
## hour9:year2012 8.649e-02 4.811e-02 1.798 0.072253 .
## hour10:year2012 5.670e-02 4.879e-02 1.162 0.245224
## hour11:year2012 5.466e-02 4.896e-02 1.117 0.264246
## hour12:year2012 3.707e-02 4.876e-02 0.760 0.447032
## hour13:year2012 3.365e-02 4.939e-02 0.681 0.495650
## hour14:year2012 3.105e-02 5.036e-02 0.617 0.537528
## hour15:year2012 5.989e-02 5.021e-02 1.193 0.233071
## hour16:year2012 3.249e-02 5.009e-02 0.649 0.516521
## hour17:year2012 -2.106e-02 4.955e-02 -0.425 0.670845
## hour18:year2012 3.058e-02 5.030e-02 0.608 0.543235
## hour19:year2012 3.431e-02 4.927e-02 0.696 0.486298
## hour20:year2012 1.411e-02 4.856e-02 0.291 0.771399
## hour21:year2012 1.525e-02 4.788e-02 0.319 0.750078
## hour22:year2012 -6.677e-03 4.848e-02 -0.138 0.890462
## hour23:year2012 -1.208e-02 4.857e-02 -0.249 0.803615
## hour1:dayMonday -1.805e-02 9.156e-02 -0.197 0.843699
## hour2:dayMonday 9.602e-02 9.185e-02 1.045 0.295829
## hour3:dayMonday 3.932e-01 9.223e-02 4.263 2.04e-05 ***
## hour4:dayMonday 5.861e-01 8.871e-02 6.607 4.22e-11 ***
## hour5:dayMonday 3.575e-01 8.840e-02 4.044 5.32e-05 ***
## hour6:dayMonday 4.358e-01 8.746e-02 4.983 6.42e-07 ***
## hour7:dayMonday 4.893e-01 8.778e-02 5.574 2.58e-08 ***
## hour8:dayMonday 3.622e-01 9.036e-02 4.008 6.18e-05 ***
## hour9:dayMonday 3.692e-01 9.185e-02 4.019 5.90e-05 ***
## hour10:dayMonday 3.664e-01 8.989e-02 4.076 4.64e-05 ***
……

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##               edf Ref.df F p-value
## s(humidity) 8.483 8.583 38.678 < 2e-16 ***
## s(windspeed) 3.923 4.882 2.947 0.0119 *
## s(temp) 4.719 6.042 5.154 2.78e-05 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 681/688
## R-sq.(adj) = 0.955 Deviance explained = 95.9%
## GCV = 0.095172 Scale est. = 0.08677 n = 7620

summary(train_gam3_2)

##
## Family: gaussian
## Link function: identity
##
## Formula:
## sign(casual) * log(1 + abs(casual)) ~ hour + day + month + year +
##     atemp + weather + s(humidity) + s(windspeed) + s(temp) +
##     hour:year + hour:day + hour:month + day:year + month:year +
##     hour:atemp + hour:humidity + hour:windspeed + hour:temp +
##     day:atemp + day:humidity + day:windspeed + day:temp + month:atemp +
##     month:humidity + month:windspeed + month:temp + year:atemp +
##     year:humidity + year:temp + atemp:weather + atemp:humidity +
##     atemp:windspeed + atemp:temp + weather:humidity + weather:temp +
##     weather:windspeed + humidity:windspeed + humidity:temp +
##     windspeed:temp
##
## Parametric coefficients:
##               Estimate Std. Error t value Pr(>|t|)
## (Intercept) 1.076e-02 6.093e-02 0.177 0.859822
## hour1 -3.055e-01 2.711e-01 -1.127 0.259885
## hour2 -4.000e-01 2.827e-01 -1.415 0.157123
## hour3 -6.495e-01 2.870e-01 -2.263 0.023680 *
## hour4 -1.259e+00 2.829e-01 -4.452 8.66e-06 ***
## hour5 -5.125e-01 2.814e-01 -1.821 0.068603 .
## hour6 -9.082e-01 3.010e-01 -3.018 0.002556 **
## hour7 -1.810e-01 2.708e-01 -0.668 0.504022
## hour8 7.401e-01 2.680e-01 2.761 0.005775 **
## hour9 9.479e-01 2.718e-01 3.488 0.000489 ***
## hour10 8.515e-01 2.734e-01 3.114 0.001850 **
## hour11 1.692e+00 2.782e-01 6.081 1.26e-09 ***
## hour12 1.522e+00 2.689e-01 5.660 1.57e-08 ***
## hour13 1.543e+00 2.824e-01 5.465 4.78e-08 ***
## hour14 1.697e+00 2.780e-01 6.104 1.09e-09 ***
## hour15 1.901e+00 2.819e-01 6.744 1.66e-11 ***
## hour16 1.585e+00 2.867e-01 5.528 3.36e-08 ***
## hour17 1.473e+00 2.771e-01 5.316 1.09e-07 ***
## hour18 1.232e+00 2.747e-01 4.485 7.40e-06 ***
## hour19 6.988e-01 2.675e-01 2.612 0.009015 **
## hour20 1.801e-01 2.712e-01 0.664 0.506689
## hour21 2.446e-01 2.644e-01 0.925 0.354996
## hour22 4.356e-01 2.739e-01 1.591 0.111755
## hour23 1.428e-01 2.704e-01 0.528 0.597375
## dayMonday -3.239e-01 1.650e-01 -1.963 0.049672 *
## daySaturday 1.775e-01 1.612e-01 1.101 0.271105
## daySunday 2.737e-01 1.632e-01 1.677 0.093624 .
## dayThursday -5.330e-01 1.644e-01 -3.242 0.001191 **
## dayTuesday -5.244e-01 1.637e-01 -3.203 0.001368 **
## dayWednesday -7.278e-01 1.687e-01 -4.315 1.62e-05 ***
## month02 -2.609e-01 2.193e-01 -1.190 0.234222
## month03 -1.750e-02 2.257e-01 -0.078 0.938205
## month04 3.534e-01 2.781e-01 1.271 0.203929
## month05 5.159e-01 3.812e-01 1.353 0.175992
## month06 -4.844e-01 5.033e-01 -0.962 0.335852
## month07 3.562e-02 7.079e-01 0.050 0.959867
## month08 -1.916e+00 6.918e-01 -2.770 0.005618 **
## month09 -3.206e-01 4.519e-01 -0.709 0.478087
## month10 4.811e-01 3.533e-01 1.362 0.173351
## month11 -1.866e-01 2.507e-01 -0.744 0.456834
## month12 1.123e-01 2.304e-01 0.487 0.626017
## year2012 4.011e-01 9.592e-02 4.182 2.93e-05 ***
## atemp -2.855e-02 5.421e-02 -0.527 0.598421
## weather2 1.246e-01 8.765e-02 1.422 0.155137
## weather3 4.192e-02 2.017e-01 0.208 0.835379
## weather4 6.461e-06 6.664e-05 0.097 0.922762
## hour1:year2012 8.101e-02 7.988e-02 1.014 0.310540
## hour2:year2012 -9.629e-02 7.924e-02 -1.215 0.224321
## hour3:year2012 -1.198e-01 8.083e-02 -1.482 0.138398
## hour4:year2012 -6.383e-02 7.976e-02 -0.800 0.423594
## hour5:year2012 -1.285e-02 7.967e-02 -0.161 0.871844
## hour6:year2012 4.538e-02 7.903e-02 0.574 0.565839
## hour7:year2012 1.725e-01 7.923e-02 2.178 0.029464 *
## hour8:year2012 1.547e-01 7.851e-02 1.971 0.048796 *
## hour9:year2012 3.674e-01 7.817e-02 4.701 2.64e-06 ***
## hour10:year2012 4.087e-01 7.927e-02 5.155 2.60e-07 ***
## hour11:year2012 3.609e-01 7.954e-02 4.538 5.78e-06 ***
## hour12:year2012 4.048e-01 7.920e-02 5.111 3.29e-07 ***
## hour13:year2012 3.900e-01 8.023e-02 4.860 1.20e-06 ***
## hour14:year2012 3.914e-01 8.182e-02 4.784 1.76e-06 ***
## hour15:year2012 3.680e-01 8.158e-02 4.511 6.56e-06 ***
## hour16:year2012 3.845e-01 8.135e-02 4.727 2.33e-06 ***
……

## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Approximate significance of smooth terms:
##               edf Ref.df F p-value
## s(humidity) 8.176 8.513 28.022 < 2e-16 ***
## s(windspeed) 3.199 4.114 3.728 0.00481 **
## s(temp) 7.628 8.291 7.085 9.12e-10 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Rank: 681/688
## R-sq.(adj) = 0.897 Deviance explained = 90.6%
## GCV = 0.25112 Scale est. = 0.22889 n = 7620

pred_gam3_1 <- as.numeric(predict(train_gam3_1, newdata=test_data))
pred_gam3_1 <- exp(pred_gam3_1)
pred_gam3_2 <- as.numeric(predict(train_gam3_2, newdata=test_data))
pred_gam3_2 <- exp(pred_gam3_2)

RMSLE_gam3_1 <- sqrt(colSums((log((pred_gam3_1+1)) - log(array_test1+1))^2)/length(array_test1))
RMSLE_gam3_1

## [1] 0.3201406

RMSLE_gam3_2 <- sqrt(colSums((log((pred_gam3_2+1)) - log(array_test2+1))^2)/length(array_test2))
RMSLE_gam3_2

## [1] 0.5442654

total_RMSLE_gam3 <- sqrt(colSums((log((pred_gam3_1+pred_gam3_2+1)) - log(array_test+1))^2)/length(array_test))
total_RMSLE_gam3

## [1] 0.3300168


MSE_gam3_1 <- sqrt(colSums((pred_gam3_1 - array_test1)^2)/length(array_test1))
MSE_gam3_1

## [1] 39.76552

MSE_gam3_2 <- sqrt(colSums((pred_gam3_2 - array_test2)^2)/length(array_test2))
MSE_gam3_2

## [1] 18.02033

total_MSE_gam3 <- sqrt(colSums((pred_gam3_1+pred_gam3_2 - array_test)^2)/length(array_test))
total_MSE_gam3

## [1] 48.31181

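# `residual` is not defined in this listing; presumably it holds the test-set
# residuals of the combined prediction (pred_gam3_1 + pred_gam3_2 - test_data$count).
sigma <- sd(residual)
sd(residual)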

## [1] 48.30106

plot(test_data$count)

plot(pred_gam3_1+pred_gam3_2)


plot(test_data$count,pred_gam3_1+pred_gam3_2)


a4 <- cbind(actual=test_data$count, predit=pred_gam3_1 + pred_gam3_2)
b4 <- as.data.frame(a4)
colnames(b4) <- c("actual","predit")
ggplot(data=b4, aes(x=actual, y=predit)) + theme_light(base_size=20) +
  geom_jitter(width = 0.5, height = 0.5) + geom_point(color='red') +
  geom_abline(color='blue', size=.8) + xlab("actual") + ylab("gam2_int_prediction")

6. Model Comparison and Results

After applying the two approaches, we obtained results for the various models, which are summarized in the table below:

MODELS                 RMSE       RMSLE
GAM w/o interaction    93.09239   0.5776358
R Tree                 92.52884   0.8302603
GAM 2                  90.37608   0.567598
R Tree 2               84.30949   0.9295162
C Tree                 65.18647   0.4516775
C Tree 2               56.18037   0.4139103
Random Forest          56.12813   0.6703453
Random Forest2         50.50987   0.6716625
GAM_interaction        48.92687   0.3215348
GAM_interaction 2      48.31181   0.3300168

Table 6.1 Models Goodness of Fit Comparison

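From the table above, it is clear that the GAM with interaction terms is the best model. Using approach 2, which predicts the “registered” and “casual” counts separately and adds the two predictions to obtain the final “count”, it has the lowest RMSE (48.31181). The same model without the split (GAM_interaction in the table) has a slightly lower RMSLE (0.3215348 versus 0.3300168).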
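The two metrics do not rank the models identically, so it can help to sort the table by each one. The sketch below simply re-enters the values from Table 6.1 into a data frame; the object names (`results`, `best_by_rmse`, `best_by_rmsle`) are hypothetical and not part of the original code.

```r
# Values copied from Table 6.1; object names are illustrative only.
results <- data.frame(
  model = c("GAM w/o interaction", "R Tree", "GAM 2", "R Tree 2",
            "C Tree", "C Tree 2", "Random Forest", "Random Forest2",
            "GAM_interaction", "GAM_interaction 2"),
  rmse  = c(93.09239, 92.52884, 90.37608, 84.30949,
            65.18647, 56.18037, 56.12813, 50.50987,
            48.92687, 48.31181),
  rmsle = c(0.5776358, 0.8302603, 0.567598, 0.9295162,
            0.4516775, 0.4139103, 0.6703453, 0.6716625,
            0.3215348, 0.3300168),
  stringsAsFactors = FALSE
)

# Best model under each metric (lower is better for both).
best_by_rmse  <- results$model[which.min(results$rmse)]
best_by_rmsle <- results$model[which.min(results$rmsle)]

# Full ranking by RMSLE.
ranked_by_rmsle <- results[order(results$rmsle), ]
```

Sorted by RMSLE, the conditional inference trees move ahead of the random forests, even though the random forests have the lower RMSE.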

7. Conclusion and Lessons Learned

a. Conclusion

We investigated which factors influence bike usage at Capital Bikeshare. Specifically, we aimed to predict the count of bikes rented by users, given independent variables such as date and time, season, working day, temperature and wind speed, using 10,886 Capital Bikeshare system records from 2011-2012.

We found that the predominant factors were date and time. Tuesday, Wednesday and Thursday have the highest usage during the 8:00 and 17:00 peak hours (450 to 480 bikes, compared with an average of 188 across all hours); Saturday and Sunday peak at a different time, from 13:00 to 15:00 (about 400 bikes). Bike usage is higher in fall than in other seasons and lowest in spring (about 35% lower than in fall). In general, bike usage is higher on warmer days.

We used a decision tree, a conditional inference tree, a random forest, and linear models (GAM and GAM with interaction terms) to predict the count of bikes rented by users, using a random 70% sample of the data as the training set and the remaining 30% as the test set. The GAM with interaction terms turned out to be the best model according to the RMSE and RMSLE analysis. Predictions fall within 20% of the actual count 63% of the time. While this accuracy seems low, our RMSLE indicates that the model would rank among the top Kaggle results.

Below are a couple of example inputs to our best model (GAM with interaction terms):

datetime        season  holiday  workingday  weather  temp   atemp   humidity  windspeed  time      month  year  day       hour
1/1/2011 14:00  1       0        0           2        18.86  22.725  72        19.0012    14:00:00  1      2011  Saturday  14
1/1/2011 17:00  1       0        0           2        18.04  21.97   82        19.0012    17:00:00  1      2011  Saturday  17
1/1/2011 18:00  1       0        0           3        17.22  21.21   88        16.9979    18:00:00  1      2011  Saturday  18
1/1/2011 22:00  1       0        0           2        16.4   20.455  94        15.0013    22:00:00  1      2011  Saturday  22
1/2/2011 12:00  1       0        0           2        14.76  16.665  66        19.9995    12:00:00  1      2011  Sunday    12
1/3/2011 12:00  1       0        1           1        9.02   10.605  35        19.9995    12:00:00  1      2011  Monday    12
1/3/2011 13:00  1       0        1           1        9.84   10.605  35        19.0012    13:00:00  1      2011  Monday    13
1/3/2011 19:00  1       0        1           1        8.2    12.88   47        0          19:00:00  1      2011  Monday    19
1/4/2011 19:00  1       0        1           1        9.84   12.88   48        7.0015     19:00:00  1      2011  Tuesday   19


And below are the corresponding actual values and our predictions:

registered  registered_prediction  casual  casual_prediction  count  count_prediction
71          76.76646597            35      23.19664166        106    99.96310763
52          54.6122054             15      16.95911248        67     71.57131788
26          28.38555553            9       6.562850847        35     34.94840638
17          22.81613642            11      4.633866008        28     27.45000243
73          75.83201138            20      19.24886201        93     95.08087339
48          54.22904613            13      7.037887183        61     61.26693332
53          51.44653473            8       6.98649298         61     58.43302771
102         105.9669525            8       3.616475508        110    109.583428
110         115.4458928            2       4.629932759        112    120.0758256

From the example output we can see that our total count predictions are close to the actual total counts of rented bikes, and that our predictions for registered users are probably more accurate than those for casual users.
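The comparison between registered and casual users can be checked on the nine example rows listed above. The sketch below is only an illustration: it assumes that "within 20%" means an absolute error of at most 20% of the actual value, and the names `examples` and `rel_err` are hypothetical. Nine rows say nothing about the 63% figure, which refers to the whole test set.

```r
# The nine example rows from the report (actual values and predictions).
examples <- data.frame(
  registered = c(71, 52, 26, 17, 73, 48, 53, 102, 110),
  registered_prediction = c(76.76646597, 54.6122054, 28.38555553,
                            22.81613642, 75.83201138, 54.22904613,
                            51.44653473, 105.9669525, 115.4458928),
  casual = c(35, 15, 9, 11, 20, 13, 8, 8, 2),
  casual_prediction = c(23.19664166, 16.95911248, 6.562850847,
                        4.633866008, 19.24886201, 7.037887183,
                        6.98649298, 3.616475508, 4.629932759),
  count = c(106, 67, 35, 28, 93, 61, 61, 110, 112),
  count_prediction = c(99.96310763, 71.57131788, 34.94840638,
                       27.45000243, 95.08087339, 61.26693332,
                       58.43302771, 109.583428, 120.0758256)
)

# Assumed definition of relative error: |predicted - actual| / actual.
rel_err <- function(actual, predicted) abs(predicted - actual) / actual

# Share of example rows whose total count prediction is within 20%.
share_within_20 <- mean(rel_err(examples$count, examples$count_prediction) <= 0.20)

# Mean relative error per user type.
mean_rel_registered <- mean(rel_err(examples$registered, examples$registered_prediction))
mean_rel_casual     <- mean(rel_err(examples$casual, examples$casual_prediction))
```

On these rows every total count prediction lands within 20% of the actual value, and the mean relative error is noticeably smaller for registered users than for casual users, partly because casual counts are small and a miss of a few bikes is a large fraction of them.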

In addition, our model predicted that 2012-09-11 17:00:00, with a temperature of 28.70 °C and a wind speed of 0.00, had the highest number of riders, followed by 2012-09-10 18:00:00 and 2012-06-14 08:00:00. The predicted results are consistent with our data visualization results.

To summarize, in this particular project the GAM with interaction terms (from the linear regression family) appears to be the most appropriate model. This makes sense, because bike usage is highly correlated with independent variables such as hour, day and weather conditions.

b. Lessons Learned

1) The simplest model is often the best;
2) Feature selection is an essential part of building models;
3) Always try different goodness-of-fit measures to test the models.


i https://www.capitalbikeshare.com/home
ii https://www.kaggle.com/c/bike-sharing-demand
iii Fanaee-T, Hadi, and Gama, Joao, 'Event labeling combining ensemble detectors and background knowledge', Progress in Artificial Intelligence (2013): pp. 1-15, Springer Berlin Heidelberg.