Multivariate Data Analysis Project


Upload: prestonmay21

Post on 28-Dec-2015






Used data provided by Lending Club to create a statistical classification model that labels debtors as "good loans" or "bad loans". Our model ended up with a higher return than the Lending Club average return rate.


Page 1: Multivariate Data Analysis Project

Preston May

Yue Shen Gu

Edgardo Estrabo

Professor Duncan Temple Lang

Lending Club: Who Should We Lend To?

In recent years, the US economy has struggled, to say the least, making it difficult for people to make safe and profitable investments. The stock market, although trending upward in recent months, has been unpredictable overall; banks have been stingy with interest rates; and bonds are expected to lose value over time. So where should people invest their money so that it is safe and produces a respectable return?

In addition to the question of where to put our money, the recent economic struggles have also put people in debt and kept them there. Private debt ratios have been at all-time highs over the past five years (Chart 1), and with rising interest rates it has become extremely difficult for these debtors to pay off their debts completely. So, once someone is in debt, how are they expected to get out of it in a reasonable time frame?

CHART 1 – From income-debt-servicing-cost-ratios.html. Debt-to-income ratios refer to private debt divided by household income, which peaked in 2008 and has almost doubled since 1982.

Lending Club provides a solution to both of the problems presented above. Lending Club is a company that makes it possible for private investors to lend money to debtors at interest rates higher than banks would pay out to investors, but lower than banks or credit cards would charge debtors. A debtor requests a loan, providing their credit score, their monthly income, their reason for the loan (typically to pay off credit card debt; see CHART 2), and so on, and investors then decide whether they want to lend their money to that person. Lending Club reduces the risk to any individual investor by allowing each investor to lend as little as $25 to any given debtor, so that investors can build a diverse portfolio. This way, if a debtor defaults on their loan, the loss is split among several investors, not just one. But having a debtor default on a loan you made would still set back your portfolio, so how should you decide whom to invest in?

CHART 2 – From Lending Club Statistics. Over 75% of loans taken out on Lending Club are for the purpose of paying off or consolidating preexisting debts.

Lending Club gives every potential borrower a grade from A1 through G5, A1 being the safest and G5 being the riskiest. However, although A1 is the safest, it pays the lowest interest rate, 6.03%, while G5 has an interest rate of over 25%, so there is an incentive to invest in the riskier loans. Looking at CHART 3, the proportion of bad loans for each grade is as we would expect: A1 is the best, and the grades gradually get worse all the way to G5.


CHART 3 – Manually constructed in R using the data provided by Lending Club.

For code, see APPENDIX.CHART3

So it appears that whatever Lending Club uses to grade each debtor is a successful classifier, since the default rate gradually increases with each riskier grade. However, we do not want to rely on these grades when deciding which loans to invest in, because the safer ones pay a smaller interest rate. So, we want to create our own classification model to help us not only invest in safe borrowers, but also maximize the returns on our investments.

Using the data provided by Lending Club on their website, we will attempt to classify each loan as a good loan or a bad loan, a good loan being one that was fully paid and a bad loan being one that was either not paid or late. Before attempting this classification, let’s first see whether good loans and bad loans have different mean values for some key variables.

The variables provided by Lending Club that I think should differ the most between good and bad loans are as follows. Monthly income and monthly payments are both important but don’t necessarily tell us much individually, since someone could be taking out a very small loan but earn next to nothing. So, we will combine these by creating a variable that represents the debtor’s monthly income divided by their monthly payment. This value should be a rough representation of their ability to pay off their loan. Another variable of interest is the credit score. Lending Club provides a credit score range, so we will simply take the lower bound of this range for every observation. The number of years employed should also tell us something about the likelihood of repayment, because someone who has held the same job for a long time should have a steady, consistent income. The number of open credit lines someone has could speak to their ability to actually get out of debt. However, I think the reciprocal of the number of open credit lines is more meaningful, since it gives less weight to each additional credit card as the number of credit cards gets higher. One last variable that should be helpful is the debt-to-income ratio. So that we expect our good loans to have the higher value, we will subtract each debt-to-income ratio from 100.

Before we start using these variables, let’s make sure that we don’t have any anomalous observations. Some observations contained NAs in important variables such as length of employment and FICO score. To deal with these, we deleted every observation in which one of these NAs is present. Since there are not many NAs in our data set, this should not materially affect our data or our results.


After deleting these NA observations, let’s take a look at the densities of each of the variables that we will be using in our difference of means test (see CHART 4). The first five graphs are the densities of our variables, and they all seem to be distributed in a way that makes sense for our data except payment ratio. Payment ratio has extreme values far to the right of the mean, so in the sixth graph we plotted it again with most of these extreme values excluded. After investigating why some of our payment ratios are so high, it seems that these extreme values are due to people taking out a loan of only $1000. So, the fact that there are extreme values for this variable doesn’t necessarily mean that they are incorrect. In fact, Lending Club checks each client to make sure they are telling the truth about themselves, so we should be able to trust that our observations are accurate.


Once we have created all of the above variables and checked their legitimacy, we will subset our data into two separate data frames: one containing all good loans and one containing all bad loans. The code for all of this can be found in Appendix – Code1. Using these subsets, we can run a two-sample Hotelling’s T² test of the difference in means, with a null hypothesis that the true mean vectors are equal (see Appendix – Test1). The means of each variable for both of our subsets are as follows:

means_bad
[1]  85.8060328  21.0530092 701.1652447   4.9735070   0.1413222
means_good
[1]  87.4459685  26.0582156 717.4825052   4.6609517   0.1366743

Even without running any tests, it is fairly clear that our subsets’ mean values are not equal. The last two, which represent years of employment and the reciprocal of the number of open credit lines respectively, are fairly close in value, but the first three, which represent 100 – debt-to-income ratio, monthly income divided by monthly payment, and credit score respectively, are noticeably further apart.

After running our Hotelling’s T² test, we end up with an F* value of 909.784, while our critical F value at the 1% significance level is only 15.09436. Since our F* value is so much greater than our critical F value, we reject our null hypothesis and conclude that the true means of our variables differ between good and bad loans. This is not surprising, since we had already seen that our mean vectors did not appear to be very close to each other in value. So, now that we have determined that the two subsets of our data differ, we can begin creating a regression model to predict the probability that any given borrower will pay back their loan in full.
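As a minimal sketch of the test described above, the following R code runs a two-sample Hotelling's T² test on simulated data rather than on the loan data. The function name hotelling_two_sample and the simulated matrices are hypothetical and only for illustration. The sketch uses the textbook (n - 1) weights for the pooled covariance, which for large samples differs only slightly from the weighting used in Appendix – Test1.

```r
# Two-sample Hotelling's T-squared test on simulated data (not the loan data).
set.seed(1)
p  <- 5
n1 <- 200
n2 <- 150
x1 <- matrix(rnorm(n1 * p, mean = 0.0), ncol = p)  # stand-in for "bad" loans
x2 <- matrix(rnorm(n2 * p, mean = 0.5), ncol = p)  # stand-in for "good" loans

# hypothetical helper: returns the statistic, the critical value, and the decision
hotelling_two_sample <- function(a, b, alpha = 0.01) {
  n_a <- nrow(a)
  n_b <- nrow(b)
  k   <- ncol(a)
  d   <- colMeans(a) - colMeans(b)
  # pooled covariance with (n - 1) weights
  s_pooled <- ((n_a - 1) * cov(a) + (n_b - 1) * cov(b)) / (n_a + n_b - 2)
  t2 <- as.numeric(t(d) %*% solve(s_pooled * (1 / n_a + 1 / n_b)) %*% d)
  # scaled F critical value, same form as in Appendix - Test1
  crit <- (k * (n_a + n_b - 2)) / (n_a + n_b - k - 1) *
    qf(1 - alpha, k, n_a + n_b - k - 1)
  list(statistic = t2, critical = crit, reject = t2 > crit)
}

res <- hotelling_two_sample(x1, x2)
print(res)
```

The null hypothesis of equal mean vectors is rejected whenever the statistic exceeds the critical value, which is the same comparison made between F* and the critical F above.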

Logistic Regression: For code, see Appendix – Logistic Regression

Logistic regression is a technique for fitting a regression surface to data in which the dependent variable has two outcomes. In this case, the dependent variable indicates whether a loan is good (the borrower pays back the loan fully) or bad (the borrower does not pay back the loan, or pays it back late). The model describes the relationship between this binary response variable and a set of explanatory variables, most of which are continuous but some of which are categorical. In the end, not all of the predictors will be included in the final model, for reasons that will be discussed later.

To create the initial model, the observations with an already determined status (labeled 0 for a bad loan and 1 for a good loan) were divided into a training set containing roughly 70% of these observations and a testing set containing the rest. The training set is used to fit the logistic model, while the test set is used to test the accuracy of this model. 12 numeric variables and 2 categorical variables (Purpose of Loan and Home Ownership) were entered as predictors. Variable selection was performed primarily with the stepAIC function from R’s MASS package. This function selects a subset of predictors by stepwise regression guided by Akaike’s Information Criterion (AIC); in general, of two models fit to the same data, the one with the smaller AIC value is estimated to lose less information. Running the initial model, we obtain the coefficients for the predictors below:

Coefficients in a logistic regression model indicate the expected change in the log odds of the response variable per one-unit increase in the corresponding predictor. Note that the odds are defined as the probability of success divided by the probability of failure. In this case, odds = P(Good Loan)/P(Bad Loan) or, equivalently, P(Good Loan)/[1 – P(Good Loan)]. We then simply take the natural log of the odds to get the log odds. This transformation gives the model convenient mathematical properties. For instance, unlike a probability, the log odds are not bounded between the values 0 and 1. To actually calculate the coefficients, maximum likelihood estimation (MLE) is used. Basically, MLE chooses the coefficient values that maximize the log likelihood of the observed outcomes given the explanatory variables.
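The fitting procedure described above can be sketched in R as follows. This is only an illustration on simulated data: the data frame d, its column names, and the true coefficients are all made up. It assumes that stepAIC is called on the full model with its default settings, and it falls back to base R's step() if the MASS package is not installed.

```r
# odds and log odds for a probability of 0.75
p_example <- 0.75
odds      <- p_example / (1 - p_example)   # 3
log_odds  <- qlogis(p_example)             # log(3); plogis() inverts it

# Simulated stand-in for the loan data; column names are hypothetical.
set.seed(42)
n <- 1000
d <- data.frame(
  payment_ratio = rexp(n, rate = 1 / 25),
  lower_score   = round(rnorm(n, mean = 710, sd = 30)),
  noise         = rnorm(n)                 # predictor unrelated to the outcome
)
eta    <- -14 + 0.02 * d$payment_ratio + 0.02 * d$lower_score  # true log odds
d$good <- rbinom(n, size = 1, prob = plogis(eta))

# roughly 70% training, 30% testing
train_idx <- sample(n, size = round(0.7 * n))
train <- d[train_idx, ]
test  <- d[-train_idx, ]

# glm() with family = binomial estimates the coefficients by maximum likelihood
full <- glm(good ~ payment_ratio + lower_score + noise,
            family = binomial, data = train)

# AIC-guided stepwise selection; stats::step is used if MASS is unavailable
selected <- if (requireNamespace("MASS", quietly = TRUE)) {
  MASS::stepAIC(full, trace = FALSE)
} else {
  step(full, trace = 0)
}

print(coef(selected))                                          # log-odds scale
p_hat <- predict(selected, newdata = test, type = "response")  # probabilities
```

The coefficients printed here are on the log-odds scale, while predict() with type = "response" returns probabilities of a good loan for the held-out observations.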

To interpret the coefficients, we first need to convert them from the log-odds scale to odds ratios. To do this, we simply exponentiate the coefficient of interest. For example, to interpret a dummy variable such as Loan Purpose: Debt Consolidation, we take the coefficient’s exponential (exp(0.3681) = 1.44) and say that, holding everything else constant, the odds of fully paying back the loan for people who borrow from Lending Club for the purpose of Debt Consolidation are 1.44 times (44% higher than) the odds for those who borrow for the baseline loan purpose. For numerical variables such as employment length, we can say that, holding everything else constant, we expect a 4% decrease (exp(-0.03714) = 0.96, and 0.96 – 1 = -0.04) in the odds that the borrower fully pays back the loan per one-year increase in employment length. Because the predictors were not standardized before entering the model, the metric of the original values (e.g., 100-Debt to Income Ratio being in decimals vs. Credit Grade (Lower Bound) being in hundreds) should be taken into consideration before assessing the importance of these predictors.
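The arithmetic behind these interpretations can be checked with a few lines of R. The two coefficient values are the ones quoted above; the variable names are ours and purely illustrative.

```r
b_dummy   <- 0.3681    # dummy-variable coefficient quoted in the text
b_numeric <- -0.03714  # numeric-predictor coefficient quoted in the text

odds_ratio_dummy   <- exp(b_dummy)      # multiplicative change in the odds
odds_ratio_numeric <- exp(b_numeric)
pct_change_numeric <- (odds_ratio_numeric - 1) * 100

print(round(c(dummy = odds_ratio_dummy,
              numeric = odds_ratio_numeric,
              pct_change = pct_change_numeric), 2))

# coefficients add on the log-odds scale, so odds ratios multiply:
# a 5-unit increase corresponds to exp(5 * b), which equals exp(b)^5
five_unit_ratio <- exp(5 * b_numeric)
```

The rounded values agree with the 1.44 and 0.96 reported above; the unrounded percentage change is somewhat smaller in magnitude than 4%.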

For easier interpretation, we can also transform the log odds given by the original model back into a probability by using the logistic function, P(y = 1) = 1 / (1 + exp(-(alpha + Beta_1*x_1 + ... + Beta_k*x_k))). Just to reiterate, P(y = 1) denotes the probability of a borrower fully paying back a loan. This probability is determined by an intercept (alpha) and the explanatory variables weighted by the Beta coefficients provided on the previous page. Using the predict() function in R, we obtain these probabilities, which are visualized by the histogram below.

A quick look at this histogram indicates that the majority of loans in the testing set have a predicted probability greater than 0.5 of being paid in full. To check the accuracy of these predictions, observations with predicted probabilities greater than .5 were labeled “Good Loans”, and observations with probabilities below .5 were labeled “Bad Loans.” These labels were then compared to the actual status of the loans. A confusion matrix showing the accuracy of this “classifier” is displayed below:

Using a .5 probability cutoff, the accuracy of this classifier is 76.12%, which is good, but not great. Adjusting the probability cutoff to .6 instead (such that Good Loans are those with predicted probabilities > .6), we obtain the following confusion matrix:

… which indicates that raising the probability cutoff may give us a less accurate classifier overall. However, it may still be the better classifier if we care more about minimizing the rate at which bad loans are predicted as good loans (and less about good loans being predicted as bad loans).
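To make the effect of the cutoff concrete, here is a small sketch in R that builds a confusion matrix at two cutoffs. The data are simulated and the helper evaluate_cutoff is a hypothetical name, so the numbers it produces have nothing to do with the ones reported above.

```r
set.seed(7)
n <- 2000
actual <- factor(ifelse(runif(n) < 0.75, "Good", "Bad"),
                 levels = c("Bad", "Good"))
# hypothetical predicted probabilities: good loans tend to score higher
p_good <- plogis(rnorm(n, mean = ifelse(actual == "Good", 1.4, 0.6), sd = 1))

# hypothetical helper: confusion matrix and two summary rates at a given cutoff
evaluate_cutoff <- function(p, actual, cutoff) {
  predicted <- factor(ifelse(p > cutoff, "Good", "Bad"),
                      levels = c("Bad", "Good"))
  cm <- table(actual = actual, predicted = predicted)
  list(
    confusion = cm,
    accuracy  = sum(diag(cm)) / sum(cm),
    # share of predicted-good loans that are actually bad
    bad_among_predicted_good = cm["Bad", "Good"] / sum(cm[, "Good"])
  )
}

at_50 <- evaluate_cutoff(p_good, actual, 0.5)
at_60 <- evaluate_cutoff(p_good, actual, 0.6)
print(at_50)
print(at_60)
```

Raising the cutoff can only shrink the set of loans labeled good, which is why overall accuracy and the share of bad loans among predicted-good loans can move in different directions.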

For parsimony’s sake, we will try to create another logistic model that uses fewer predictor variables. The first sets of predictor variables to be removed from the model are Loan Purpose (Car1, CreditCard1, etc.) and Home Ownership (with Rent being the only significant dummy variable). The reason is the unevenness of the cases across the categories of these variables, which may have made the coefficient estimates unstable (refer to CHART 2 in part I for the Loan Purpose pie chart; notice that over 75% of loans from Lending Club are for the purpose of debt consolidation). Other predictors that will be removed are Amount Funded By Investors and Public Records on File, mainly because their p-values were larger than those of the other predictors. Having excluded these predictors from the model, we obtain the coefficients for the final model. A mosaic plot visualizing this model’s accuracy is also displayed below. It can be seen that the model’s accuracy did not decrease at all after removing a considerable number of predictors.


Blue: Correct Prediction. Red: Incorrect Prediction.

Now considering this classifier’s practicality, we can infer from the mosaic plot that it misclassifies bad loans as good loans at a fairly high rate. Because of this, there is a good chance that Lending Club lenders would lend money to someone whom they thought to be a good borrower but who is, in fact, a bad borrower. Still, this classifier predicts loan status more accurately than chance, so it should still be of some use.

We will now predict whether loans whose final status is not yet known will turn out to be good loans or bad loans. To do so, we fit another logistic model, this time using the combined training and test data sets. This model is then used to predict the loan status of observations labeled “Current” or “Performing Payment Plan”. As indicated by the histogram and table below, the majority of these loans are predicted to turn out to be good loans. In the next section, we will attempt to increase the accuracy of this classifier by applying bootstrap aggregating, a.k.a. bagging, to this model.


Bootstrap Aggregation using Logistic Regression: For code, see Appendix – Bagging: Logistic Regression

Bootstrap aggregation attempts to improve the predictive performance of a statistical model by taking multiple random samples (with replacement) from a training data set. Each of these random (bootstrap) samples is then used to fit a separate model, in this case a logistic regression model, and each model produces its own predictions for the observations of a given test set. Finally, the average of these predictions is used as the final prediction.
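The steps above can be sketched in a few lines of R. This is an illustration on simulated data, not the code from our appendix: the predictors x1 and x2, the sample sizes, and the choice of 50 bootstrap models are all assumptions made for the example.

```r
set.seed(123)
n <- 600
d <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
d$good <- rbinom(n, 1, plogis(0.8 + 1.2 * d$x1 - 0.7 * d$x2))
train <- d[1:400, ]
test  <- d[401:600, ]

n_boot <- 50  # number of bootstrap models (an arbitrary choice for this sketch)
boot_preds <- sapply(seq_len(n_boot), function(i) {
  idx <- sample(nrow(train), replace = TRUE)        # bootstrap sample of rows
  fit <- glm(good ~ x1 + x2, family = binomial, data = train[idx, ])
  predict(fit, newdata = test, type = "response")   # probabilities for the test set
})

bagged_prob <- rowMeans(boot_preds)                 # average over the models
single_fit  <- glm(good ~ x1 + x2, family = binomial, data = train)
single_prob <- predict(single_fit, newdata = test, type = "response")

# share of test loans classified correctly at a .5 cutoff
accuracy <- function(p) mean((p > 0.5) == (test$good == 1))
print(c(single = accuracy(single_prob), bagged = accuracy(bagged_prob)))
```

Each column of boot_preds holds one bootstrap model's predictions, so averaging across columns gives the bagged probability for each test observation.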

To determine whether bootstrap aggregation was an appropriate choice, we simply compare the accuracy rate of the original logistic regression model with that of the bootstrap aggregated version. The accuracy rate for the final logistic model from the previous section was 76.29%. Using bootstrap aggregation, the accuracy rate stayed roughly the same at 76.27%. The lack of improvement is likely because the original model’s predictions were already stable to begin with. Bootstrap aggregation is more suitable for unstable models: the greater the variability in the predictions of the individual models, the more the final averaged predictions improve. In any case, we will simply keep the final model from the previous section and disregard this one.

See also the two plots below, which were generated using the predicted probability values given by the logistic model from the previous section. Looking back at CHART 3, we see that the default rate decreases as the letter grade gets better. This is consistent with the first chart below, which shows that the better the letter grade assigned by Lending Club, the higher the probability that the loan will be paid back in full. The next plot shows the same probability by the borrower’s state. Looking at this plot, we can see that borrowers from Iowa, Washington DC, West Virginia, and Massachusetts are the most likely to pay back their loan in full. In contrast, borrowers from Wyoming, Arkansas, Vermont, and Hawaii are the least likely to pay back their loan in full.


Now that we have created a logistic model for our data, we will attempt to use algorithms that were actually designed to act as classifiers. There are several methods that we could use. One is the k-nearest neighbors technique, which classifies each observation by comparing it to the observations most similar to it. Essentially, this method asks “which observations do you look like, and what were their classifications?” Since so many of our independent variables are categorical, it would be extremely difficult to implement this method, so we will attempt to classify our data in another way.

Two other methods that we will be running are random forests and boosting.

Random Forest: For code, see Appendix – Classification – Random Forest

Another method of classification is the random forest. A random forest is an ensemble of tree models built using bagging, with a randomly chosen, reduced set of predictors considered at each split of each tree. In other words, a random forest creates a group of trees that are very different from each other. The random forest then uses all of these trees to predict the dependent variable for new observations by letting the trees in the ensemble vote for the outcome.

This widely used model works well with datasets containing both numerical and categorical variables. The package we used is randomForest. Our goal in modeling is to minimize the ratio of the number of loans that are actually bad but predicted good to the number of all loans predicted good. We are interested in this ratio because it is the error rate we would face if we invested in every observation that was predicted to be good.

We first used the cross-validation function in randomForest (rfcv) to determine the best reduced number of predictors at each split. The output is as follows:

#> rfloanscv$
#12 6 3 1
#0.1828936 0.1847596 0.1828851 0.1850041

Looking at the result from rfcv, the best reduced number of predictors at each split is 3 because it gives the smallest cross-validated error rate of the values tried, although the margin over 12 predictors is very small (0.1828851 vs. 0.1828936).
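For readers who want to reproduce this kind of output, here is a hedged sketch of the rfcv call on simulated data. The predictors V1 to V12 and the outcome are made up, ntree = 100 is an arbitrary choice to keep the run short, and the block does nothing if the randomForest package is not installed. Our actual call is not shown in this section, so the arguments below are assumptions.

```r
set.seed(2015)
n <- 300
x <- as.data.frame(matrix(rnorm(n * 12), ncol = 12))  # 12 hypothetical predictors
y <- factor(ifelse(x$V1 + 0.5 * x$V2 + rnorm(n) > 0, "Good", "Bad"))

cv_error <- NULL
if (requireNamespace("randomForest", quietly = TRUE)) {
  # step = 0.5 halves the number of predictors at each stage: 12, 6, 3, 1
  cv <- randomForest::rfcv(trainx = x, trainy = y, cv.fold = 5,
                           step = 0.5, ntree = 100)
  cv_error <- cv$error.cv          # error rates, named by number of predictors
  print(cv_error)
  print(names(cv_error)[which.min(cv_error)])

  # a forest with the settings named in the text: 20 trees, 3 predictors per split
  rf <- randomForest::randomForest(x = x, y = y, ntree = 20, mtry = 3)
  print(rf$confusion)              # OOB confusion matrix with class.error column
}
```

According to the randomForest documentation, rfcv reports cross-validated error for models built on a sequentially reduced number of predictors, ranked by importance, so the names 12, 6, 3, 1 count the predictors in the model. The number of predictors tried at each split is set separately through the mtry argument of randomForest.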

The next tuning parameter that we tried to optimize is the number of trees in the ensemble. Using an sapply() loop, we generated the following graph, which shows the value of the ratio on the training dataset for each additional 10 trees in the ensemble. We can see that the best number of trees obtained from this graph is 20.


Using 20 trees in the ensemble and 3 predictors as the reduced number of predictors at each split, we built a random forest model. The following is the confusion matrix on the training dataset. The model predicts the training dataset nearly perfectly: the ratio value for the training dataset is 0.7%. An explanation for such a near-perfect prediction on the training dataset is that, for each observation, only about 1/3 of the 20 trees have the observation in their out-of-bag (OOB) dataset. The other 2/3 of the trees used the observation to build their models, so an overwhelming share of the trees have already seen each observation and predict it really well.

        predict
actual   Bad  Good
  Bad   2996    73
  Good     3 10220
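The "about 1/3" figure can be checked with a short simulation in base R. It assumes each tree is grown on a bootstrap sample of the same size as the training set, drawn with replacement; the training-set size of 5000 is an arbitrary stand-in.

```r
set.seed(99)
n_obs   <- 5000   # hypothetical training-set size
n_trees <- 20

# for each tree, draw a bootstrap sample and record which rows were left out
oob <- sapply(seq_len(n_trees), function(i) {
  in_bag <- sample(n_obs, replace = TRUE)
  !(seq_len(n_obs) %in% in_bag)
})

oob_share  <- mean(oob)                 # share of (row, tree) pairs that are OOB
exact_prob <- (1 - 1 / n_obs)^n_obs     # chance a given row misses one sample
print(c(simulated = oob_share, exact = exact_prob, limit = exp(-1)))
print(summary(rowSums(oob)))            # trees (out of 20) for which each row is OOB
```

The probability that a given row is left out of one bootstrap sample is (1 - 1/n)^n, which approaches exp(-1), roughly 0.37, as n grows.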

However, the predictions on the OOB dataset aren’t nearly as good as the ones for the training dataset. The ratio value for the OOB dataset is 20.5%, with a 77% error rate for observations that are actually ‘Bad’ and a 10% error rate for observations that are actually ‘Good’. The confusion matrix for the OOB dataset is as follows, with the rows representing actual values and the columns representing predicted values.

      Bad Good class.error
Bad   695 2373   0.7734681
Good 1029 9192   0.1006751

Similar to the confusion matrix of the OOB dataset, the confusion matrix based on the testing dataset gives a ratio value of 19.65%, with an 80% error rate for actual Bad loans and a 5% error rate for actual Good loans.

        predict
actual   Bad Good
  Bad    231  944
  Good   208 3859
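As a worked check, the ratio and the two class error rates can be recomputed from the test confusion matrix above. The counts are copied from that matrix; the object names are ours.

```r
# test-set confusion matrix for the random forest, copied from the text
cm <- matrix(c(231,  944,
               208, 3859),
             nrow = 2, byrow = TRUE,
             dimnames = list(actual = c("Bad", "Good"),
                             predicted = c("Bad", "Good")))

ratio      <- cm["Bad", "Good"] / sum(cm[, "Good"])  # bad loans among predicted good
error_bad  <- cm["Bad", "Good"] / sum(cm["Bad", ])   # actual bad predicted good
error_good <- cm["Good", "Bad"] / sum(cm["Good", ])  # actual good predicted bad

print(round(100 * c(ratio = ratio, error_bad = error_bad,
                    error_good = error_good), 2))
# ratio 19.65, error_bad 80.34, error_good 5.11
```

The ratio divides by the column total of predicted-good loans, while the class error rates divide by the row totals of actual Bad and actual Good loans.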

The following table shows the importance of each predictor as measured by mean decrease in Gini. Based on this table, interest rate is the most important predictor and public records on file is the least important predictor.

                           MeanDecreaseGini
Loan.Length                       160.48590
Loan.Purpose                      367.97988
Home.Ownership                    120.17518
Revolving.CREDIT.Balance          645.74894
Delinquencies..Last.2.yrs.         84.41817
Public.Records.On.File             48.61846
DTI_inv                           664.52132
payment_ratio                     679.23169
lower_score                       455.15278
length_of_employment              309.10431
inv_num_cards                     417.22794
new.Interest.Rate                 714.86348

Boosting: For code, see Appendix – Classification – Boosting

We also used a boosting approach to model the loan data. The package that we used is ada. The first thing we did was optimize the number of trees to be included in the boosting ensemble. We used an sapply() loop similar to the one for the random forest to find the best number of trees. The following plot shows that 200 trees is the best value.


Using 200 trees in the ensemble, we built a boosting model and obtained the following confusion matrix on the training dataset. The ratio value is 21%, with an error rate of 84% for Bad loans and 3% for Good loans.

           Final Prediction
True value  Bad Good
      Bad   487 2582
      Good  322 9901

The confusion matrix on the testing dataset is as follows. The ratio value is 22%, with an error rate of 98% for Bad loans and 0.3% for Good loans.

        predicted
actual   Bad Good
  Bad     24 1151
  Good    12 4055

The following plot shows the importance of each variable. Interest rate and number of credit cards are the two most important variables, whereas loan purpose is the least important variable.

It is difficult to determine exactly which model is best since there are different types of error. However, for our data, incorrectly predicting that a loan is good is much worse than predicting that a loan is bad when it is actually good. The reason is that when we predict a loan is good, we would invest in it, and if it turns out to be a bad loan, we will not get our money back. However, if we mark a good loan as a bad loan, it simply means we are not investing in someone who would have paid us back in full. We will call the less costly error a type I error and the more costly error a type II error.


First let’s look at our regression models. If we calculate just the type II error, the error for our full logistic model is 22.355% with a .5 probability cutoff, but it decreases to 19.93% when we increase the probability cutoff to .6. We would expect this error to decrease as the cutoff increases, since we are then classifying as good only those loans with a predicted probability of repayment of 60% or higher, rather than 50%. A 19.93% type II error is pretty good and would put us at the same risk level as if we had invested all of our money in borrowers with a credit grade between B2 and B3 (see Chart 3).

Looking at our classification models, random forests produced a type II error of only 19.65%, while boosting produced a 22.11% type II error. Boosting was also undesirable because it classified only 36 loans as bad out of the 5,000+ in our test data. So, out of all of our models, random forests did the best job of minimizing type II error.

So, was this classification model any more useful than simply using the credit grades given to us by Lending Club? The type II error of our random forest model was 19.65%, which means that if we had invested in all of the “good loans” from our random forest model, the default rate would have been 19.65%. This default rate is slightly higher than that of loans with a B2 grade and slightly lower than that of loans with a B3 grade, so we would be investing with approximately the same risk as investing in everyone with a B2 or B3 credit grade. However, the median credit grade of our “good loans” is B4. This means that although we would be investing with the same risk as investing in everyone with a B2 or B3 credit grade, we would be receiving interest rates similar to those from investing in all the B4s. In other words, we receive the same interest rate as someone who invested in all B4s, but fewer of our loans would default.

This finding is really interesting because it suggests that our classification models are more useful than simply using Lending Club’s credit grade system. Investing in all of our “good loans” allows you to invest with a lower risk while receiving a higher interest rate, which is exactly what we want when investing. Using this classification method rather than the credit grades given to us would give an expected additional return on our investments of approximately 1.5% (B4 - (B2+B3)/2). This additional 1.5% translates to an additional $682.92 earned over five years on an initial investment of $5,000!

APPENDIX:

loans = read.csv("C:\\Users\\Owner\\Downloads\\LoanStats.csv")

CODE1

#create our new variables that we want to run a difference of means test on
#use regular expressions to extract only the numerical value from the
#debt-to-income percentage
DTI_num = as.numeric(gsub("%", "", loans$Debt.To.Income.Ratio))
#subtract the percentage from 100 so that the expected "good" debtor
#has a higher value
loans$DTI_inv = 100 - DTI_num
#divide each monthly income by their monthly payment
loans$payment_ratio = loans$Monthly.Income/loans$Monthly.PAYMENT
#use the substring function to get only the lower bound of the FICO range
#(the end of this call was cut off in the source; the arguments 1, 3 are an
#assumption that ranges look like "735-739")
loans$lower_score = as.integer(as.character(substr(loans$FICO.Range, 1, 3)))
#use regular expressions to isolate only the number in employment length
#(the line defining b is missing from the source; this reconstruction and
#the column name Employment.Length are assumptions)
b = gsub(" years?|\\+", "", as.character(loans$Employment.Length))
b[b=='< 1']<-0
loans$length_of_employment = as.numeric(b)
#take the reciprocal of the number of open credit lines so that our
#predicted "good" debtor has a higher value
loans$inv_num_cards = 1/loans$Open.CREDIT.Lines
#delete all observations that have an NA response for any of our variables
#(!is.na() also works when there are no NAs, unlike -which())
new_loans = loans[!is.na(loans$length_of_employment),]
new_loans = new_loans[!is.na(new_loans$lower_score),]
#subset our data into good loans and bad loans
bad_loans = subset(new_loans, Status %in% c("Charged Off", "Default",
    "Late (16-30 days)", "Late (31-120 days)"))
good_loans = subset(new_loans, Status == "Fully Paid")

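CHART3

#calculate the default rate for each credit grade by dividing the
#number of bad loans for each grade by the total number of loans
#completed for each grade
#(the start of this expression was cut off in the source; it is
#reconstructed from the comment above and assumes CREDIT.Grade is a factor)
Default_Rate = summary(bad_loans$CREDIT.Grade)/(summary(good_loans$CREDIT.Grade) + summary(bad_loans$CREDIT.Grade))
barplot(Default_Rate, xlab = "Letter Grade", ylab = "Default Percentage", main = "Default Rate By Letter Grade")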

CHART4 #plot the distributions of all of our variables of interest

par(mfrow = c(2,3))

plot(density(new_loans$length_of_employment), main = "Density for

Length of Employment")

plot(density(new_loans$inv_num_cards), main = "Density for Inverse # of


Page 19: Multivariate Data Analysis Project

plot(density(new_loans$payment_ratio[-which(new_loans$payment_ratio > 250)]), main = "Fixed Density for Payment Ratio")

plot(density(new_loans$payment_ratio), main = "Density for Payment


plot(density(new_loans$DTI_inv), main = "Density for Debt to Income


plot(density(new_loans$lower_score), main = "Density for FICO Score")

TEST1
#create a data matrix for both of our subsets

data_bad = matrix(c(bad_loans$DTI_inv, bad_loans$payment_ratio,

bad_loans$lower_score, bad_loans$length_of_employment,

bad_loans$inv_num_cards), ncol = 5)

data_good = matrix(c(good_loans$DTI_inv, good_loans$payment_ratio,

good_loans$lower_score, good_loans$length_of_employment,

good_loans$inv_num_cards), ncol = 5)

#create two numerical vectors of the means of each of our variables of
#interest for both of our subsets of data

means_bad = colMeans(data_bad)

means_good = colMeans(data_good)

#create the variance/covariance matrices for both of our subsets

s_bad = cov(data_bad)
s_good = cov(data_good)
#calculate the number of observations in each of our subsets
n_bad = nrow(data_bad)
n_good = nrow(data_good)

#pool our variance/covariance matrices since our variances seem to be
#almost equivalent
s_pooled = (n_good * s_good)/(n_bad + n_good) + (n_bad * s_bad)/(n_bad + n_good)

s_diff = s_pooled/n_good + s_pooled/n_bad

s_diff_inv = solve(s_diff)

#calculate our F* value using Hotelling's T-squared test
FStar = (t(means_good - means_bad)) %*% s_diff_inv %*% (means_good - means_bad)


p = 5

#calculate our critical F-value, which we will compare to our calculated
#F* value
F = ((p*(n_good + n_bad - 2))/(n_good + n_bad - p - 1)) * qf(.99, p, n_good + n_bad - p - 1)
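To make the TEST1 steps easier to follow end to end, here is a small self-contained sketch of a two-sample Hotelling T-squared comparison on simulated data. Everything in it is an assumption for illustration: the matrices x and y are random draws rather than loan data, and the pooled covariance uses the textbook (n - 1) weights, which differs slightly from the n-weighted pooling in the listing above.

```r
set.seed(1)                                      # hypothetical seed
p  <- 2                                          # number of variables
x  <- matrix(rnorm(40 * p), ncol = p)            # simulated group 1
y  <- matrix(rnorm(50 * p, mean = 1), ncol = p)  # simulated group 2, shifted mean
n1 <- nrow(x)
n2 <- nrow(y)

# difference of the two mean vectors
d <- colMeans(x) - colMeans(y)

# pooled covariance with (n - 1) weights (textbook form, an assumption here)
s_pooled <- ((n1 - 1) * cov(x) + (n2 - 1) * cov(y)) / (n1 + n2 - 2)

# T-squared statistic: quadratic form in the mean difference
T2 <- drop(t(d) %*% solve(s_pooled * (1/n1 + 1/n2)) %*% d)

# scaled F critical value at the 1% level, same form as in TEST1
crit <- (p * (n1 + n2 - 2)) / (n1 + n2 - p - 1) * qf(0.99, p, n1 + n2 - p - 1)

# reject equality of the mean vectors when the statistic exceeds the critical value
reject <- T2 > crit
```

With sample sizes as large as those in the loan data, the choice between n and (n - 1) weights should make very little practical difference.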

Logistic Regression ####################################

##variables for logistic regression#



#remove current and performing payment plan status
data <- subset(loans, Status != "Current" & Status != "Performing Payment Plan ")


#recode Status: Charged Off, Default, In Grace Period, Late: 0. Fully Paid: 1
data$Status <- as.factor(data$Status)
data$Status <- factor(with(data, ifelse ((Status != "Fully Paid"), 0, 1) ))



#years employed
#convert through as.character so the labels (0 to 10, NA) are kept rather
#than the internal factor codes
data[40] <- as.numeric(as.character(factor(data[[40]],
    levels = c("< 1 year", "1 year", "2 years", "3 years", "4 years", "5 years",
               "6 years", "7 years", "8 years", "9 years", "10+ years", "n/a"),
    labels = c("0", "1", "2", "3", "4", "5", "6", "7", "8", "9", "10", NA))))


#lower bound of the FICO range, converted to numeric
a <- substr(as.character(data[[26]]), 1, 3)
data[[26]] <- as.numeric(a)


# 1/number open credit lines

data[[28]] <- 1/data[[28]]


#[16] 100 - debt to income ratio

data[[16]] <- 100 - data[[16]]


#loans[25]/loans[13] monthly income/monthly payment

data[[25]] <- data[[25]]/data[[13]]


#[5] loan length, 0: 36 months, 1: 60 months

data$Loan.Length <- factor(with(data, ifelse ((Loan.Length == "36 months"), 0, 1) ))


#rename some variables

colnames(data)[c(26, 28, 16, 25)] <-


"OpenCreditLines(Reciprocal)", "100-DebtToIncomeRatio",


#combine everything so far

dataB <- cbind(data[c(14, 40, 26, 28, 16, 25, 3, 5, 4, 30, 35, 37)])




##include dummy variables

##[11] loan purpose## 0: no, 1: yes. 13-25 on final data frame


# "other" would be all 0s

Car <- factor(with(data, ifelse ((Loan.Purpose != "car"), 0, 1) ))
CreditCard <- factor(with(data, ifelse ((Loan.Purpose != "credit_card"), 0, 1) ))
DebtConsol <- factor(with(data, ifelse ((Loan.Purpose != "debt_consolidation"), 0, 1) ))
Educational <- factor(with(data, ifelse ((Loan.Purpose != "educational"), 0, 1) ))
HomeImprov <- factor(with(data, ifelse ((Loan.Purpose != "home_improvement"), 0, 1) ))
House <- factor(with(data, ifelse ((Loan.Purpose != "house"), 0, 1) ))
MajorPurchase <- factor(with(data, ifelse ((Loan.Purpose != "major_purchase"), 0, 1) ))
Medical <- factor(with(data, ifelse ((Loan.Purpose != "medical"), 0, 1) ))
Moving <- factor(with(data, ifelse ((Loan.Purpose != "moving"), 0, 1) ))
RenewableEnergy <- factor(with(data, ifelse ((Loan.Purpose != "renewable_energy"), 0, 1) ))
SmallBusiness <- factor(with(data, ifelse ((Loan.Purpose != "small_business" ), 0, 1) ))
Vacation <- factor(with(data, ifelse ((Loan.Purpose != "vacation"), 0, 1) ))
Wedding <- factor(with(data, ifelse ((Loan.Purpose != "wedding"), 0, 1) ))


#[24] home ownership, "NONE" would be all 0s. 26-29 on final data frame


Mortgage <- factor(with(data, ifelse ((Home.Ownership != "MORTGAGE"), 0, 1) ))
Other <- factor(with(data, ifelse ((Home.Ownership != "OTHER"), 0, 1) ))


Own <- factor(with(data, ifelse ((Home.Ownership != "OWN"), 0, 1) ))

Rent <- factor(with(data, ifelse ((Home.Ownership != "RENT"), 0, 1) ))

###combine everything###

dataB[c(13:29)] <- c(Car, CreditCard, DebtConsol, Educational, HomeImprov, House,
                     MajorPurchase, Medical, Moving, RenewableEnergy, SmallBusiness,
                     Vacation, Wedding, Mortgage, Other, Own, Rent)
colnames(dataB)[13:29] <- c("Car", "CreditCard", "DebtConsol", "Educational",
                            "HomeImprov", "House", "MajorPurchase", "Medical",
                            "Moving", "RenewableEnergy", "SmallBusiness", "Vacation",
                            "Wedding", "Mortgage", "Other", "Own", "Rent")

#remove rows with NAs

dataB <- na.omit(dataB)
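The indicator columns above are built one by one, with one category ("other" for loan purpose) left as the all-zero baseline. As an illustration of the same idea in fewer lines, the sketch below uses model.matrix from base R on a made-up purpose vector. The vector and its values are hypothetical, and this is an alternative formulation rather than the code used for the reported results.

```r
# hypothetical loan purposes, not taken from the loan data
purpose <- factor(c("car", "wedding", "other", "car"))

# make "other" the reference level so that it becomes the all-zero baseline
purpose <- relevel(purpose, ref = "other")

# one 0/1 column per remaining level; drop the intercept column
dummies <- model.matrix(~ purpose)[, -1]

# strip the "purpose" prefix that model.matrix adds to the column names
colnames(dummies) <- sub("^purpose", "", colnames(dummies))
```

Passing the factor itself to glm would produce the same coding internally, so explicit dummy columns are mainly useful when the columns need to be inspected or selected individually.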

#final data frame of response + predictors




#divide into test and training data


TestSet <- dataB[testID, ]

TrainingSet <- dataB[-testID, ]
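The line that defines testID is missing from this listing. As a rough sketch of how a held-out index could be drawn, the example below assumes a 30% test share (the proportion used in the sample() call near the end of this appendix). The data frame toy, the seed, and the toy_ names are hypothetical stand-ins for dataB and the objects above.

```r
set.seed(42)   # hypothetical seed, only so the split is reproducible

# stand-in for dataB: a binary response and one predictor
toy <- data.frame(Status = factor(rep(c(0, 1), 50)), x = rnorm(100))

# draw 30% of the row indices without replacement for the test set
toy_testID <- sample(seq_len(nrow(toy)), round(nrow(toy) * 0.3))

toy_test  <- toy[toy_testID, ]    # held-out rows
toy_train <- toy[-toy_testID, ]   # remaining rows used for fitting
```

Fixing the seed (or saving the index, as done later in the appendix) keeps the split identical across runs, so the accuracy figures of different models stay comparable.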


#Fitting the logistic Model#


logit.train <- glm(Status ~ .,

family = binomial,

na.action = na.exclude,

data = TrainingSet)



library(MASS) #provides stepAIC
fit1r <- stepAIC(logit.train)
TestSet$Prediction1 <- as.numeric(predict(fit1r, newdata=TestSet, type="response"))


hist(TestSet$Prediction1, ylab = "Frequency", xlab = "Predicted


main = "Predicted Probability for Testing Set", col = "lightblue")

#assessing accuracy of the model


# if status = 1 and predicted probability > .5, (i.e., prediction is


TestSet$Predicted1b <- factor(with(TestSet, ifelse ((Prediction1 < .5), "Bad Loan", "Good Loan") ))
TestSet$Actual <- factor(with(TestSet, ifelse ((Status == 0), "Bad Loan", "Good Loan") ))
library(caret) #provides confusionMatrix
cm1 <- confusionMatrix(data = TestSet$Predicted1b, reference = TestSet$Actual)



#using a .6 cutoff

TestSet$Predicted1c <- factor(with(TestSet, ifelse ((Prediction1 < .6), "Bad Loan", "Good Loan") ))
cm2 <- confusionMatrix(data = TestSet$Predicted1c, reference = TestSet$Actual)
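To show what moving the cutoff from .5 to .6 does, the sketch below builds the two classification tables in base R on six made-up predictions. The probabilities, the outcomes, and the helper classify are hypothetical; the real comparison is the one between cm1 and cm2 above.

```r
# hypothetical predicted probabilities of a loan being good, with actual outcomes
prob   <- c(0.92, 0.55, 0.48, 0.71, 0.30, 0.58)
actual <- factor(c("Good Loan", "Bad Loan", "Bad Loan",
                   "Good Loan", "Bad Loan", "Good Loan"),
                 levels = c("Bad Loan", "Good Loan"))

# hypothetical helper: label a loan "Good Loan" when its probability reaches the cutoff
classify <- function(p, cutoff) {
  factor(ifelse(p < cutoff, "Bad Loan", "Good Loan"),
         levels = c("Bad Loan", "Good Loan"))
}

# rows are predictions, columns are actual outcomes
tab50 <- table(Predicted = classify(prob, 0.5), Actual = actual)
tab60 <- table(Predicted = classify(prob, 0.6), Actual = actual)

# bad loans that were predicted good, the costly error for an investor
bad_as_good_50 <- tab50["Good Loan", "Bad Loan"]
bad_as_good_60 <- tab60["Good Loan", "Bad Loan"]
```

In this toy case the stricter cutoff removes the one bad loan that was predicted good, at the price of rejecting one loan that was actually good.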



# reducing the number of predictors for the sake of parsimony

TrainingSet2 <- TrainingSet[c(1:12)]

logit.2 <- glm(Status ~ .,

family = binomial,

na.action = na.exclude,

data = TrainingSet2)

fit2r <- stepAIC(logit.2)

#final Model for testing set, remove unwanted predictors


TrainingSet3 <- TrainingSet2[c(1:6,8:9)]

logit.3 <- glm(Status ~ .,

family = binomial,

na.action = na.exclude,

data = TrainingSet3)

FinalModel <- stepAIC(logit.3)


TestSet$Prediction2 <- as.numeric(predict(FinalModel, newdata=TestSet, type="response"))


#assessing the accuracy of the Final Model

TestSet$Predicted2b <- factor(with(TestSet, ifelse ((Prediction2 < .5), "Bad Loan", "Good Loan") ))
cm3 <- confusionMatrix(data = TestSet$Predicted2b, reference = TestSet$Actual)



library(vcd) #provides sieve and cotabplot
sieve(as.table(cm3),shade = TRUE, main = "Mosaic Plot for Accuracy of Final Model")
cotabplot(as.table(cm3),shade = TRUE, main = "Mosaic Plot for Accuracy of Final Model")


#prediction for undetermined loan status#


#all observations with determined status

dataC <- dataB[c(1:6,8:9)]

#(repetitive code omitted)

Predict2 <- Predict[c(14, 40, 26, 28, 16, 25, 5, 4)]

names(Predict2)[c(1:8)] <- names(TrainingSet3)

logit.4 <- glm(Status ~ .,

family = binomial,

na.action = na.exclude,

data = dataC)

Classifier <- stepAIC(logit.4)


Predict2$Probability <- as.numeric(predict(Classifier, newdata=Predict2, type="response"))
library(lattice) #provides histogram, barchart and dotplot
histogram(Predict2$Probability, main = "Predicted Probability of Good Loans", col = "lightblue", xlab = "Probability", ylab = "Frequency", type = "percent")
Predict2$PredictedStatus <- factor(with(Predict2, ifelse ((Probability < .5), "Bad Loan", "Good Loan") ))


Bagging: Logistic Regression ###########

# Bagging #



# code from:


logistic.bagging <-



predictions<-foreach(m=1:iterations,.combine=cbind) %do% {

training_positions <- sample(nrow(training),


train_pos <- 1:nrow(training) %in% training_positions

lm_fit<-glm(Status~., family = binomial,

na.action = na.exclude,data=training[train_pos,])

lm_fit2 <- stepAIC(lm_fit)

predict(lm_fit2, newdata = testing, type = "response")




TestPrediction.bag <- logistic.bagging(training = TrainingSet3, testing = TestSet[-1])
TestSet$Predict.bag <- factor(ifelse ((TestPrediction.bag < .5), "Bad Loan", "Good Loan"))
cmTest.bag.logistic <- confusionMatrix(data = as.factor(TestSet$Predict.bag), reference = TestSet$Actual)
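Because parts of the logistic.bagging definition are missing from this listing, the following is a simplified sketch of the bagging idea itself rather than a reconstruction of that function. It assumes bootstrap resamples drawn with replacement, 25 iterations, and no stepwise selection, and it uses base R's sapply instead of foreach. The function bag_logistic and the simulated data frame toy are hypothetical.

```r
set.seed(7)   # hypothetical seed
n   <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
# simulated binary response that depends on x1
toy$Status <- factor(rbinom(n, 1, plogis(0.5 + toy$x1)))
train <- toy[1:150, ]
test  <- toy[151:200, ]

# hypothetical helper: average the predictions of logistic models
# fitted to bootstrap resamples of the training data
bag_logistic <- function(training, testing, iterations = 25) {
  preds <- sapply(seq_len(iterations), function(m) {
    idx <- sample(nrow(training), replace = TRUE)   # bootstrap resample
    fit <- glm(Status ~ ., family = binomial, data = training[idx, ])
    predict(fit, newdata = testing, type = "response")
  })
  rowMeans(preds)   # one averaged probability per test row
}

bag_prob  <- bag_logistic(train, test)
bag_class <- ifelse(bag_prob < 0.5, "Bad Loan", "Good Loan")
```

Averaging over resamples mainly reduces variance, so for a stable model such as logistic regression the bagged predictions can be expected to stay close to those of a single fit.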

#Prediction for undetermined loan status

logistic.prediction.bag <- logistic.bagging(training = dataC, testing =



Predict2$CreditGrade <- Predict$CREDIT.Grade

ProbabilityByGrade <- tapply(Predict2$Probability, INDEX = Predict2$CreditGrade, FUN = mean)
ByCreditGrade <- as.matrix(cbind(c(as.numeric(ProbabilityByGrade)),
    c("A1", "A2", "A3", "A4", "A5", "B1", "B2", "B3", "B4", "B5",
      "C1", "C2", "C3", "C4", "C5", "D1", "D2", "D3", "D4", "D5",
      "E1", "E2", "E3", "E4", "E5", "F1", "F2", "F3", "F4", "F5",
      "G1", "G2", "G3", "G4", "G5")))
barchart(ProbabilityByGrade, xlab = "Mean Probability of Loan being Paid in Full",
    ylab = "Letter Grade", main = "Probability that a loan is a Good Loan by Letter Grade",
    col = c(rep("firebrick2", 5), rep("darkorange2", 5), rep("gold2", 5),
            rep("darkseagreen4", 5), rep("dodgerblue2", 5), rep("darkorchid4", 5),
            rep("darkviolet", 5)))

ProbabilityByState <- tapply(Predict2$Probability, INDEX = Predict2$State, FUN = mean, na.rm = TRUE)

dotplot(as.matrix(sort(ProbabilityByState)), xlab = "Mean Probability of Loan being Paid in Full",
    ylab = "State", main = "Probability that a Loan is a Good Loan by State",
    col = "blue")


#recreated data in an easier way on a later date

loans<-loandat[loandat$Status%in%c('Charged Off','Default','Fully


'Late (16-30 days','Late (31-120


# change interest.rate to numeric values



# Dependent variable


loans$type[loans$Status=='Fully Paid']<-'Good'


testID<-sample(1:nrow(data),round(nrow(data)*.3)) # about 30% of data in test set

save('testID',file='C:/Users/Yue Shen/Desktop/testID.R')

load('C:/Users/Yue Shen/Desktop/testID.R')


loans.test<-loans.test[!is.na(loans.test$length_of_employment),] # get rid of NA's
loans.train<-loans.train[!is.na(loans.train$length_of_employment),] # get rid of NA's


Random Forest
library(randomForest)




#> rfloanscv$

#12 6 3 1

#0.1828936 0.1847596 0.1828851 0.1850041

# Thus doesn't depend much on number of variables at each step










main='(Actual=Bad and Predict=Good)/(Predict=Good)

by number of trees in random forest',

type='l',xlab='Number of trees X 10',ylab='Value for the



order(numtreerf) # 20 trees











# to calculate median credit.grade





Boosting
numtreeb2<-sapply(11:20,function(i){







plot(numtreeb,main='(Actual=Bad and Predict=Good)/(Predict=Good)

by number of trees in boosting',

type='l',xlab='Number of trees X 10',ylab='Value for the







