classification via logistic regression

Classification via Logistic Regression: Predicting Probability of Speed Dating Success

Taweh Beysolow II
Professor Nagaraja
Fordham University


I. Introduction

In speed dating, participants meet many people, each for a few minutes, and then
decide who they would like to see again. The data set we will be working with contains
information on speed dating experiments conducted on graduate and professional
students. Each person in the experiment met with 10-20 randomly selected people of the
opposite sex (only heterosexual pairings) for four minutes. After each speed date, each
participant filled out a questionnaire about the other person. Our goal is to build a model
to predict which pairs of daters want to meet each other again (i.e., have a second date).

The variables are as follows:

• Decision: 1 = Yes (you would like to see the date again), 0 = No (you would not like to see the date again)

• Like: Overall, how much do you like this person? (1 = not at all, 10 = like a lot)

• PartnerYes: How probable do you think it is that this person will say ‘yes’ for you? (1 = not probable, 10 = extremely probable)

• Age: Age

• Race: Caucasian, Asian, Black, Latino, or Other

• Attractive: Rate attractiveness of partner on a scale of 1 – 10 (1 = awful, 10 = great)

• Sincere: Rate sincerity of partner on a scale of 1 – 10 (1 = awful, 10 = great)

• Fun: Rate how fun partner is on a scale of 1 – 10 (1 = awful, 10 = great)

• Ambitious: Rate ambition of partner on a scale of 1 – 10 (1 = awful, 10 = great)

• Shared Interest: Rate the extent to which you share interests/hobbies with partner on a scale of 1 – 10 (1 = awful, 10 = great)

We will be using a reduced version of this experimental data with 276 unique
male-female date pairs. In the file “SpeedDating.csv”, each variable name ends in either
“M” for male or “F” for female. For example, “LikeM” refers to the “Like” variable as
answered by the male participant (about the female participant). We treat the rating scale
variables (such as “PartnerYes”, “Attractive”, etc.) as numerical variables instead of
categorical ones for our analysis.


II. Exploratory Analysis

When constructing the contingency table below, we see the following results:

We observe that approximately 22.83% of those who participated in the study are
interested in a second date. Moving forward, we will say a second date is planned only
if both people within the matched pair want to see each other again. We therefore add a
new column to our data set and call it “second.date”. The value in this column is 1 if
there will be a second date and 0 if there will not.
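The rule for “second.date” can be sketched in a few lines. This is only an illustration in Python, not the code used for the analysis: the four rows are made up, and the column names “DecisionM” and “DecisionF” are assumed from the “M”/“F” naming convention described in the introduction.

```python
# Hypothetical rows: each dict is one male-female pair. The column names
# DecisionM / DecisionF are assumed from the "M"/"F" suffix convention.
pairs = [
    {"DecisionM": 1, "DecisionF": 1},
    {"DecisionM": 1, "DecisionF": 0},
    {"DecisionM": 0, "DecisionF": 1},
    {"DecisionM": 0, "DecisionF": 0},
]


def second_date(pair):
    """Return 1 only when both participants said yes, otherwise 0."""
    return int(pair["DecisionM"] == 1 and pair["DecisionF"] == 1)


# Add the new column to every row.
for pair in pairs:
    pair["second.date"] = second_date(pair)

second_dates = [pair["second.date"] for pair in pairs]
share = sum(second_dates) / len(second_dates)  # fraction of pairs with a second date
print(second_dates, share)
```

Only the pair in which both decisions equal 1 receives a second date; a single “yes” is not enough.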

Here, we observe that there are roughly equal numbers of data points in each of the
corners of the scatterplot. The plot is the visual counterpart of the contingency table
described above. Using the jitter function, we added random noise to the points;
otherwise, all of the data points would have been plotted on top of each other at the
corresponding corners of the graph. Blue denotes clusters of people who will be going on
a second date, whereas red denotes those who will not.
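To illustrate why jittering is needed, the sketch below adds a small amount of uniform noise to binary decisions so that identical points no longer overlap. It is similar in spirit to a jitter function but is not a reimplementation of any particular library routine; the noise width, the seeds, and the sample values are all assumptions chosen for illustration.

```python
import random


def jitter(values, amount=0.1, seed=0):
    """Add uniform noise in [-amount, amount] to each value."""
    rng = random.Random(seed)  # seeded so the sketch is reproducible
    return [v + rng.uniform(-amount, amount) for v in values]


# Hypothetical binary decisions for six pairs.
decision_m = [0, 0, 1, 1, 1, 0]
decision_f = [0, 1, 0, 1, 1, 0]

# Without jitter, these six pairs occupy only four distinct plot positions.
x = jitter(decision_m, seed=1)
y = jitter(decision_f, seed=2)
print(list(zip(x, y)))
```

Each jittered point stays within 0.1 of its original corner, so the four clusters remain recognizable while individual points become visible.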


All of the variables, with the exception of the decision, second date, and age variables,
are on a 1 to 10 scale. The recorded responses, excluding NA values, range from 1 to 10.
Furthermore, there are exactly 142 missing entries. These are scattered among all of the
variables except the decision variable and our newly constructed second date variable.
Because the great majority of the variables are on a 1 to 10 scale, any sort of centering
of values would be unnecessary. Of the variables that are not on a 1 to 10 scale, we
exclude the decision variable from our model; should we choose to use any of the others,
we would consider centering or normalization of some sort. We should hesitate to remove
NA values before we decide which variables to use in our model, as we do not want to
reduce our sample size unnecessarily.

The possible categories for race observed are Asian, Black, Caucasian, Latino, and
Other. There are 3 missing race values within the data set. It is worth considering whether
to delete these data points; however, this should be done only if we use this variable in
our experiment. The reason is that while a respondent might have forgotten to fill out
their race, they could have filled out responses for other variables. We do not want to
remove observations from our study unnecessarily, so as to preserve as large a sample
size as possible.
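The point about delaying NA removal can be made concrete with a small sketch. It is an illustration only: `None` stands in for NA, the three rows are invented, and the column names are hypothetical placeholders rather than the actual headers in “SpeedDating.csv”.

```python
# None stands in for NA. All values and column names are hypothetical.
rows = [
    {"SharedInterestsM": 7, "SharedInterestsF": 6, "RaceM": "Asian"},
    {"SharedInterestsM": 5, "SharedInterestsF": None, "RaceM": "Black"},
    {"SharedInterestsM": 8, "SharedInterestsF": 4, "RaceM": None},
]


def complete_cases(rows, columns):
    """Keep rows that have no missing value in the listed columns only."""
    return [r for r in rows if all(r[c] is not None for c in columns)]


# Dropping rows based only on the variables the model uses keeps more data
# than dropping every row that has a missing value anywhere.
model_columns = ["SharedInterestsM", "SharedInterestsF"]
kept_for_model = complete_cases(rows, model_columns)
kept_all_columns = complete_cases(rows, list(rows[0].keys()))
print(len(kept_for_model), len(kept_all_columns))
```

In this toy example, the row with a missing race survives when race is not a model variable, which is exactly the sample size the paragraph above aims to preserve.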


III. Experimental Design

To determine the best logistic regression model, we first included all of the variables
on a 1 – 10 scale and left out race as well as the decision variables. This initial model is
relatively weak, so we perform backwards stepwise regression, keeping at each step the
model with the lowest AIC (a lower AIC indicates a better trade-off between fit and
complexity). After two more iterations, we stop at our third model. With second date as
the response variable, shared interests (male) and shared interests (female) are the
explanatory variables. The summary output for this model is shown below:

We observe an AIC of 211.09, and all variables are statistically significant at the 5%
level. As for the regression assumptions, the data were collected independently, as test
subjects gave their responses separately, and we assume that these responses were
recorded accurately. There are few outliers, as the scores are limited to defined ranges
from which no participant deviated (with the exception of the NA values, which the
regression can handle). Our sample size is reasonably large at 226, with approximately
82% of the original 276 pairs preserved. With respect to the residuals, we observe the
following:


The errors exhibit independence, and the residuals are substantial enough to show that
knowing the x values does not completely determine whether y = 0 or y = 1. We
therefore conclude that our regression assumptions are satisfied and proceed with the
remainder of the experiment. We remove NA values by row and establish a threshold
based on what maximizes our sensitivity statistic. As described above, this removes 50
rows containing NA values (276 - 226 = 50), and our threshold for the predicted
probabilities is approximately 48%.

When looking at the slope coefficients for our explanatory variables and the intercept,
we observe the following:

Here, “sharm” is Shared Interests Male and “sharf” is Shared Interests Female.
Both variables have similar slopes, and both increase the probability of a second date.
The intercept, however, is strongly negative relative to the slopes. While we expected
both variables to increase the probability of a second date, as they are highly correlated
with one another, it was surprising that the intercept was this negative. Intuitively, this
indicates that the model assumes by default that two individuals will not be a good match
for each other until the data suggest otherwise.
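The role of the intercept can be seen by writing out the logistic function that turns the linear predictor into a probability. The sketch below is illustrative only: the coefficient values are hypothetical stand-ins chosen to mimic a strongly negative intercept with two similar positive slopes, and they are not the fitted values from the model summary.

```python
import math


def predicted_probability(sharm, sharf, intercept, b_sharm, b_sharf):
    """Logistic model: p = 1 / (1 + exp(-(b0 + b1 * sharm + b2 * sharf)))."""
    log_odds = intercept + b_sharm * sharm + b_sharf * sharf
    return 1.0 / (1.0 + math.exp(-log_odds))


# Hypothetical coefficients, NOT the fitted values from the report.
INTERCEPT, B_SHARM, B_SHARF = -4.0, 0.3, 0.3

# Lowest and highest possible ratings on the 1 - 10 scale.
low = predicted_probability(1, 1, INTERCEPT, B_SHARM, B_SHARF)
high = predicted_probability(10, 10, INTERCEPT, B_SHARM, B_SHARF)
print(round(low, 4), round(high, 4))
```

With a negative intercept, the predicted probability starts well below 50% and rises above it only when both shared interest ratings are high enough to offset the intercept.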

As mentioned above, we chose a threshold of roughly 48%, which was calculated as the
average of the mean of the odds and the median of the odds produced by the model. We
arrived at this choice by seeking to maximize the sensitivity statistic while also retaining
a reasonably high specificity. We compared four candidate thresholds: 10 percent, 20
percent, the mean of the odds, and the average of the mean and median of the odds, and
observed their effect on each of these statistics. Under the final threshold, sensitivity was
approximately 89% and the other statistics were higher than under the alternatives, so we
chose 48%.
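The threshold comparison described above can be sketched as follows. This is a toy illustration, not the analysis itself: the labels and predicted probabilities are invented, and only the candidate thresholds 0.10, 0.20, and 0.48 are taken from the text.

```python
def sensitivity_specificity(y_true, probs, threshold):
    """Classify as 1 when the probability is at least the threshold, then
    return (sensitivity, specificity) for those predictions."""
    preds = [1 if p >= threshold else 0 for p in probs]
    tp = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 1)
    fn = sum(1 for y, p in zip(y_true, preds) if y == 1 and p == 0)
    tn = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 0)
    fp = sum(1 for y, p in zip(y_true, preds) if y == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


# Hypothetical outcomes and predicted probabilities for eight pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
probs = [0.9, 0.6, 0.3, 0.4, 0.2, 0.55, 0.1, 0.5]

# Evaluate each candidate threshold and keep the results for comparison.
results = {t: sensitivity_specificity(y_true, probs, t) for t in (0.10, 0.20, 0.48)}
for t, (sens, spec) in results.items():
    print(f"threshold={t:.2f} sensitivity={sens:.2f} specificity={spec:.2f}")
```

Very low thresholds label almost every pair as a second date, which drives sensitivity to 1 and specificity toward 0; raising the threshold trades some sensitivity for specificity.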

IV. Results

a. Accuracy 0.6725664

b. Sensitivity 0.8888889

c. Specificity 0.4

d. ROC Curve

Area Under the Curve: 0.696
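For readers unfamiliar with the area under the ROC curve, the sketch below computes it from its rank interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The data are invented for illustration, so the result is unrelated to the 0.696 reported above.

```python
def auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs in which the positive
    case has the higher score; ties count as half a win."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum(
        1.0 if p > n else (0.5 if p == n else 0.0)
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical outcomes and predicted probabilities for eight pairs.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
scores = [0.9, 0.6, 0.3, 0.4, 0.2, 0.55, 0.1, 0.5]

area = auc(y_true, scores)
print(area)
```

An AUC of 0.5 corresponds to random ranking and 1.0 to perfect ranking, so a value near 0.70 indicates moderate discriminating power.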


V. Conclusion

As we can see from the new contingency table, our model has a sensitivity of
approximately 89%. We chose to maximize sensitivity rather than other statistics, such as
accuracy or specificity, because in a contemporary application of this data there is much
more benefit in correctly matching people than in preventing matches between people
who might not like each other as much. Should this model be used in online dating
applications, the operator would want to maximize user retention and attract new users,
and most people’s highest priority when using a dating application is to match correctly
with someone.

That said, the accuracy of the model under this threshold is lower than that of the
model when using the threshold determined by the glm function, as that threshold
maximizes sensitivity and specificity jointly. Our accuracy is therefore not as high as it
could be, but our sensitivity is markedly higher. An increase in sensitivity generally
comes with a decrease in specificity, so some loss must be accepted when choosing a
threshold to maximize one of these statistics. In conclusion, an accuracy of approximately
67% and an AUC of approximately 0.70 allow us to forecast second dates to a reasonable
degree. Using the glm function without adjusting the threshold, however, leads to a
higher AUC.
