When Words Sweat:
Identifying Signals for Loan Default in the Text of Loan Applications
Oded Netzer, Associate Professor of Marketing, Columbia Business School, Columbia University, [email protected]
Alain Lemaire, Doctoral Student, Columbia Business School, Columbia University, [email protected]
Michal Herzenstein, Associate Professor of Marketing, Lerner College of Business and Economics, University of Delaware, [email protected]
March 2018
Equal authorship. We thank Columbia Business School and the Lerner College at the University of Delaware for financial support.
Abstract
The authors present empirical evidence that borrowers, consciously or not, leave traces of their
intentions, circumstances, and personality traits in the text they write when applying for a loan.
This textual information has a substantial and significant ability to predict whether borrowers
will pay back the loan over and beyond the financial and demographic variables commonly used
in models predicting default. The authors use text-mining and machine-learning tools to
automatically process and analyze the raw text in over 120,000 loan requests from Prosper.com,
an online crowdfunding platform. The authors find that loan requests written by defaulting
borrowers are more likely to include words related to their family, mentions of God, the
borrower’s financial and general hardship, pleading with lenders for help, and short-term focused
words. The authors further observe that defaulting loan requests are written in a manner
consistent with the writing style of extroverts and liars. Using a counterfactual analysis, the
authors demonstrate that applying their findings can yield a 9.7% additional return on investment.
Keywords: text mining, consumer finance, loan default
Imagine you are considering lending $2,000 to one of two borrowers on a crowdfunding
website. The borrowers are identical in terms of their demographic and financial characteristics,
the amount of money they wish to borrow, and the reason for borrowing the money. However,
the text they provided when applying for a loan differs: Borrower #1 writes “I am a hard working
person, married for 25 years, and have two wonderful boys. Please let me explain why I need
help. I would use the $2,000 loan to fix our roof. Thank you, god bless you, and I promise to pay
you back.” while borrower #2 writes “While the past year in our new place has been more than
great, the roof is now leaking and I need to borrow $2,000 to cover the cost of the repair. I pay
all bills (e.g., car loans, cable, utilities) on time.” Who is more likely to default on her loan? This
question is at the center of our research, as we investigate the power of words in predicting loan
default. We claim and show that the text borrowers write at loan origination provides valuable
information that cannot be otherwise extracted from the typical data lenders have on borrowers
(which mostly include financial and demographic data), and that additional information is crucial
to predictions of default. The idea that text written by borrowers can predict their loan default
builds on recent research showing that text is indicative of people’s psychological states, traits,
opinions, and situations (e.g., Humphreys and Jen-Hui Wang 2018; Matz and Netzer 2017), as
well as providing information on consumers’ and markets’ behaviors, such as consumers’ top-of-
mind associations (Netzer et al. 2012) and stock prices (Tirunillai and Tellis 2012).
In essence, the decision whether or not to grant a loan depends on the lender’s assessment
of the borrower’s ability to repay the loan. But this assessment is often difficult because loans are
repaid over a lengthy period of time, during which unforeseen circumstances may arise. For that
reason, traditional lenders (such as banks) and researchers collect and process as many pieces of
information as possible within this tightly regulated industry.1 These pieces of information can
be classified into four categories: (1) The borrower’s financial strength, which is reflected by
one’s credit history, FICO score, income, and debt (Mayer, Pence, and Sherlund 2009), and is
most predictive of the borrower’s likelihood of default (Avery et al. 2000); (2) Demographics, such as
gender or geographic location (Rugh and Massey 2010); (3) Information related to the loan
itself—the amount that is being borrowed and the interest rate (Gross et al. 2009); (4) Everything
else that can be learned from human interactions between borrowers and people at the loan
granting institutions. Indeed, Agarwal and Hauswald (2010) found that supplementing the loan
application process with the human touch of loan officers significantly decreases default rate due
to better screening and higher interpersonal commitment from borrowers. However, these
interactions are often laborious and expensive.
Indeed, in recent years, human interactions between borrowers and lenders have been
largely replaced by online lending platforms operated by banks, other lending institutions, or
crowdfunding platforms. In such environments, the role of both hard and soft pieces of
information becomes crucial. Accordingly, our main proposition in this paper is that the text
potential borrowers write when requesting an online crowdfunded loan provides additional
important information on the borrower, such as their intentions, personality, and true
circumstances—information that cannot be deduced from the financial and demographic data
alone, and that is in a sense analogous to body language (or unspoken language) detected by loan
officers. Furthermore, because the evaluation of the borrower by a loan officer has been shown
to provide additional information over and beyond the hard financial information (Agarwal and
Hauswald 2010), we contend that, in a similar vein, the text borrowers write when applying for a
1 Indeed, our personal discussions with data scientists at banks and other lending institutions suggest that they analyze as many as several thousand data points.
crowdfunded loan is predictive of loan default above and beyond all other available information.
This hypothesis extends the idea that our demeanor can be a manifestation of our true intentions
(DePaulo et al. 2003) into the text we write.
Going back to the opening example, to answer the question “who is more likely to
default?” we apply text-mining and machine-learning tools to a dataset of over 120,000 loan
requests from the crowdfunding platform Prosper. Using an ensemble stacking approach that
includes tree-based methods and regularized logistic regressions, we find that the textual
information significantly improves predictions of default, increasing lenders’ ROI over an
approach that uses only financial and demographics information by as much as 9.7%. To learn
which words, writing styles, and general ideas conveyed by the text are more likely to be
associated with defaulted loan requests, we further analyzed the data using a multi-method
approach including a naïve Bayes classifier, an L1-regularized binary logistic model, a latent
Dirichlet allocation (LDA) analysis, and the Linguistic Inquiry and Word Count dictionary (LIWC;
Tausczik and Pennebaker 2010). Results across the first three analyses consistently show that
loan requests written by defaulting borrowers are more likely to include words (or themes)
related to the borrower’s family, financial and general hardship, mentions of God, and the near
future, as well as pleading with lenders for help, and using verbs in present and future tenses.
Therefore, the text and writing style of borrower #1 in our opening example suggest this person
is more likely to default. In fact, all else equal, our analysis shows that based on the loan request
text, borrower #1 is approximately eight times more likely to default relative to borrower #2.
These analyses demonstrate the successful use of machine learning tools to go beyond mere
prediction, with the objective of inferring the words, topics, and writing styles that are most
associated with a behavioral outcome. Additionally, we applied the LIWC dictionary to our data,
to allow a deeper exploration into the potential traits and states of borrowers. Our results suggest
that defaulting loan requests are written in a manner consistent with the writing styles of
extroverts and liars. We do not claim that defaulting borrowers were intentionally deceptive
when they wrote the loan request; rather, we believe their writing style may have reflected,
intentionally or not, doubts in their ability to repay the loan.
Our examination into the manifestation of consumers’ personalities and states in the text
they write during loan origination contributes to the fast-growing literature in consumer financial
decision making. Recently, consumer researchers have started to investigate factors that encourage or
inhibit consumer saving (Dholakia et al. 2016; Soman and Cheema 2011), debt acquisition
(Herzenstein, Sonenshein, and Dholakia 2011), and repayment (Amar et al. 2011). Most of these
investigations have been done on a smaller scale, such as with experimental participants or
smaller datasets. We add to this literature by showing, on a large scale and with archival data,
how participants in the crowdfunding industry can better assess the risk of default by interpreting
and incorporating soft unverifiable data (the words borrowers write) into their analysis.
The rest of this paper is organized as follows. In the next section we discuss the
limitations of structured financial and demographic data in the context of consumer financial
decisions and the opportunity offered by leveraging textual information. Specifically, drawing on
extant literature we argue that credit scores and demographics miss important information about
borrowers that lenders can learn from mining the text these borrowers write. We then delineate
the data, our text-mining and modeling approaches, and results. We conclude by interpreting our
results and generalizing our approach to other contexts.
WHAT DO HARD DATA MISS? THE OPPORTUNITY IN TEXT
Financial data such as credit scores have been used extensively to predict consumers’
credit riskiness, which is the likelihood they will become seriously delinquent on their credit
obligations over the next two years. FICO, the dominant credit scoring model, is used by 90% of
all financial institutions in the U.S. in their decision-making process (according to
myFICO.com). In calculating consumers’ scores, FICO takes into account a multitude of data
points including past and current loans, and credit cards—their allowance and utilization as well
as any delinquencies that were reported to the credit bureau (by companies such as cable TV, cell
phone providers, etc.). However, while the use of credit scores and financial information clearly
has its benefits in predicting consumer risk, these measures have been found to be insufficient
and even biased. For example, Avery et al. (2000) argue that the reliance of credit scores on
available consumers’ credit history misses important factors such as health status and length of
employment, which are more forward-looking in nature. Credit scores are a snapshot of the past
and do not reflect any (positive or negative) change that may occur in the future. The authors
find that those with low credit scores often benefit most from such additional data. Agarwal,
Skiba, and Tobacman (2009) provide further evidence for this deficiency of credit scores by
showing that the more flexible credit score systems, which weigh the different data sources that
comprise the score differently for each person, are eight times better at predicting loan default
than the more rigid systems such as the FICO score. Put differently, FICO scores might predict
future financial behavior well for some but not others. For example, if a person is facing some
unexpected health related expenditures that will ultimately drain their reserves and lower their
credit score, FICO will only know about it after it happens. A credit scoring system that
incorporates health status will predict the financial situation of that person better.
Analyzing the subprime mortgage default during 2006-7, Palmer (2015) and Sengupta
and Bhardwaj (2015) show that current contextual information such as loan characteristics (e.g.,
loan amount) and economic characteristics (e.g., housing prices) are often more predictive of
mortgage default than traditional financial measures such as FICO scores and debt-to-income ratios.
Thus, credit scores miss important information that may be predictive of consumers’
financial health, and heavy reliance on such scores might keep lending organizations
from looking at additional sources of information.
Recognizing the limitation of financial measures, financial institutions and researchers
added other variables to their predictive models, mostly related to the individuals’ demographic,
geographic, and psychographic characteristics (e.g., Barber and Odean 2001; Pompian and Longo
2004). However, other characteristics, such as those related to emotional and mental states, as
well as personalities, have been found to be closely tied to financial behaviors and outcomes
(Norvilitis et al. 2006; Rustichini et al. 2016), yet are still missing from those predictive models.
The problem of relying purely on structured financial and demographic information is even more
severe in the growing realm of crowdfunded unsecured loans, because human interaction between
lenders and borrowers is scarce. Therein lies the problem: personalities and mental states are
difficult, if not impossible, to infer from demographic or financial data alone, hence the pressing
need to go beyond such traditional data when attempting to predict behavior. We suggest and
demonstrate that the text borrowers write at loan origination can provide the much needed
supplemental information. Specifically, this textual information can be useful in understanding
not only the past behavior of the borrower, but also the present context of the loan and the
borrower’s future intentions.
WORDS, INDIVIDUAL CHARACTERISTICS, AND FINANCIAL BEHAVIORS
Our proposition that text can predict default builds on research in marketing, psychology,
linguistics, and finance that establishes two connections: (1) word usage and writing styles are
indicative of some stable inner traits as well as more transient states; and (2) these traits and
states affect people’s financial behaviors. In this section we provide evidence for these two
connections.
The premise that text may be indicative of deeper traits and emotional states is
predicated on the idea that there is a systematic relationship between the words people use and
their personality traits (Fast and Funder 2008; Hirsh and Peterson 2009), identities (McAdams
2001), and emotional states (Tausczik and Pennebaker 2010). The relationship between word
usage and personality traits has been found across multiple textual media such as essays about
the self (Hirsh and Peterson 2009), blogs (Yarkoni 2010), social media (Schwartz et al. 2013),
and daily speech (Mehl, Gosling, and Pennebaker 2006). This relationship stems from the
human tendency to tell stories and express internal thoughts and emotions through these stories,
which are essentially made possible by language. Therefore, even if the content might be similar
across different individuals, the manner in which they convey that content differs.
Research over the last two decades has established the association between word usage and
the text originator’s personality, focusing on the big five personality traits—extraversion,
agreeableness, conscientiousness, neuroticism, and openness (Kosinski, Stillwell, and Graepel
2013; Pennebaker and Graybeal 2001; Pennebaker and King 1999), physical and mental health
(Preotiuc-Pietro et al. 2015), age and gender (Pennebaker and Stone 2003; Schwartz et al. 2013),
emotional state (Pennebaker, Mayne, and Francis 1997), impulsivity (Arya, Eckel, and Wichman
2013), and deception (Newman et al. 2003; Ott, Cardie, and Hancock, 2012).
The aforementioned literature conveys the potential in analyzing text—the ability to
extract a wealth of information otherwise unobtainable in many settings. However, without
understanding how this information can shed light on financial behavior, it has little to no value
to our purpose. Indeed, many researchers examined people’s financial decision making from a
psychological or behavioral lens. For example, Anderson et al. (2011) show that credit scores are
correlated with the big five personality traits. Specifically, they find that both extroversion
and conscientiousness are negatively related to credit scores. Extroverts wish to have exciting
lifestyles and hence sometimes spend beyond their means. The result for conscientiousness is
somewhat surprising because conscientious people are diligent and responsible; however, their need
for achievement is high, which might induce spending beyond their means. Berneth, Taylor,
and Walker (2011) found that agreeableness is negatively correlated with FICO scores because
these people aim to please and are less likely to say no to unnecessary expenses. The extent to
which such traits are predictive of financial behavior over and beyond credit scores is an
empirical question, which we investigate in this research.
Other individual characteristics have also been found to be related to financial behavior.
Arya, Eckel, and Wichman (2013) find that FICO scores are correlated with personality, time
and risk preference, trustworthiness, and impulsiveness; Nyhus and Webley (2001) show that
emotional stability, autonomy, and extroversion are robust predictors of saving and borrowing
behaviors; and Norvilitis et al. (2006) show that debt is related to delay of gratification. Financial
behaviors such as saving, taking loans, and credit card usage, as well as overall scores such as
FICO have also been found to be correlated with education (Nyhus and Webley 2001), financial
literacy (Fernandes, Lynch, and Netemeyer 2014), number of hours people work (Berneth,
Taylor, and Walker 2011), stress and physical wellbeing (Netemeyer et al. 2018), self-regulation
(Freitas et al. 2002), and even the number of social media connections (Wei et al. 2015).
In sum, we postulate that many of the behavioral characteristics and situational factors that
have been found to be related to financial behaviors, such as personality traits and future
intentions, can be extracted from the text borrowers write in their loan request. Therefore, that text
is an invaluable addition to the hard financial and demographic data when predicting loan default.
SETTINGS AND DATA
We examine the value of text in predicting default using data from Prosper.com, the first
online crowdfunding platform and currently the second largest in the United States, with over 2
million members and $10 billion in funded unsecured loans. On Prosper, potential borrowers
submit a request for a loan of a specific amount with a specific maximum interest rate they
are willing to pay, and lenders then bid in a Dutch-like auction on the lender rate for the loan. We
downloaded all loan requests posted between April 2007 and October 2008, a total of 137,952
listings. In October 2008, the Securities and Exchange Commission required Prosper to register
as a seller of investments, and when Prosper re-launched in July 2009 it made significant changes
to the platform. We chose data from the earlier period because it is richer and more diverse,
particularly with respect to the textual information in the loan requests.
When posting a loan request on Prosper, potential borrowers have to specify the loan
amount they wish to borrow (between $1,000 and $25,000 in our data), the maximum interest
rate they are willing to pay, and other personal information, such as debt to income ratio and
whether they are home owners. Prosper verifies all financial information including the potential
borrower’s credit score from Experian, and assigns each borrower a credit grade that reflects all
of this information. The possible credit grades are AA (lowest risk for lenders), A, B, C, D, E,
and HR (highest risk for lenders). Table A1 in the Web Appendix presents the correspondence
between Prosper’s credit grades and FICO score. In addition, borrowers can upload as many
pictures as they wish, and use an open textbox to write any information they wish, with no length
restriction. The words borrowers write in that textbox are at the center of our research.
Because we are interested in predicting default, we focus on those loan requests that were
funded (19,446 requests), and of them only on those that included text (18,312 loans). The
default rate in our sample of loans is 33.1% (35% in all funded loans, n = 19,446).
We automatically text-mined the raw text in each loan application using the tm package
in R. Our textual unit is a loan application. For each loan application, we first tokenize each
word, a process that breaks down each loan application into the distinct words it contains. We
then use Porter’s stemming algorithm to collapse variations of a word into a single stem. For example,
“borrower,” “borrowed,” “borrowing,” and “borrowers” become “borrow”. In total, the loan
requests in our dataset have over 3.5 million words, corresponding to 30,920 unique words that
are at least 3 letters long (we excluded from our analysis numbers and symbols).2 In addition to
words/stems we also look at two-word combinations (an approach often referred to as n-gram, in
which for n = 2, we get bi-grams). To reduce the dimensionality of the textual data and avoid
more obscure or infrequent words, we focus our analyses on the most frequent stemmed words
and bi-grams that appeared in at least 400 loan requests. We are left with 1,032 bi-grams.3
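The preprocessing steps above (tokenization, Porter stemming, bi-gram extraction, and the 400-request frequency cutoff) can be sketched as follows. The paper uses R's tm package, so this Python/NLTK version is only an illustrative analogue, and the helper names are ours:

```python
# Illustrative analogue of the preprocessing described above: tokenization,
# Porter stemming, bi-grams, and a document-frequency cutoff.
import re
from collections import Counter
from nltk.stem import PorterStemmer

stemmer = PorterStemmer()

def preprocess(text):
    """Tokenize, keep alphabetic tokens of 3+ letters, and stem them."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return [stemmer.stem(t) for t in tokens if len(t) >= 3]

def bigrams(stems):
    """Adjacent stem pairs (n-grams with n = 2)."""
    return list(zip(stems, stems[1:]))

def frequent_terms(requests, min_docs=400):
    """Keep stems and bi-grams appearing in at least `min_docs` requests."""
    doc_freq = Counter()
    for text in requests:
        stems = preprocess(text)
        doc_freq.update(set(stems) | set(bigrams(stems)))
    return {term for term, df in doc_freq.items() if df >= min_docs}
```

For instance, the paper's example variants "borrower," "borrowed," "borrowing," and "borrowers" all collapse to the stem "borrow" under this pipeline.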
Textual, Financial, and Demographic Variables
Our dependent variable is loan repayment/default as reported by Prosper4 (binary: 1 =
paid in full, 0 = defaulted). Our data horizon ends in 2008, and all Prosper loans at the time were
to be repaid over three years or less, therefore we know whether each loan in our database was
repaid or defaulted—there are no other options. The set of independent variables we use includes
textual, financial, and demographic variables. We elaborate on each group next.
2 Because of stemming, words with fewer than 3 letters, such as “I,” may be kept due to longer stems (e.g., “I’ve”).
3 We checked the robustness of our analyses to increasing the number of words and bi-grams included in the analysis. Our results did not change qualitatively when we increased the number of bi-grams.
4 We classified a loan as “defaulted” if the loan status in Prosper is “Charge-off,” “Defaulted (Bankruptcy),” or “Defaulted (Delinquency).” We classified a loan as “paid” if it is labeled “Paid in full,” “Settled in full,” or “Paid.”
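The dependent variable amounts to a simple mapping over Prosper's loan status strings (those listed in footnote 4); a minimal sketch, with the function name ours:

```python
# Binary repayment label derived from Prosper's loan statuses, exactly as
# listed in footnote 4; the function name is ours.
PAID = {"Paid in full", "Settled in full", "Paid"}
DEFAULTED = {"Charge-off", "Defaulted (Bankruptcy)", "Defaulted (Delinquency)"}

def repayment_label(status):
    """1 = paid in full, 0 = defaulted."""
    if status in PAID:
        return 1
    if status in DEFAULTED:
        return 0
    raise ValueError(f"unclassified loan status: {status!r}")
```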
Textual variables. These variables include: (1) The number of characters in the title and
the textbox in the loan request. The length of the text has been associated with deception;
however, the evidence is inconclusive. Hancock et al. (2007) showed that liars wrote much more
when communicating via text messages than non-liars. Similarly, Ott, Cardie, and Hancock
(2012) demonstrated that fake hospitality reviews are wordier though less descriptive. However,
in the context of online dating websites, Toma and Hancock (2012) showed that shorter profiles
indicate the person is lying, because liars wish to avoid certain topics. (2) The percentage of
words with six or more letters. This metric is commonly used to measure complex language,
education level, and social status (Tausczik and Pennebaker 2010). More educated people are
likely to have higher income and higher levels of financial literacy and hence are less likely to
default on their loan, relative to less educated people (Nyhus and Webley 2001). But the use of
complex language can also be risky if readers of the text perceive it to be artificially or
frivolously complex. Indeed, Oppenheimer (2006) demonstrated, in the context of admission
essays, that if complex vocabulary is used superfluously, the author may face a detrimental
outcome, suggesting the elevated language was likely used deceptively. (3) The Simple Measure of
Gobbledygook (SMOG; McLaughlin, 1969), which measures writing quality by mapping it to
the number of years of formal education needed to understand the text easily on first reading. (4) A
count of spelling mistakes based on the Enchant spell checker, using the PyEnchant 1.6.6 package
in Python. Harkness (2016) shows that spelling mistakes are associated with a lower likelihood
of granting a loan in traditional channels because they serve as a proxy for characteristics
correlated with lower income. (5) The 1,032 bi-grams from the open textbox in each loan
application following the text mining process described earlier.
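As an illustration of textual variable (3), the SMOG grade maps a text to years of education via McLaughlin's (1969) formula. The sketch below uses a rough vowel-group syllable counter, not necessarily the counter the authors used:

```python
# Illustrative SMOG computation (McLaughlin 1969); the syllable counter is
# a crude vowel-group heuristic.
import math
import re

def count_syllables(word):
    """Approximate syllables as runs of vowels (including y)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def smog(text):
    """Years of formal education needed to understand the text on first reading."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    words = re.findall(r"[A-Za-z']+", text)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # McLaughlin's formula, normalized to a 30-sentence sample
    return 3.1291 + 1.0430 * math.sqrt(polysyllables * 30 / len(sentences))
```

A text with no three-syllable words scores the formula's floor of about 3.13; each additional polysyllabic word per sentence raises the grade.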
Because loan requests differ in length, and words differ in the frequency of appearance in
our corpus, we normalize the frequency of a word’s appearance in a loan request to its appearance
in the corpus and the number of words in the loan request using the term frequency–inverse
document frequency (tf-idf) measure commonly used in information retrieval. The term
frequency for word j in loan request m is defined by tf_{jm} = X_{jm}/N_m, where X_{jm} is the number
of times word j appears in loan request m, and N_m is the number of words in loan request m. This
component controls for the length of the document. The inverse document frequency is defined
by idf_j = log(D/M_j), where D is the number of loan requests and M_j is the number of loan
requests in which word j appears. This term controls for how often a word appears across
documents. Tf-idf is given by tf-idf_{jm} = tf_{jm} × (idf_j + 1). Taken together, the tf-idf statistic
provides a measure of how likely a word is to appear in a document over and beyond chance.
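The tf-idf variant defined above can be computed directly; a minimal sketch (function name and data layout are ours):

```python
# Direct implementation of the tf-idf variant defined above:
# tf = X_jm / N_m, idf_j = log(D / M_j), tf-idf = tf * (idf + 1).
import math
from collections import Counter

def tfidf(docs):
    """docs: list of token lists; returns one dict of tf-idf weights per doc."""
    D = len(docs)
    M = Counter()                      # number of documents containing each word
    for doc in docs:
        M.update(set(doc))
    weights = []
    for doc in docs:
        N = len(doc)                   # controls for document length
        counts = Counter(doc)
        weights.append({w: (x / N) * (math.log(D / M[w]) + 1)
                        for w, x in counts.items()})
    return weights
```

Note the "+1" term keeps words that appear in every document (idf = 0) from being zeroed out entirely, matching the formula in the text.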
Financial and Demographic Variables. The second type of variables we consider are
financial and demographic information, commonly used in traditional risk models. These include
all information available to lenders on Prosper—loan amount, borrower’s credit grade (modeled
as a categorical variable AA-HR), debt to income ratio, whether the borrower is a home owner,
the bank fee for payment transfers, whether the loan is a relisting of a previous unsuccessful loan
request, and whether the borrower included a picture with the loan. In order to fully account for
all the information lenders have when viewing a loan request, we extracted information included
in the borrower’s profile pictures, such as gender (Male, Female and “Cannot Tell”), age
brackets (Young, Middle-aged, Old), and race (Caucasian, African American, Asian, Hispanic,
or “Cannot Tell”) using human coders. See Web Appendix for details of the coding procedure.5
Additionally, we controlled for the geographical location of the borrower to account for
5 In addition to coding the demographics, we asked our judges to provide a score for attractiveness and trustworthiness for each borrower based on the picture (similar to Pope and Sydnor 2011). However, given the high degree of disagreement across raters, we decided not to use these measures in our analyses.
differences in the economic environment that might have affected the borrower. We grouped the
borrowers’ states of residency into eight groups based on the Bureau of Economic Analysis
classification, and added a special armed forces group for Military personnel serving overseas.
Lastly, we included the final interest rate for each loan as a predictor in our model.6 Arguably, in
a financially efficient world, this final interest rate, which was determined using a bidding
process, should reflect all the information available to lenders (including the textual
information). However, it is possible that Prosper’s bidding mechanism allows for some strategic
behavior by sophisticated lenders, thus not fully reflecting efficient market behavior (Chen,
Ghose, and Lambert 2014). Nevertheless, our models test whether the text is predictive over and
beyond the final interest rate. Table 1 presents summary statistics for the variables in our model.
*** Insert Table 1 about here ***
PREDICTING DEFAULT
Predictive Model (Stacking Ensemble)
Our objective in this section is to evaluate whether the text borrowers write in their loan
request is predictive of their loan default up to three years post origination. In order to do so, we
need to first build a strong benchmark—a powerful predictive model that includes the
traditionally used financial and demographics information and maximizes the chances of
predicting default using these variables. Second, we need to account for the fact that our model
may include a very large number of predictors (over one thousand bi-grams). In evaluating a
predictive model, it is common to compare alternative predictive models and choose the model
that best predicts the desired outcome—loan repayment in our case. From a purely predictive
6 An alternative measure would be the maximum interest rate proposed by the borrower. However, because this measure is highly correlated with the final lender rate, we include only the latter in the model.
point of view, a better approach, commonly used in machine learning, is to train several
predictive models and, rather than choosing the best model, create an ensemble, or stack, of the
different models. An ensemble of models benefits from the strength of each individual model
and, at the same time, reduces the variance of the prediction. Accordingly, for the purpose of
leveraging the textual information to predict default, we apply that approach.
The stacking ensemble algorithm includes two steps. In the first step, we train each model
on the calibration data. Because of the large number of textual variables in our model, we
employ simultaneous variable selection and model estimation in the first step. In the second
step, we build a weighting model to optimally combine the models calibrated in the first step.
We consider five types of models in the first step. The models vary in terms of the
classifier used and the approach to variable selection. The five models include two logistic
regressions and three versions of decision tree classifiers.7
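The two-step stacking procedure can be sketched with scikit-learn's StackingClassifier, which trains base models on cross-validated folds and then fits a second-stage model to weight their predictions. The base learners and synthetic data below are illustrative, not the paper's exact specification:

```python
# Structural sketch of a two-step stacking ensemble; the paper's exact base
# models, feature selection, and weighting scheme differ.
from sklearn.datasets import make_classification
from sklearn.ensemble import (ExtraTreesClassifier, RandomForestClassifier,
                              StackingClassifier)
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the textual + financial + demographic predictors
X, y = make_classification(n_samples=500, n_features=20, random_state=0)

# Step 1: train heterogeneous base classifiers
base_models = [
    ("l1", make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l1", solver="liblinear"))),
    ("l2", make_pipeline(StandardScaler(),
                         LogisticRegression(penalty="l2", solver="liblinear"))),
    ("rf", RandomForestClassifier(n_estimators=100, random_state=0)),
    ("xt", ExtraTreesClassifier(n_estimators=100, random_state=0)),
]

# Step 2: a second-stage logistic regression weights the base predictions
stack = StackingClassifier(estimators=base_models,
                           final_estimator=LogisticRegression(), cv=3)
stack.fit(X, y)
```

The second-stage model sees only out-of-fold predictions from the base classifiers, which is what lets the ensemble combine their strengths while reducing prediction variance.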
Regularized Logistic Regressions (L1 and L2 Regularization). The two logistic
regressions are L1 and L2 regularization logistic regressions. The penalized logistic regression
likelihood is:
L(Y|β, λ) = Σ_{t=1}^{n} [ y_t log p(X_t|β) + (1 − y_t) log(1 − p(X_t|β)) ] − λJ(β),
where Y = {y_1, …, y_n} is the set of binary outcome variables for n loans (loan repayment),
p(X_t|β) is the probability of repayment based on the logit model, X_t is a vector of textual,
financial, and demographic predictors for loan t, β is the vector of predictors’ coefficients, λ is a
tuning penalization parameter to be estimated using cross-validation on the calibration sample,
and J(β) is the penalization term. The L1 and L2 models differ with respect to the functional
7 We also considered a fourth type of decision tree (AdaBoost) as well as a Support Vector Machine classifier but dropped them due to poor performance on our data.
form of the penalization term, J(β). In L1, J(β) = Σ_{k=1}^{K} |β_k|, while in L2, J(β) = Σ_{k=1}^{K} β_k²,
where K is the number of predictors. Therefore, L1 is the Lasso regression penalty and L2 is the ridge
regression penalty. Whereas L1 tends to shrink many of the regression parameters to exactly zero
and leave other parameters with no shrinkage, L2 tends to shrink many parameters to small but
non-zero values. Before entering the variables into the L1 and L2 regression we standardize all
variables (Tibshirani 1997).
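The contrast between the two penalties can be illustrated on synthetic data; as the text notes, L1 drives many coefficients to exactly zero while L2 only shrinks them (with standardization before the penalized fit, as above):

```python
# Sketch contrasting L1 (lasso) and L2 (ridge) penalized logistic regression
# on synthetic data; C and the data are illustrative choices.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# 50 predictors, only 5 of which are informative
X, y = make_classification(n_samples=300, n_features=50, n_informative=5,
                           random_state=1)
X = StandardScaler().fit_transform(X)   # standardize before the penalized fit

l1 = LogisticRegression(penalty="l1", solver="liblinear", C=0.1).fit(X, y)
l2 = LogisticRegression(penalty="l2", solver="liblinear", C=0.1).fit(X, y)

# Lasso-style shrinkage: many coefficients become exactly zero
n_zero_l1 = int(np.sum(l1.coef_ == 0))
# Ridge-style shrinkage: coefficients get small but stay non-zero
n_zero_l2 = int(np.sum(l2.coef_ == 0))
```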
Tree-based Methods (Random forest and Extra Trees). There are three tree-based
methods in the ensemble. We estimate two different Random Forest models, one with variance
selection and the second with best feature selection as well as Extremely Randomized Trees
(Extra Trees). Both models combine many decision trees; thus, each of these tree-based methods is an ensemble in and of itself, with each tree grown independently rather than to correct the errors of previously grown trees. The Random Forest randomly draws, with replacement, subsets of the calibration data to fit each tree, and a random subset of features (variables) is considered in each tree. In the
Variance Selection Random Forest features are chosen based on a variance threshold determined
by cross validation (80/20 split). In the K-Best Feature Selection Random Forest features are
selected based on a χ² test. That is, we select the K features with the highest χ² scores. We use
cross-validation (80/20 split) to determine the value of K. The Random Forest approach
mitigates the problem of over-fitting in traditional decision trees. The Extra Trees model is an extension of the Random Forest in which the cut-off point (the split) for each feature in the tree is also chosen at random (from a uniform distribution) and the best split among them is chosen. Due to
the size of the feature space, we first apply a K-Best Feature Selection as described above to
select the features to be included in the Extra Trees (see Web Appendix for further details).
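The feature-selection-plus-tree-ensemble pipeline can be sketched as follows (our illustration; the count-like data and the choice of K = 10 are synthetic stand-ins, since χ² selection requires non-negative features such as bi-gram counts):

```python
import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.feature_selection import SelectKBest, chi2
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(1)
X = rng.poisson(2.0, size=(600, 50)).astype(float)  # bi-gram-count stand-ins
y = (X[:, 0] + X[:, 1] + rng.normal(size=600) > 4).astype(int)

# 80/20 split of the calibration data, as in the text
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.2, random_state=0)

for name, tree in [("random forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                   ("extra trees", ExtraTreesClassifier(n_estimators=100, random_state=0))]:
    # chi2 requires non-negative features; K = 10 is an arbitrary stand-in
    model = make_pipeline(SelectKBest(chi2, k=10), tree).fit(X_tr, y_tr)
    acc = model.score(X_val, y_val)
    print(name, round(acc, 3))
```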
We used the scikit-learn package in Python (http://scikit-learn.org/) to implement the five
classifiers on a random sample of 80% of the calibration data. For the logistic regressions, we
estimated the 𝜆 penalization parameter by grid search using a 3-fold cross validation on the
calibration sample. For the tree-based methods, to limit over-fitting of the trees, we randomized
the parameter optimization (Bergstra and Bengio 2012) using a 3-fold cross validation on the
calibration data to determine the structure of the tree (e.g., number of leaves, number of splits, depth of the tree, and splitting criteria). We use a randomized parameter optimization rather than an exhaustive
search (or a grid search) due to the large number of variables in our model. The parameters are
sampled from a distribution (uniform) over all possible parameter values.
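The randomized search can be sketched as follows (our illustration; the parameter ranges are hypothetical, not the ones used in the paper):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 15))
y = (X[:, 0] > 0).astype(int)

# Sample 10 random settings instead of exhausting the full grid
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "max_depth": list(range(2, 12)),         # depth of the tree
        "min_samples_leaf": list(range(1, 10)),  # governs the number of leaves
        "max_features": list(range(1, 15)),      # features considered per split
    },
    n_iter=10,
    cv=3,  # 3-fold cross-validation, as in the text
    random_state=0,
).fit(X, y)
print(search.best_params_, round(search.best_score_, 3))
```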
Model Stacking and Predictions. In the second step, we estimate the weights for each
model to combine the ensemble of models using the remaining 20% of the calibration data. We
use a simple binary logistic model to combine the different predictive models. Though other classifiers may be used, a binary logistic regression meta-classifier helps avoid overfitting and often results in superior performance (Whalen and Gaurav 2013). In our binary logistic
regression model, repayment is the dependent variable, and the probabilities of repayment for each loan from each of the five models in the ensemble from step 1 (the two regularized logistic regressions and the three decision-tree methods) serve as predictors. The estimated parameters of the
logistic regression provide the weights of each individual model in the ensemble. Specifically, the
ensemble repayment probability for loan j can be written as:
p(repayment_j) = exp(x_j′w) / (1 + exp(x_j′w)),
where x_j is the vector of repayment probabilities from each model s — p(repayment_j | model_s) — from step 1, and w is the vector of estimated weights of each model in the logistic regression classifier.
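The two-step stacking procedure can be sketched as follows (our illustration on synthetic data, with three stand-in base models rather than the paper's five):

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(3)
X = rng.normal(size=(800, 10))
y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=800) > 0).astype(int)

# Step 1: fit base models on 80% of the calibration data
X80, X20, y80, y20 = train_test_split(X, y, test_size=0.2, random_state=0)
base = [LogisticRegression(penalty="l2", max_iter=1000),
        LogisticRegression(penalty="l1", solver="liblinear"),
        RandomForestClassifier(n_estimators=50, random_state=0)]
for m in base:
    m.fit(X80, y80)

# Step 2: each model's repayment probability on the held-out 20% becomes a
# predictor in a logistic meta-classifier; its coefficients are the weights w
Z = np.column_stack([m.predict_proba(X20)[:, 1] for m in base])
meta = LogisticRegression(max_iter=1000).fit(Z, y20)
print(meta.coef_.round(3))
```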
We estimated an ensemble of the aforementioned five models. We find the following
weights for the different models: L1 = 0.040, L2 = 0.560, Random Forest K Best = 0.218, Random
Forest Variance Select = 0.116, and Extra Trees = 0.066. Thus, for our application we find that
the L2 model contributes most to the ensemble. We also estimated five predictive models, each
based on one specific model in the ensemble (the L1 and L2 regularization logistic regressions, the
two Random Forest models and the Extra Trees model). We find that for each of the individual
predictive models the textual information significantly improves predictions in the validation
sample over and beyond the financial and demographic variables. However, the stacking ensemble model
had the best predictive ability (see Table A2 in the Web Appendix for the predictive ability of each individual model). Accordingly, we conclude that it is the textual information, rather than the specific machine learning predictive model used, that is responsible for the improved predictive ability of the models that include the textual information (Banko and Brill 2001). Next, we
describe the predictions based on the ensemble.
To test whether the text borrowers wrote in their loan requests is predictive of future
default, we use a 10-fold cross validation. We randomly split the loans into 10 equally sized
groups, calibrate the ensemble algorithm on nine groups and predict the remaining group. To
evaluate statistical significance, we repeated the 10-fold cross validation 10 times, using different
random seeds at each iteration. By cycling through the 10 groups and averaging the prediction
results across the 10 folds and 10 replications, we obtain a robust measure based on 100 prediction runs.
Because there is no obvious cut-off for a probability from which one should consider the loan as
defaulted, we use the “area under the curve” (AUC) of the Receiver Operating Characteristic
(ROC) curve, a commonly used measure for prediction accuracy of binary outcomes. We further
report the Jaccard index (e.g., Netzer et al. 2012) of loan default, which is defined as the number of correctly predicted defaulting loans divided by the sum of defaulted loans we failed to predict, loans that were predicted to default but were repaid, and correctly
predicted defaulted loans. This gives us an intuitive measure of the hit rate for defaulting loans, penalized for both type I and type II errors. Finally, building on research
that shows FICO scores have a lower predictive power of financial behavior for people with low
scores (Avery et al. 2000), we report predictions for high (AA, A), medium (B, C), and low (D,
E, HR) credit grades (while controlling for credit grades within each group).
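As a toy illustration (ours, with default coded as 1), the two accuracy measures can be computed as follows; a cut-off is needed only for the Jaccard index, not for the AUC:

```python
import numpy as np
from sklearn.metrics import jaccard_score, roc_auc_score

y_default = np.array([1, 1, 1, 0, 0, 0, 0, 1])          # actual defaults
p_default = np.array([.9, .8, .4, .3, .2, .6, .1, .7])  # predicted default probabilities

auc = roc_auc_score(y_default, p_default)      # threshold-free
pred = (p_default >= 0.5).astype(int)          # cut-off needed only for Jaccard
# Jaccard = hits / (hits + missed defaults + false alarms)
jac = jaccard_score(y_default, pred, pos_label=1)
print(round(auc, 3), round(jac, 3))  # → 0.938 0.6
```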
We compare three versions of the ensemble: (1) a model calibrated only on the financial
and demographic data; (2) a model that includes just the textual information and ignores the
financial and demographic information, and (3) a model that includes financial and demographic
information together with the textual data. Comparing models (1) and (3) provides the
incremental predictive power of the textual information over predictors commonly used in the
financial industry. Comparing models (1) and (2) informs the degree of predictive information
contained in the textual information relative to the financial and demographic information.
Prediction Results
Table 2 details the average results of the 10-fold cross validation across 10 random
shuffling of the observations. The results we present are clean out-of-sample validation because
in each fold we calibrate feature selection, model estimates, and the ensemble weights on 90% of
the data and leave the remaining 10% of the data for validation. Table 2 presents the Area Under
the ROC Curve (or AUC) and the Jaccard Index prediction measures and Figure 1 shows the
AUC graphically. Specifically, Figure 1 depicts the average ROC curve with and without the
textual data for one randomly chosen 10-fold cross validation. A better predictive model is the
one with a ROC curve that is closer to the upper left corner of the graph. The AUC of the model
with textual, financial, and demographic information is 2.89% better than the AUC of the model with only financial and demographic information. This difference is statistically significant. In
fact, the model with textual, financial, and demographic information has a higher AUC in
all 100 replications of the cross-validation exercise. Breaking the sample by credit grade, we note
that the textual information significantly improves predictions across all credit grade levels.
However, the textual information is particularly useful in improving default predictions for
borrowers with low credit levels. This result is consistent with Avery et al. (2000) who find
FICO scores to be least telling of people’s true financial situation for those with low scores.
Interestingly, if we were to ignore the financial and demographic information and use
only the borrowers’ textual information, we obtain an AUC of 66.68%, compared with an AUC of 70.52% for the model with only financial and demographic information. That is, the brief, unverifiable, “cheap talk” (Farrell and Rabin 1996) text provided by borrowers is
nearly as predictive as the traditional financial and demographic information. This result is
particularly impressive given the tremendous effort and expenditure involved in collecting the
financial information relative to the simple method used to collect the textual information. This
result may also suggest that textual information may be particularly useful in “thin file”
situations, where the financial information about consumers is sparse (Coyle 2017).
*** Insert Table 2 and Figure 1 about here ***
We conducted a back-of-the-envelope calculation to quantify the managerial relevance and
financial implications of the improvement in predictive ability offered by the textual data. For each
of the 18,312 granted loans we calculated the expected profit from investing $1,000 in each loan
based on the models with and without text. In calculating the expected profit, we assume that
borrowers who default repay on average 25% of the loan before defaulting (based on estimates
published by crowdfunding consulting agencies). The expected profits for loan j is:
E(profit_j) = [1 − Prob(repayment_j)] × (−0.75 × amount_granted_j + 0.25 × Interest_earned_j) + Prob(repayment_j) × Interest_earned_j,
where Prob(repayment_j) is the probability of repayment of loan j based on the corresponding model (with or without the text), amount_granted_j is the principal amount the lender grants for loan j ($1,000 in our case), and Interest_earned_j is the interest paid to the lender based on loan j’s final interest rate, over three years for repaid loans and over three quarters of a year for
defaulted loans (for simplicity we do not time-discount payments in years 2 and 3). For each of
the two policies (based on the model with and without text) we sort the loans based on their
expected profit and select the top 1,000 loans with the highest expected return for each policy.
Finally, we calculate the actual profitability of each lending policy based on the actual default of
each loan in the data to calculate the return on the investment on the million dollars (1,000 loans
times $1,000 per loan). We find that the investment policy based on the model with the textual data
returns $96,528 more than the policy based on the financial and demographics information only.
This is an increase of 9.65% in the return on investment (ROI) on the million dollars invested.
Thus, while the improvement in default prediction from adding the textual information might seem modest (nearly 3%), albeit statistically significant, the resulting improvement in ROI is substantial and economically meaningful.
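The expected-profit calculation and the loan-selection policy can be sketched as follows (our illustration; repayment probabilities and interest amounts are simulated, not the paper's):

```python
import numpy as np

rng = np.random.default_rng(4)
n = 10_000
p_repay = rng.uniform(0.5, 1.0, n)    # model-based repayment probabilities
amount = 1_000.0                      # principal invested per loan
interest = rng.uniform(50, 300, n)    # interest earned if repaid (3 years)

# E(profit_j) = (1 - p_j) * (-0.75*amount + 0.25*interest_j) + p_j * interest_j
e_profit = (1 - p_repay) * (-0.75 * amount + 0.25 * interest) + p_repay * interest

# Lending policy: pick the 1,000 loans with the highest expected profit
top = np.argsort(e_profit)[::-1][:1_000]
print(round(float(e_profit[top].mean()), 2))
```

Actual profitability would then be computed by substituting each selected loan's realized default outcome for its predicted probability.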
To summarize, the text borrowers write in their loan request can significantly help predict
loan default even when accounting for financial and demographic measures. The ensemble-based
predictive model was chosen to maximize predictive ability, but it provides little to no
interpretation of the parameter estimates, words, and topics that predict default. Therefore, in the
second part of this paper we present a series of analyses that shed light on the words and writing
styles that were most likely to appear in defaulting loans. These analyses also demonstrate how
machine learning approaches can be used beyond prediction and toward understanding which factors drive the improvement in prediction (Hofman, Sharma, and Watts 2017).
WORDS, TOPICS, AND WRITING STYLES THAT ARE ASSOCIATED WITH DEFAULT
The result that text has a predictive ability similar in magnitude to the predictive ability of
financial and demographic information is perhaps surprising. However, this result is consistent
with the idea that people who differ in the way they think and feel also differ in what they say and
write about those thoughts and feelings (Fast and Funder 2008; Hirsh and Peterson 2009). We
employed four approaches to uncover whether words, topics, and writing styles of defaulters
differ from those who repaid their loan. (1) We use a naïve Bayes classifier to identify the words
or bi-grams that most distinguish defaulted from fully-paid loans. The advantage of the naïve
Bayes is in providing intuitive interpretation of the words that are most discriminative between
defaulted and repaid loans; however, its disadvantage is that it assumes independence across
predictors and therefore cannot control for the financial and demographics variables (or for the
dependence among the textual variables). (2) To alleviate this concern, we use a logistic
regression with L1 penalization to uncover the words and bi-grams that are associated with
default after controlling for the financial and demographic information. The L1 regression results
corroborate the naïve Bayes findings (the correlation between the results of the two analyses is 0.582, p < 0.01; see details of the L1 regression in Table A3 in the Web Appendix). (3) To look beyond
specific words or bi-grams and into the topics discussed in each loan and writing styles employed,
we use a latent Dirichlet allocation (LDA) analysis. (4) Finally, relying on a well-known
dictionary, the Linguistic Inquiry and Word Count (LIWC; Tausczik and Pennebaker 2010), we
identify the writing styles that are most correlated with defaulting or repaying the loan.
Words that Distinguish between Loan Requests of Paying and Defaulting Borrowers
To investigate which words in the loan application most discriminate between borrowers
who default and borrowers who repay the loan in full, we ran a multinomial naïve Bayes classifier
using the Python scikit-learn package on bi-grams (all possible words and pairs of words) that
appeared in at least 400 loans (1,032 bi-grams). The classifier uses Bayes rule and the assumption
of independence among words to estimate each word’s likelihood of appearing in defaulted and
paid loans. We then calculate the most “informative” bi-grams in terms of discriminating between
defaulted and repaid loans by calculating the bi-grams with the highest ratio of P(bi-
gram|defaulted)/P(bi-gram|repaid) and the highest ratio of P(bi-gram|repaid)/P(bi-gram|defaulted).
Table A4 in the Web Appendix presents the lists of words and their likelihood of
appearing in repaid versus defaulted loan requests (Table A4a) or defaulted versus repaid loan
requests (Table A4b). Figures 2 and 3 present word clouds of the naïve Bayes analysis of bi-
grams in their stemmed form. The size of each bi-gram in Figures 2 and 3 corresponds to the
likelihood that the bi-gram will be included in a repaid loan request versus a defaulted loan
request and in a defaulted loan request versus a repaid loan request, respectively. For example,
the word “reinvest” in Figure 2 is 4.8 times more likely to appear in a fully paid than a defaulted
loan request, while the word “god” is 2.0 times more likely to appear in a defaulted than a fully
paid loan request. The central cloud in each figure presents the most discriminant bi-grams
(cutoff ratio = 1.5) and the satellite clouds represent emerging themes based on our grouping of
these words.
*** Insert Figures 2, 3 around here ***
Several insights can be gained from this analysis. Relative to defaulters, borrowers who
paid in full were more likely to include in their loan application: (i) Words associated with their
financial situation such as “reinvest,” “interest,” and “tax;” (ii) Words that may be a sign of
projected improvement in financial ability: “graduate,” “wedding,” and “promote;” (iii) Relative
words such as “side,” “rather,” and “more than;” (iv) Long-term time related words such as
“future,” “every month,” and “few years;” (v) “I” words such as “I’d,” “I’ll,” “I’m.” The above
indicates that borrowers who paid in full may have nothing to hide, have a brighter future ahead
of them, and generally are truthful. The latter insight is based on research showing the use of
relative and time words as well as first person “I” words is associated with greater candor
because honest stories are usually more complex and personal (Newman et al. 2003). Dishonest
stories, on the other hand, are simpler, allowing the lying storyteller to conserve cognitive
resources in order to focus on the lie more easily (Tausczik and Pennebaker 2010). Borrowers
who repaid their loan also used words that indicate their financial literacy (e.g., “reinvest,” “after
tax,” and “minimum payment”). Indeed, higher financial literacy has been associated with lower
debt (Brown et al. 2015).
Turning to Figure 3, not surprisingly, borrowers who defaulted were more likely to
mention words related to (i) Financial hardships (“payday loan,” “child support,” and
“refinance,”) and general hardship (“stress,” “divorce,” and “very hard”). This result is in line
with Herzenstein, Sonenshein, and Dholakia (2011) who found that discussing personal hardship
in the loan application is associated with borrowers who are late on their loan payments. (ii)
Explaining their situation (“loan explain,” “explain why,”) and discussing their work state (“hard
work,” “worker”). Providing explanations is often connected to past deviant behavior
(Sonenshein, Herzenstein, and Dholakia 2011). (iii) Appreciative and good-manner words toward
lenders (“god bless,” “hello”) and pleading lenders for help (“need help,” “please help”). Why is
polite language more likely to appear in defaulted loan requests? One possibility is that it is not
authentic. Indeed, a recent study shows that rude people are more trustworthy because their
reactions seem more authentic (Feldman et al. 2017). (iv) Referring to external sources such as
“god,” “son,” “someone.” The strong reference to others has been shown to exist in deceptive
language style. Liars tend to avoid mentioning themselves, perhaps to distance themselves from
the lie (Hancock et al. 2007; Newman et al. 2003). With respect to the frequent mention of God in
defaulting loans, Kupor, Laurin, and Levav (2015) find that reminders of God can increase the
likelihood of people engaging in riskier behaviors, alluding to the possibility these borrowers took
a loan they were unable to repay. (v) Time related words (“total monthly,” “day,”) and future
tense words (“would use,” “will able”). While both paying and defaulting borrowers use time
related words, defaulters seem to focus on the shorter term (a month) while repayers on the longer
term (a year). This result is consistent with Lynch et al. (2010), who showed that long-term
planning (as opposed to short-term planning) was associated with lower procrastination and a higher
degree of assignment completion. Further, the degree of long-term planning was associated with
FICO scores. The mention of shorter horizon time words by defaulters is also consistent with
Shah, Mullainathan, and Shafir (2012) who find that financial resource scarcity leads people to
shift their attention to the near future, neglecting the distant future, hence leading to over-
borrowing. The above words suggest that defaulting borrowers attempted to garner empathy,
seem forthcoming and appreciative, but when the time to repay the loan came they were unable to
escape their reality. We note that the themes most prevalent in defaulting loan requests construct a
narrative that is very similar to the “Nigerian email scam”, as described on the Federal Trade
Commission’s website: “Nigerian email scams are characterized by convincing sob stories,
unfailingly polite language, and promises of a big payoff.”8
While the naïve Bayes analysis is informative with respect to identifying words that are
associated with loan default, for practical use we may wish to uncover the words and financial
or demographic variables that are most predictive of default. To that end, we analyzed the
8 We thank Professor Gil Appel for this neat observation.
variables with the highest importance in predicting repayment based on the Random Forest
model used in the ensemble learning model (see Table A5 in the Web Appendix for the list of
120 variables that are most predictive of default). Not surprisingly, we find that the financial
variables such as lender rate and credit score are most predictive of default. In terms of the words
that are most predictive of default in the Random Forest, we find a high degree of agreement between this analysis and the naïve Bayes analysis. For example, the words “payday loan,” “invest,” “hard,” “thank you,” “explain,” and “student” were informative in both analyses.
The Relationship between Words that are Associated with Loan Default and Words that are
Associated with Loan Funding
In assessing which words are associated with default, we focused only on funded loans.
However, it is possible that some borrowers strategically use some of these words to convince
lenders to fund their loans. The purpose of the following analysis is to examine whether lenders
are aware of such strategic behavior, and if not, which words slipped through lenders’ defenses
and got them to fund overly-risky loans that eventually defaulted. For example, is the common
use of polite language such as “thank you,” or “hello” in defaulted loans intended to convince
lenders to fund them? To investigate the relationship between words that were associated with
loan default and words that were related with loan funding, we ran a naïve Bayes analysis on the
entire set of loan requests (funded and unfunded requests, 122,479 loan requests in total),
assessing the bi-grams with highest ratio of P(bi-gram|funded)/P(bi-gram|unfunded) and the
highest ratio of P(bi-gram|unfunded)/P(bi-gram|funded). See Table A6 in the appendix for
summary statistics of this dataset and Table A7 for the naïve Bayes analysis results.
Figure 4 depicts for each bi-gram its value on the ratio P(bi-gram|defaulted)/P(bi-
gram|repaid) versus its value on the ratio P(bi-gram|unfunded)/P(bi-gram|funded). We named a
few representative bi-grams. A high correlation between the two ratios (P(bi-gram|defaulted)/P(bi-gram|repaid) and P(bi-gram|unfunded)/P(bi-gram|funded)) means that
lenders are largely aware of the words that are associated with default. Results show a fairly
strong correlation between the two ratios (r = 0.446, p < 0.01), suggesting that lenders are at least
somewhat rational when interpreting and incorporating the text in their funding decisions.
Examining the results more carefully, we find roughly three types of words: (1) words that have
high likelihood of default and low likelihood of funding or low likelihood of default and high
likelihood of funding (e.g., “chance,” and “bonus;”) (2) words that have a stronger impact on
loan funding than on loan default (e.g., words related to explanation, such as “loan explain,” and
“situation explain,”) and (3) words that “tricked” lenders into funding loans that eventually defaulted (words that were related to default more than to loan funding). The most typical words
in this category are “god” and “god bless”.
*** Insert Figure 4 around here ***
Analyzing the Topics Discussed in Each Loan Request and Their Relationship to Default
The Naïve Bayes analysis allowed meaningful insights into the discriminative power of
specific words or bi-grams between repaid and defaulted loans. In Figures 2 and 3 we grouped
the individual words into topics based on our interpretation and judgment. However, several
machine learning methods have been proposed to statistically combine words into topics based
on their common co-occurrence in documents. Probably the most commonly used topic modeling
approach is the latent Dirichlet allocation analysis (LDA; Blei, Ng, and Jordan 2003). We apply
the LDA approach on the complete dataset of all loan requests (122,479).
We use the online variational inference algorithm for the LDA training (Hoffman, Bach,
and Blei 2010), following the settings and priors of Griffiths and Steyvers (2004). We used the 5,000
word stems that appeared most frequently across loan requests. Eliminating infrequent words
mitigates the risk that rare-word occurrences and co-occurrences confound the topics. Because
the LDA analysis requires the researcher to determine the number of topics to be analyzed, we
varied the number of topics between two and 30, and used model fit (the perplexity measure), to
determine the final number of topics. We find that the model with seven topics had the best fit
(lowest perplexity). Table 3 presents the seven topics and the most representative words for each
topic based on the relevance score of 0.5 (see Sievert and Shirley (2014) for details of the
relevance score), Table A8 in the Web Appendix has a more detailed list of the top 30 words for
each topic, and Figure A1 in the Web Appendix presents the perplexity analysis.
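The topic-number selection by perplexity can be sketched as follows (our illustration on a toy corpus; scikit-learn's online variational LDA stands in for the authors' implementation, and we scan only three topic counts for brevity):

```python
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.feature_extraction.text import CountVectorizer

docs = ["pay off credit card debt high interest",
        "consolidate debt lower interest rate",
        "start small business buy equipment",
        "business loan real estate investment",
        "family medical bills need help",
        "wedding expenses family celebration"] * 20  # repeat to stabilize the fit

X = CountVectorizer().fit_transform(docs)
perplexities = {}
for k in (2, 3, 5):  # the paper scans 2 to 30 topics
    lda = LatentDirichletAllocation(n_components=k, learning_method="online",
                                    random_state=0).fit(X)
    perplexities[k] = lda.perplexity(X)  # lower perplexity = better fit
print({k: round(v, 1) for k, v in perplexities.items()})
```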
The topics we identify relate to the reason to request the loan, life circumstances, or
writing style. We find three loan purpose topics: Employment and School, Interest Rate
Consolidation, and Business and Real Estate. The other four topics are related to life
circumstances and writing style: Expenses Explanation, Family, Loan Details Explanation, and
Monthly Expenses. The Monthly Expenses topic is most likely related to a set of expenses
Prosper recommended borrowers mention as part of their loan request during our data period.
To relate the topics mentioned in each loan request’s text to the likelihood of default, we ran a binary logistic regression with loan repayment = 1 and default = 0 as the dependent variable. As predictors, we used the probability of each topic appearing in the loan based on our LDA analysis (the topic Loan Details Explanation serves as the benchmark), the same set of textual information metrics, and the financial and demographic variables used in the ensemble learning model described earlier. Table 4 presents the results of the binary logistic regression with the LDA topics. Before
we discuss the results related to the LDA topics we note that our controls, the financial and
demographic variables, are significant and in the expected direction. That is, repayment
likelihood increases as credit grades improve, but decreases with debt-to-income ratio, home
ownership, loan amount, and lender rate. The strong relationship between lender rate and
repayment suggests some level of efficiency among Prosper lenders.
*** Insert Tables 3, 4 around here ***
Relative to the topic Loan Details Explanation, we find that the topics Employment and School, Interest Rate Consolidation, and Monthly Expenses are more likely to appear in repaid loan
requests. The Family topic, on the other hand, was less likely to appear in repaid loans.
Consistent with the naïve Bayes analysis we find that the topic of explaining one’s financials and
loan motives, and referring to family are associated with lower repayment likelihood. We also
find that tendency to detail the monthly expenses and financials is associated with higher
likelihood of repayment, perhaps because providing such information is indeed truthful and
forthcoming (having nothing to hide). Finally, although not the purpose of the LDA analysis, we
find that the binary logistic model that includes the textual information via LDA in addition to
the demographic and financial information fits the data and predicts default better than a model
that does not include the LDA topic probabilities (see Web Appendix for details).
In sum, multiple methods converged to uncover themes and words that differentiate
defaulted from paid loan requests. Next, using a well-researched dictionary we explore whether
borrowers’ writing styles can shed light on the traits and states of defaulted borrowers.
Circumstances and Personalities of Those Who Defaulted
In this section we rely on one of the more researched and established text analysis
methods, the Linguistic Inquiry and Word Count (LIWC) dictionary. This dictionary groups
almost 4,500 words into 64 linguistic and psychologically meaningful categories such as tenses
(past, present, future), pronoun forms (I, we, you, she or he), and social, positive-emotion, and negative-emotion words.
Since its release in 2001, many researchers have examined and employed it in their research (see
Tausczik and Pennebaker (2010) for a comprehensive overview). The result of almost two
decades of research is lists of word categories that represent the writing style of people with
different personalities (Kosinski, Stillwell, and Graepel 2013; Pennebaker and King 1999;
Schwartz et al. 2013; Yarkoni 2010), mental health states (Preotiuc-Pietro et al. 2015), emotional
states (Pennebaker, Mayne, and Francis 1997), as well as many other traits and states.
We chose LIWC over other dictionaries because it is context-free. Other dictionaries
(e.g., DICTION and Loughran and McDonald (2011)) have been used to interpret financial
related text, such as quarterly and annual financial reports. However, because we are interested
in individuals, rather than firms, describing their finances these dictionaries are less appropriate
for our data. Another recent dictionary, developed by Schwartz et al. (2013), is also inadequate
for our purposes because it is based on Facebook (and later used successfully on Twitter;
Preoţiuc-Pietro et al. 2015) and thus includes many items related to partying, drinking, and
swearing—words that rarely appear in the text we analyze.
LIWC is composed of sub-dictionaries. The same word can appear in several sub-
dictionaries. We first calculate the proportion of stemmed words in each loan request that belong
to each of the 64 dictionaries.9 We then estimated a binary logit model (load repaid = 1 and loan
defaulted = 0) to relate the proportions of words in each loan that appear in each dictionary to
whether the loan was repaid, controlling for all financial and demographic variables used in our
previous analyses. Results are presented in Table 5.
*** Insert Table 5 around here ***
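The dictionary-proportion step can be sketched as follows (our illustration; the two-category mini-dictionary and the labels are made up, as LIWC itself is proprietary):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical two-category mini-dictionary (LIWC has 64 such categories)
dictionaries = {"social": {"mother", "father", "we", "they", "family"},
                "money":  {"interest", "tax", "payment", "reinvest"}}

def category_proportions(text):
    """Proportion of a text's words that fall in each dictionary category."""
    words = text.lower().split()
    return [sum(w in cat for w in words) / len(words)
            for cat in dictionaries.values()]

texts = ["we need help for our family and mother",
         "reinvest the interest after tax each payment cycle",
         "father and mother and family we they",
         "tax payment interest reinvest plan"]
y = [0, 1, 0, 1]  # 1 = loan repaid (hypothetical labels)

X = np.array([category_proportions(t) for t in texts])
logit = LogisticRegression().fit(X, y)  # proportions as predictors of repayment
print(logit.coef_.round(2))
```

In the paper's analysis, the financial and demographic variables would enter this regression as additional controls.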
We begin by noting that all financial and demographic control variables are in the
9 For this analysis we did not remove words with fewer than three characters or infrequent words, as we are matching words to pre-defined dictionaries.
expected direction and are consistent with our LDA results. Fourteen of the sixty-four LIWC
dictionaries were significantly related to repayment behavior. Several of them corroborate our
results from the naïve Bayes and LDA analyses. To interpret our findings, we rely on previous
research that leveraged the LIWC dictionary, which allows us to conclude that defaulted loan
requests contain words that are associated with the writing style of liars and of extroverts.
We begin with deception. Looking at Table 5, we see that the following LIWC sub-
dictionaries that have been shown to be associated with greater likelihood of deception are associated in our analysis with greater likelihood of default: (1) present and future tense words.
This result is similar to our naïve Bayes findings. Indeed, past research shows liars are more
likely to use present and future tense words because they represent unconcluded situations
(Pasupathi 2007); (2) higher use of motion words (e.g., “drive,” “go,” and “run,”) which have
been associated with lower cognitive complexity (Newman et al. 2003), and lower use of relative
words (e.g., “closer,” “higher,” and “older,”) which are associated with higher complexity
(Pennebaker and King 1999). Deceptive language has been shown to include more motion words
and fewer relative words because it is less complex in nature; (3) Similar to our finding from the
naïve Bayes analysis that defaulters tend to refer to others, we find that social words (e.g.,
“mother,” “father,” “he,” “she,” “we,” and “they”) are associated with higher likelihood of
default. Along these lines, Hancock et al. (2007) showed that linguistic writing style of liars is
reflected by lower use of first person singular and higher use of first person plural such as “we”
(See also Bond and Lee 2005; Newman et al. 2003). We note that in the context of hotel reviews,
Ott, Cardie, and Hancock (2012) find higher use of “I” in fake reviews possibly because they did
not have much to write about the hotel itself (because they have never been there) so they
described their own activities. Our Naïve Bayes finding shows more “I” words in repaid loans;
(4) time words (e.g., “January,” “Sunday,” “morning,” and “never”) and space words (e.g.,
“above,” “inch,” and “north”) were associated with a higher likelihood of default. These words
have been found to be prevalent in deceptive statements written by prisoners because the use of
such words seems to draw attention away from the self (Bond and Lee 2005).
Taken together, we find that several of the LIWC dictionaries previously found to be
associated with deception are also negatively associated with loan repayment (positively
associated with loan default). But do borrowers intentionally lie to lenders? One possibility is
that people can predict, with some accuracy, future default months or even years ahead. If so,
this would support the idea that defaulters write either intentionally or unconsciously to
deceive lenders. However, there is another option: borrowers on Prosper.com might genuinely
believe that they will be able to pay back the borrowed money in full. Indeed, past research
suggests that people are often optimistic about future outcomes (Weinstein 1980). What they
may be hiding from lenders is the severity of their difficult situations and circumstances.
Our second observation is that the sub-dictionaries associated with the writing style of
extroverts are also associated with a greater likelihood of default. Extroverts have been shown to
use more religious and body-related words (e.g., “mouth,” “rib,” “sweat,” and “naked;” Yarkoni
2010), social and human words (e.g., “adults,” “boy,” and “female;” Hirsh and Peterson 2009;
Pennebaker and King 1999; Schwartz et al. 2013; Yarkoni 2010), motion words (e.g., “drive,”
“go,” and “run;” Schwartz et al. 2013), and achievement words (e.g., “able,” “accomplish,” and
“master”), and fewer filler words (e.g., “blah” and “like;” Mairesse et al. 2007)—all of which are
significantly related to a greater likelihood of default in our analysis (see Table 5).
The finding that defaulters are more likely to exhibit the writing style of extroverts is
consistent with research showing that extroverts are more likely to take risks (Nicholson et al.
2005), engage in compulsive buying of lottery tickets (Balabanis 2002), and are less likely to
save (Brandstätter 2005; Nyhus and Webley 2001). Moreover, it is not a coincidence that the
fourteen LIWC dictionaries that were significantly correlated with default are correlated with
both extroversion and deception. Past literature has consistently documented that extroverts are
more likely to lie, not only because they talk to more people but also because these lies
help smooth their interactions with others (Weiss and Feldman 2006).
We did not find a consistent and conclusive relationship between the LIWC dictionaries
associated with each of the other Big Five personality traits and loan repayment. Similarly, results
from other research on the relationship between LIWC and gender, age, and mental and emotional
states did not consistently relate to default in our study. Finally, we acknowledge that there may
be variables, confounded with both the observable text and unobservable personality traits
or states, that account for the repayment behavior. Nevertheless, from a predictive point of
view, we find that the model that includes the LIWC dictionaries fits the data and predicts default
better than a model that does not include the textual information (see Web Appendix for details).
GENERAL DISCUSSION
The words we write matter. Aggregated text has been shown to predict market trends
(Bollen, Mao, and Zeng 2011) and stock market behavior (Tirunillai and Tellis 2012), market
structure (Netzer et al. 2012), the virality of news articles (Berger and Milkman 2012), prices of
services (Jurafsky et al. 2014), and political elections (Tumasjan et al. 2010). At the individual
text-writer level, text has been used to evaluate the state of mind of writers (Ventrella 2011), to
identify liars (Newman et al. 2003) and fake reviews (Ott, Cardie, and Hancock 2012), to assess
personality traits (Schwartz et al. 2013; Yarkoni 2010), and to assess the mental state of those
who tweet (Preotiuc-Pietro et al. 2015). In this paper, we show that text has the ability to predict the financial
behavior of its writer in the distant future with significant accuracy.
Using data from an online crowdfunding platform we show that incorporating the text
borrowers write in their loan application into traditional models that predict loan default based on
financial and demographic information about the borrower significantly and substantially
increases their predictive ability. Using machine learning methods such as naïve Bayes analysis,
regularized regression, and LDA analysis, we uncover the words and topics borrowers often
include in their loan request. We find that at loan origination, defaulters used simple but wordier
language, wrote about hardship, further explained their situation and why they need the loan, and
tended to refer to other sources such as their family, God, and chance. Building on past research
and the commonly used LIWC dictionary we infer that defaulting borrowers write similarly to
people who are extroverts and to those who lie. These results were obtained after controlling for
the borrower’s credit grade, which should capture the financial implications of the borrower’s
life circumstances, and the interest rate given to the borrower, which should capture differences
in the risk of different types of loans and borrowers. Simply put, we show that borrowers,
consciously or not, leave traces of their intentions, circumstances, and personality in the text they
write when applying for a loan—a sort of online “involuntary sweat”.
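The word-level naïve Bayes classification summarized above can be illustrated with a minimal sketch. This is a toy reconstruction in pure Python, not the authors' implementation; the training examples and token lists are hypothetical, chosen only to echo the reported word patterns.

```python
import math
from collections import Counter, defaultdict

def train_nb(docs):
    """docs: list of (tokens, label). Returns per-class word counts,
    class counts, and the shared vocabulary."""
    word_counts = defaultdict(Counter)
    class_counts = Counter()
    vocab = set()
    for tokens, label in docs:
        class_counts[label] += 1
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return word_counts, class_counts, vocab

def predict_nb(tokens, word_counts, class_counts, vocab):
    """Pick the class with the highest posterior under Laplace smoothing."""
    n_docs = sum(class_counts.values())
    best_label, best_lp = None, -math.inf
    for label in class_counts:
        lp = math.log(class_counts[label] / n_docs)  # log prior
        total = sum(word_counts[label].values())
        for w in tokens:
            if w in vocab:  # score only in-vocabulary words
                lp += math.log((word_counts[label][w] + 1) / (total + len(vocab)))
        if lp > best_lp:
            best_label, best_lp = label, lp
    return best_label

# Hypothetical toy training set echoing the paper's finding: hardship and
# religious words in defaulted requests, interest-rate words in repaid ones.
toy_docs = [
    (["god", "help", "hardship", "please"], "default"),
    (["interest", "rate", "consolidate", "debt"], "repaid"),
]
wc, cc, vocab = train_nb(toy_docs)
```

In practice the classifier would be trained on the full corpus of loan descriptions rather than two toy documents, but the smoothed per-word likelihoods are the same mechanism.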
Theoretical and Practical Contribution
Our research makes the following theoretical and practical contributions. First, our work
contributes to the recent but growing marketing literature on uncovering behavioral insights about
consumers from the text they write and the traces they leave on social media (Humphreys and Jen-
Hui Wang 2018; Matz and Netzer 2017). We demonstrate that the text consumers write at loan
origination is indicative of their states and traits, and predictive of their future repayment behavior.
In an environment characterized by high uncertainty and high stakes, we find that verifiable and
unverifiable data have similar predictive ability. While borrowers can truly write whatever they
wish in the textbox of the loan application—supposedly “cheap talk”—their word usage is
predictive of future repayment behavior at a scale similar to that of their financial and demographic
information. This finding implies that, whether intentionally and consciously or not, borrowers’
writings seem to disclose their true nature, intentions, and circumstances. It contributes
to the literature on the implications and meaning of word usage by showing that people in different
economic and financial situations use words differently.
Second, we make an additional contribution to the text-analytics literature. The text-
mining literature has primarily concentrated on predicting behaviors that occur at the time of
writing the text, such as lying about past events (Newman et al. 2003) or writing fake reviews
(Ott, Cardie, and Hancock 2012), but not on predicting the future behavior of writers. In one
exception, Slatcher and Pennebaker (2006) show that couples who used more positive emotion
words when writing to each other were more likely to continue dating three months later. The
behavior we document often occurs years after the text was written.
Third, our approach to predicting default relies on an automatic algorithm that mines
individual words, including those without much meaning (e.g., articles and fillers), in the entire
textual corpus. Past work on narratives used to facilitate economic transactions (e.g., Chen, Yao,
and Kotha 2009; Herzenstein, Sonenshein, and Dholakia 2011; Martens, Jennings, and Jennings
2007) employed human coders and is therefore prone to human mistakes and not scalable, which
limits its predictive ability and practical use. It may not be surprising that people who describe
hardship in the text at loan origination are more likely to default—as past research has shown.
However, our work finds that defaulters and repayers also differ in their usage of pronouns and
tenses, and those seemingly harmless pronouns have the ability to predict future economic
behaviors. Research shows that unless agents are very mindful of their usage of
pronouns, it is unlikely that they notice (not to mention plan) their use of specific pronouns in
order to manipulate the reader (Pennebaker 2011)—lenders in our case.
Fourth, we provide evidence that our method of automatically analyzing free text is an
effective way of supplementing traditional measures and even replacing some aspects of the
human interaction of traditional bank loans. Textual information, such as the text we analyze, can
not only shed light on past behavior (which may be captured by measures such as FICO scores)
and provide context to it, but also provide information about the future, which is unlikely to be
captured by FICO scores and credit history reports. These future events may be positive (e.g.,
graduating and accepting a new job) or negative (e.g., impending medical costs), and certainly
affect lending decisions.
Furthermore, because lending institutions place a great deal of emphasis on the
importance of models for credit risk measurement and management, they have developed their
own proprietary models, which often carry a high price tag due to data acquisition (Bloomberg
02/2013). Collecting text may be an effective and low-cost supplement to traditional financial
data and default models. Such endeavors may be particularly useful in “thin file” situations
where historical data about customers’ finances are sparse. That being said, although our objective
in this research is to explore the predictive and informative value of the text in loan applications,
using such information to decide which borrowers should be granted loans may carry both ethical
and legal considerations within the restrictions of this highly regulated industry.
Avenues for Future Research
Our research takes the first step in automatically analyzing text in order to predict default,
and therefore opens multiple research opportunities. First, we focus on predicting default
because it is an interesting behavior that is less idiosyncratic to the crowdfunding platform
whose data we analyze (compared with lending decisions). Theoretically, many aspects of loan
repayment behavior that are grounded in human behavior (e.g., extroversion; Nyhus and
Webley 2001) should be invariant to the type of loan and lending platform, whereas other
aspects may vary by context. Our research should be extended to different populations, other
types of unsecured loans (e.g., credit card debt), and secured loans (e.g., mortgages).
Second, our results should be validated in and extended to other types of media and
communication, such as phone calls or online chats. It would be interesting to test how an active
conversation—two-sided correspondence—versus only one-sided input as in our data may affect
the results. In our data, the content of the text is entirely up to borrowers—they disclose what
they wish in whichever writing style they choose. In a conversation, the borrower may be
prompted to provide certain information. Nevertheless, we predict that many of our findings will
endure because they are in the realm of pronouns whose usage is mostly subconscious.
Third, we document specific words and themes that might help lenders avoid defaulting
borrowers, and help borrowers better express themselves in requesting the loan. Based on the
market efficiency hypothesis, if both lenders and borrowers internalize the results we
documented, these results may change. In other words, the Lucas critique (Lucas 1976), that
historical data cannot be used to predict the effects of a change in economic policy because
behavior changes as policy changes, may apply to our situation. We wish to emphasize two
aspects. First, an important objective of our study is to explore and uncover the behavioral
signals or traces in the text that are associated with loan default. This objective is descriptive
rather than prescriptive and hence not susceptible to the Lucas critique. Second, for the Lucas
critique to manifest, three conditions need to hold: (1) the research findings need to fully
disseminate, (2) the agent needs
to have sufficient incentive to change her behavior, and (3) the agent needs to be able to change
her behavior based on the proposed findings (Van Heerde, Dekimpe, and Putsis 2005). With
respect to the first point, evidence from deception detection mechanisms (Li et al. 2014) as well
as macroeconomics (Rudebusch 2005) suggests that dissemination of research results rarely fully
affects behavior. Moreover, recall that borrowers in our context vary substantially in terms of their
socio-economic and education level, which makes it unlikely that any published result will fully
change the behavior of all, or even the majority, of the agents. As for the second point, borrowers
indeed have a strong incentive to change their behavior, given the importance of the decision.
However, they might not necessarily be able to do so, which is related to the third point. Even
if our results are fully disseminated, people's use of pronouns and tenses (the finding at the heart
of our research) has been found to be largely subconscious, and changing it demands a heavy
cognitive load (Pennebaker 2011). Nevertheless, we encourage future research to examine how
the word usage of defaulters changes over time.
Finally, while we are studying the predictive ability of written text regarding a particular
future behavior, our approach can be easily extended to other behaviors and industries. For
example, universities might be able to predict students’ success based on the text in the
application (beyond people manually reading the essays). Similarly, human resource departments
and recruiters can use the words in the text applicants write to identify promising candidates.
To conclude, borrowers leave meaningful signals in the text of loan applications that help
predict default, sometimes years post loan origination. Our research adds to the literature utilizing
text mining to better understand consumer behavior (Humphreys and Jen-Hui Wang 2018), and
especially in the realm of consumer finance (Herzenstein, Sonenshein, and Dholakia 2011).
REFERENCES
Agarwal, Sumit and Robert Hauswald (2010), “Distance and Private Information in Lending,”
Review of Financial Studies, 23(7), 2757-2788.
Agarwal, Sumit, Paige M. Skiba, and Jeremy Tobacman (2009), “Payday Loans and Credit Cards:
New Liquidity and Credit Scoring Puzzles,” American Economic Review, 99(2), 412-17.
Amar, Moty, Dan Ariely, Shahar Ayal, Cynthia E. Cryder, and Scott I. Rick (2011), “Winning the
Battle but Losing the War: The Psychology of Debt Management,” Journal of Marketing
Research, 48(SPL), S38-S50.
Anderson, Jon, Stephen Burks, Colin DeYoung, and Aldo Rustichini (2011), “Toward the
Integration of Personality Theory and Decision Theory in the Explanation of Economic
Behavior.” Presented at the IZA workshop: Cognitive and non-cognitive skills.
Arya, Shweta, Catherine Eckel, and Colin Wichman (2013), “Anatomy of the Credit Score,”
Journal of Economic Behavior and Organization 95, 175-185.
Avery, Robert B., Raphael W. Bostic, Paul S. Calem, and Glenn B. Canner (2000), “Credit
Scoring: Statistical Issues and Evidence from Credit-Bureau Files,” Real Estate Economics,
28(3), 523-547.
Banko, Michele and Eric Brill (2001), “Scaling to Very Very Large Corpora for Natural Language
Disambiguation." In Proceedings of the 39th annual meeting on association for computational
linguistics, Association for Computational Linguistics, 26-33.
Barber, Brad M. and Terrance Odean (2001), “Boys will be Boys: Gender, Overconfidence, and
Common Stock Investment,” The Quarterly Journal of Economics, 116 (1), 261-292.
Berger Jonah, and Katherine L. Milkman (2012), “What Makes Online Content Viral?” Journal of
Marketing Research: 49 (2), 192-205.
Bergstra, James and Yoshua Bengio (2012), "Random Search for Hyper-Parameter
Optimization." Journal of Machine Learning Research, 13 (Feb), 281-305.
Berneth, Jeremy, Shannon G. Taylor, and Harvell Jackson Walker (2011), “From Measure to
Construct: An Investigation of the Nomological Network of Credit Scores” In Academy of
Management Proceedings, 1-6.
Blei, David M., Andrew Y. Ng, and Michael I. Jordan (2003), “Latent Dirichlet
Allocation,” Journal of Machine Learning Research, 3(Jan), 993-1022.
Bollen, Johan, Huina Mao, and Xiaojun Zeng (2011), “Twitter Mood Predicts the Stock Market,”
Journal of Computational Science, 2 (1), 1-8.
Bond, Gary D. and Adrienne Y. Lee (2005), “Language of Lies in Prison: Linguistic Classification
of Prisoners’ Truthful and Deceptive Natural Language,” Applied Cognitive Psychology, 19 (3),
313-329.
Brandstätter, Hermann (2005), “The Personality Roots of Saving—Uncovered from German and
Dutch Surveys,” Consumers, Policy and the Environment, 65-87.
Brown, Meta, John Grigsby, Wilbert van der Klaauw, Jaya Wen, and Basit Zafar (2015), “Financial
Education and the Debt Behavior of the Young,” Federal Reserve Bank (New York), Report #634.
Chen, Ning, Arpita Ghosh, and Nicolas S. Lambert (2014), “Auctions for Social Lending: A
Theoretical Analysis,” Games and Economic Behavior, 86, 367-391.
Chen, Xiao-Ping, Xin Yao, and Suresh Kotha (2009), “Entrepreneur Passion and Preparedness in
Business Plan Presentations: A Persuasion Analysis of Venture Capitalists’ Funding Decisions,”
Academy of Management Journal, 52 (1), 199-214.
Coyle, Tim (2017), “The Thin Filed, Underbanked: An Untapped Source of Buyers,” Credit Union
Times Magazine, September 17 issue.
DePaulo, Bella M., James J. Lindsay, Brian E. Malone, Laura Muhlenbruck, Kelly Charlton, and
Harris Cooper (2003), “Cues to Deception,” Psychological Bulletin, 129 (1), 74-118.
Dholakia, Utpal, Leona Tam, Sunyee Yoon, and Nancy Wong (2016), "The Ant and the
Grasshopper: Understanding Personal Saving Orientation of Consumers," Journal of Consumer
Research, 43 (1), 134-155.
Farrell, Joseph and Matthew Rabin (1996), “Cheap Talk,” The Journal of Economic Perspectives,
10 (3), 103-118.
Fast, Lisa A. and David C. Funder (2008), “Personality as Manifest in Word Use: Correlations with
Self-Report, Acquaintance Report, and Behavior,” Journal of Personality and Social
Psychology, 94 (2), 334-346.
Feldman, Gilad, Huiwen Lian, Michal Kosinski, and David Stillwell (2017), “Frankly, We Do Give
A Damn: The Relationship Between Profanity and Honesty,” Social Psychological and
Personality Science, 8 (7), 816-826.
Fernandes, Daniel, John G. Lynch Jr, and Richard G. Netemeyer (2014), “Financial Literacy,
Financial Education, and Downstream Financial Behaviors," Management Science, 60 (8),
1861-1883.
Freitas, Antonio L., Nira Liberman, Peter Salovey, and E. Tory Higgins (2002), “When to Begin?
Regulatory Focus and Initiating Goal Pursuit,” Personality and Social Psychology Bulletin
28(1), 121-130.
Griffiths, Thomas L. and Mark Steyvers (2004), “Finding Scientific Topics,” Proceedings of the
National Academy of Sciences, 101 (1), 5228-5235.
Gross, Jacob P.K., Cekic Osman, Don Hossler, and Nick Hillman (2009), “What Matters in Student
Loan Default: A Review of the Research Literature," Journal of Student Financial Aid, 39 (1),
19-29.
Hancock, Jeffrey T., Lauren E. Curry, Saurabh Goorha, and Michael Woodworth (2007), “On
Lying and Being Lied to: A Linguistic Analysis of Deception in Computer-Mediated
Communication,” Discourse Processes, 45 (1), 1-23.
Hansen, Stephen, Michael McMahon, and Andrea Prat (2014), "Transparency and Deliberation within
the FOMC: A Computational Linguistics Approach,” Working Paper, Columbia University.
Harkness, Sarah K. (2016), “Discrimination in Lending Markets: Status and the Intersections of
Gender and Race,” Social Psychology Quarterly, 79 (1), 81-93.
Herzenstein, Michal, Scott Sonenshein, and Utpal M. Dholakia (2011), “Tell Me a Good Story and
I May Lend You My Money: The Role of Narratives in Peer-To-Peer Lending Decisions,”
Journal of Marketing Research, 48(SPL), S138-S149.
Hirsh, Jacob B. and Jordan B. Peterson (2009), “Personality and Language Use in Self-Narratives,”
Journal of Research in Personality, 43 (3), 524-527.
Hofman, Jake M., Amit Sharma, and Duncan J. Watts (2017), “Prediction and Explanation in Social
Systems,” Science, 355 (6324), 486-488.
Hoffman, Matthew, Francis R. Bach, and David M. Blei (2010) “Online Learning for Latent
Dirichlet Allocation,” Advances in Neural Information Processing Systems, 856-864.
Humphreys, Ashlee and Rebecca Jen-Hui Wang (2018), “Automated Text Analysis for Consumer
Research,” Journal of Consumer Research, forthcoming.
Jurafsky, Dan, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith (2014), “Narrative
Framing of Consumer Sentiment in Online Restaurant Reviews,” First Monday, 19 (4).
Kosinski, Michal, David Stillwell, and Thore Graepel (2013), “Private Traits and Attributes are
Predictable from Digital Records of Human Behavior,” Proceedings of the National Academy of
Sciences, 110 (15), 5802-5805.
Kupor, Daniella M., Kristin Laurin, and Jonathan Levav (2015), “Anticipating Divine Protection?
Reminders of God Can Increase Nonmoral Risk Taking,” Psychological Science, 26(4), 374-84.
Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy (2014), “Towards a General Rule for
Identifying Deceptive Opinion Spam,” Proceedings of the 52nd Annual Meeting of the
Association for Computational Linguistics, 1566-1576.
Loughran, Tim and Bill McDonald (2011), “When is a Liability Not a Liability? Textual Analysis,
Dictionaries, and 10-Ks,” The Journal of Finance, 66 (1), 35-65.
Lucas, Robert (1976), “Econometric Policy Evaluation: A Critique,” The Phillips Curve and Labor
Markets, Carnegie-Rochester Conference Series on Public Policy. New York: Elsevier, 19–46.
Lynch, John G., Richard G. Netemeyer, Stephen A. Spiller, and Alessandra Zammit (2010), “A
Generalizable Scale of Propensity to Plan: The Long and the Short of Planning for Time and for
Money,” Journal of Consumer Research 37(1), 108-128.
Mairesse, François, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore (2007), “Using
Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text,”
Journal of Artificial Intelligence Research, 30, 457-500.
Martens, Martin L., Jennifer E. Jennings, and P. Devereaux Jennings (2007), “Do the Stories They
Tell Get Them the Money They Need? The Role of Entrepreneurial Narratives in Resource
Acquisition,” Academy of Management Journal, 50 (5), 1107-1132.
Matz, Sandra and Oded Netzer (2017), “Using Big Data as a Window into Consumers’
Psychology,” Current Opinion in Behavioral Sciences, 18 (December), 7-12,
Mayer, Christopher, Karen Pence, and Shane M. Sherlund (2009), “The Rise in Mortgage
Defaults." The Journal of Economic Perspectives, 23 (1), 27-50.
McLaughlin, G. Harry (1969), “SMOG Grading—A New Readability Formula,” Journal of
Reading, 12(8), 639-646.
McAdams, Dan P. (2001), “The Psychology of Life Stories,” Review of General Psychology, 5(2),
100-123.
Mehl, Matthias R., Samuel D. Gosling, and James W. Pennebaker (2006), “Personality in its
Natural Habitat: Manifestations and Implicit Folk Theories of Personality in Daily Life,”
Journal of Personality and Social Psychology, 90 (5), 862-877.
Netemeyer, Richard G., Dee Warmath, Daniel Fernandes, and John Lynch Jr. (2018), “How Am I
Doing? Perceived Financial Well-Being, Its Potential Antecedents, and Its Relation to Overall
Well-Being,” Journal of Consumer Research, Forthcoming.
Netzer, Oded, Ronen Feldman, Jacob Goldenberg, and Moshe Fresko (2012), “Mine Your Own
Business: Market-Structure Surveillance through Text Mining,” Marketing Science, 31(3), 521-43.
Newman, Matthew L., James W. Pennebaker, Diane S. Berry, and Jane M. Richards (2003), “Lying
Words: Predicting Deception from Linguistic Styles,” Personality and Social Psychology
Bulletin, 29 (5), 665-675.
Nicholson, Nigel, Emma Soane, Mark Fenton-O'Creevy, and Paul Willman (2005), “Personality
and Domain-Specific Risk Taking,” Journal of Risk Research, 8 (2), 157-176.
Norvilitis, Jill M., Michelle M. Merwin, Timothy M. Osberg, Patricia V. Roehling, Paul Young,
and Michele M. Kamas (2006), “Personality Factors, Money Attitudes, Financial Knowledge,
and Credit-card Debt in College Students,” Journal of Applied Social Psychology, 36(6), 1395.
Nyhus, Ellen K., and Paul Webley (2001), “The Role of Personality in Household Saving and
Borrowing Behaviour,” European Journal of Personality, 15 (S1), S85-S103.
Oppenheimer, Daniel M. (2006), “Consequences of Erudite Vernacular Utilized Irrespective of
Necessity: Problems with Using Long Words Needlessly,” Applied Cognitive Psychology, 20
(2), 139-156.
Ott, Myle, Claire Cardie, and Jeff Hancock (2012), “Estimating the Prevalence of Deception in
Online Review Communities,” In Proceedings of the 21st International Conference on World
Wide Web, 201-210.
Palmer, Christopher (2015), “Why did so many subprime borrowers default during the crisis: Loose
credit or plummeting prices?” Working Paper, University of California Berkeley.
Pasupathi, Monisha (2007), “Telling and the Remembered Self: Linguistic Differences in Memories
for Previously Disclosed and Previously Undisclosed Events,” Memory, 15 (3), 258-270.
Pennebaker, James W. (2011), “The Secret Life of Pronouns,” New Scientist, 211(2828), 42-45.
Pennebaker, James W. and Lori D. Stone (2003), “Words of Wisdom: Language Use over the Life
Span,” Journal of Personality and Social Psychology, 85 (2), 291-301.
Pennebaker, James W. and Anna Graybeal (2001), “Patterns of Natural Language Use: Disclosure,
Personality, and Social Integration,” Current Directions in Psychological Science, 10 (3), 90-3.
Pennebaker, James W. and Laura A. King (1999), “Linguistic Styles: Language Use as an
Individual Difference,” Journal of Personality and Social Psychology, 77 (6), 1296-1312.
Pennebaker, James W., Tracy J. Mayne, and Martha E. Francis (1997), “Linguistic Predictors of
Adaptive Bereavement,” Journal of Personality and Social Psychology, 72 (4), 863-871.
Pompian, Michael M. and John M. Longo (2004), “A New Paradigm for Practical Application of
Behavioral Finance: Creating Investment Programs Based on Personality Type and Gender to
Produce Better Investment Outcomes,” Journal of Wealth Management, 7, 9-15.
Pope, Devin G. and Justin R. Sydnor (2011), “What’s in a Picture? Evidence of Discrimination
from Prosper.com,” Journal of Human Resources, 46 (1), 53-92.
Preotiuc-Pietro, Daniel, et al. (2015), “The Role of Personality, Age and Gender in Tweeting about
Mental Illnesses,” NAACL HLT, 21-30.
Rudebusch, Glenn D. (2005), “Assessing the Lucas Critique in Monetary Policy Models." Journal
of Money, Credit, and Banking, 37(2), 245-272.
Rugh, Jacob S., and Douglas S. Massey (2010), “Racial Segregation and the American Foreclosure
Crisis,” American Sociological Review, 75 (5), 629-651.
Rustichini Aldo, Colin DeYoung, Jon E. Anderson, Stephen Burks (2016), “Toward the Integration
of Personality Theory and Decision Theory in Explaining Economic Behavior: An
Experimental Investigation,” Journal of Behavioral and Experimental Economics, 64, 122-137.
Schwartz, H. Andrew, et al. (2013), “Personality, Gender, and Age in the Language of Social
Media: The Open-Vocabulary Approach,” Plos One, 8 (9), e73791.
Sengupta, Rajdeep, and Geetesh Bhardwaj (2015), “Credit Scoring and Loan Default."
International Review of Finance, 15 (2), 139-167.
Shah, Anuj K., Sendhil Mullainathan, and Eldar Shafir (2012), “Some Consequences of Having
Too Little,” Science, 338 (6107), 682-685.
Sievert, Carson and Kenneth E. Shirley (2014), "LDAvis: A Method for Visualizing and
Interpreting Topics." Proceedings of the workshop on interactive language learning,
visualization, and interfaces, 63-70.
Slatcher, Richard B. and James W. Pennebaker (2006), “How Do I Love Thee? Let Me Count The
Words,” Psychological Science, 17 (8), 660-664.
Soman, Dilip and Amar Cheema (2011), “Earmarking and Partitioning: Increasing Saving by Low-
Income Households,” Journal of Marketing Research, 48(SPL), S14-S22.
Sonenshein, Scott, Michal Herzenstein, and Utpal M. Dholakia (2011), “How Accounts Shape
Lending Decisions Through Fostering Perceived Trustworthiness,” Organizational Behavior
and Human Decision Processes, 115 (1), 69-84.
Tausczik, Yla R. and James W. Pennebaker (2010), “The Psychological Meaning of Words: LIWC
and Computerized Text Analysis Methods,” Journal of Language and Social Psychology, 29
(1), 24-54.
Tibshirani, Robert (1997), “The Lasso Method for Variable Selection in the Cox Model,” Statistics
in Medicine, 16(4), 385–395.
Tirunillai, Seshadri and Gerard J. Tellis (2012), “Does Chatter Really Matter? Dynamics of User-
Generated Content and Stock Performance,” Marketing Science, 31 (2) 198-215.
Toma, Catalina L. and Jeffrey T. Hancock (2012), “What Lies Beneath: The Linguistic Traces of
Deception in Online Dating Profiles,” Journal of Communication, 62 (1), 78-97.
Tumasjan, Andranik, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe (2010),
“Predicting Elections with Twitter: What 140 Characters Reveal About Political Sentiment,”
ICWSM, 178-185.
Van Heerde, Harald J., Marnik G. Dekimpe, and William P. Putsis Jr. (2005), “Marketing Models
and the Lucas Critique,” Journal of Marketing Research, 42 (1), 15-21.
Ventrella, Jeffrey J. (2011), Virtual Body Language, ETC Press.
Wei, Yanhao, Pinar Yildirim, Christophe Van den Bulte, and Chrysanthos Dellarocas (2015),
“Credit Scoring with Social Network Data,” Marketing Science, 35 (2), 234-258.
Weiss, Brent and Robert S. Feldman (2006), “Looking Good and Lying to Do It: Deception as an
Impression Management Strategy in Job Interviews,” Journal of Applied Social Psychology, 36
(4), 1070-1086.
Weinstein, Neil D. (1980) “Unrealistic Optimism about Future Life Events,” Journal of Personality
and Social Psychology, 39(5), 806-820.
Whalen, Sean and Gaurav Pandey (2013), “A Comparative Analysis of Ensemble Classifiers: Case
Studies in Genomics,” In Data Mining (ICDM), 2013 IEEE 13th International Conference, 807-
816.
Yarkoni, Tal (2010), “Personality in 100,000 Words: A Large-Scale Analysis of Personality and
Word Use among Bloggers,” Journal of Research in Personality, 44 (3), 363-373.
Table 1. Descriptive statistics for the Prosper data (n = 18,312)
Variable                                    Min      Max      Mean       SD     Freq.
Amount requested                          1,000   25,000   6,507.3  5,732.9
Debt-to-income ratio                          0    10.01       .33      .89
Lender interest rate                          0     .350      .180     .077
Number of words in description                1      766     207.9    137.4
Number of words in title                      0       13     4.593    2.015
% of long words (6+ letters)                 0%    71.4%     29.8%     6.4%
SMOG                                      3.129       12    11.347    1.045
Enchant spellchecker                          0       56     2.986    3.074
# Prior listings                              0       67     2.016    3.097
Credit grade: AA                                                              0.086
Credit grade: A                                                               0.082
Credit grade: B                                                               0.186
Credit grade: C                                                               0.219
Credit grade: D                                                               0.170
Credit grade: E                                                               0.128
Credit grade: HR                                                              0.129
Loan repayment (1 = paid, 0 = defaulted)                                      0.669
Loan image dummy                                                              0.670
Home owner dummy                                                              0.470
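The SMOG readability score reported in Table 1 follows McLaughlin's (1969) formula, a grade level computed from the count of polysyllabic words (three or more syllables) in a sample of sentences. A minimal sketch (the function name is ours; counting polysyllables from raw text would require a separate syllable counter):

```python
import math

def smog_grade(n_polysyllables, n_sentences):
    """SMOG grade (McLaughlin 1969):
    1.0430 * sqrt(polysyllable count * 30 / sentence count) + 3.1291.
    With zero polysyllabic words the grade bottoms out at 3.1291,
    matching the minimum SMOG value of 3.129 in Table 1."""
    return 1.0430 * math.sqrt(n_polysyllables * (30.0 / n_sentences)) + 3.1291
```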
Table 2. Area under the curve (AUC) for models with text only, financial and demographic information only, and a combination of both

                              (1) Text only   (2) Financial/demog.   (3) Text & financial/demog.   Improvement from (2) to (3)
Low credit grades: D, E, HR   61.33%          62.44%                 64.96%                        4.03%**
Medium credit grades: B, C    62.37%          65.72%                 68.13%                        3.67%**
High credit grades: AA, A     71.31%          76.05%                 78.09%                        2.68%**
Overall AUC                   66.68%          70.52%                 72.56%                        2.89%**
Jaccard Index                 35.85%          37.85%                 38.85%                        2.64%**

Notes: all AUCs are averaged across 10 replications of 10-fold cross-validation. See Figure 1 for a plot of the receiver operating characteristic (ROC) curve for one randomly selected fold. The Jaccard index is calculated as N00/(N01+N10+N00), where N00 is the number of correctly predicted defaults, and N01 and N10 are the numbers of mispredicted repayments and defaults, respectively. ** represents significant improvement at the 0.05 level.
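The Jaccard index defined in the note can be computed directly from the confusion counts; a minimal sketch (the function name is ours, and note that the index is unaffected by which error count is labeled N01 versus N10, since both appear only in the denominator):

```python
def jaccard_default(y_true, y_pred):
    # Jaccard index on the default (0) class, per the note to Table 2:
    # N00 / (N01 + N10 + N00), where N00 counts correctly predicted defaults
    # and N01/N10 count the two kinds of misprediction.
    n00 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    n01 = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    n10 = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return n00 / (n00 + n01 + n10)
```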
Table 3. The seven LDA topics and representative words with highest relevance
LDA topic Words with highest relevance (λ = 0.5)
Employment and School Work, Job, Full, School, Year, College, Income, Employ, Student
Interest Rate Reduction Debt, Interest, Rate, High, Consolidate, Score, Improve, Lower
Expenses Explanation Expense, Explain, Cloth, Entertainment, Cable, Why, Utility, Insurance, Monthly
Business and Real Estate Business, Purchase, Company, Invest, Fund, Addition, Property, Market, Build, Cost, Sell
Family Bill, Try, Family, Life, Husband, Medical, Reality, Care, Give, Children, Hard, Daughter, Chance, Son, Money, Divorce
Loan Details and Explanations Loan, Because, Candidate, Situation, Financial, Purpose, House, Expense, Monthly, Income
Monthly Payment Month, Payment, Paid, Total, Account, Rent, Mortgage, Save, List, Every, Payday, Budget
Note: the sample words are chosen based on the relevance measure with λ = 0.5. See Table A8 in the Web Appendix for the list of words with top relevance in each topic.

Table 4. Binary regression with the seven LDA topics (repayment = 1)*
Financial and loan-related variables    Estimate (Std. E)
Amount Requested (in $10^5)             -7.53 (0.36)
Credit Grade HR                         -0.83 (0.08)
Credit Grade E                          -0.47 (0.08)
Credit Grade D                          -0.34 (0.06)
Credit Grade C                          -0.20 (0.06)
Credit Grade A                          0.81 (0.08)
Credit Grade AA                         0.26 (0.07)
Debt To Income                          -0.09 (0.02)
Images                                  0.06 (0.04)
Home Owner Status                       -0.33 (0.04)
Lender Interest Rate                    -5.20 (0.31)
Bank Draft Fee Annual Rate              -35.64 (19.52)
Prior Listings                          -0.03 (0.01)
Intercept                               2.51 (0.40)

Textual variables                       Estimate (Std. E)
Number of words in description (in 10^4)  -4.02 (2.04)
Number of spelling mistakes             0.00 (0.01)
SMOG (in 10^3)                          -1.8 (20.9)
Words with 6 letters or more            -0.51 (0.40)
Number of words in the title (in 10^3)  -7.70 (9.01)
Employment and school                   1.95 (0.43)
Interest rate reduction                 2.68 (0.42)
Expenses explanation                    -0.54 (0.61)
Business and real estate loan           0.57 (0.38)
Family                                  -1.28 (0.43)
Monthly payments                        0.65 (0.40)
* Bold face for P-value ≤ 0.05. For brevity, we do not report in this table the estimates of the demographic variables such as location, age, gender, and race.
Table 5. Binary regression with LIWC (repayment = 1)*
Variable                                Beta (Std. E)

Financial and basic text variables:
Amount Requested (×10^5)                -7.163 (0.3668)
Credit Grade HR                         -0.8551 (0.0844)
Credit Grade E                          -0.4642 (0.0817)
Credit Grade D                          -0.3383 (0.0623)
Credit Grade C                          -0.1959 (0.0559)
Credit Grade A                          0.7837 (0.0802)
Credit Grade AA                         0.2838 (0.0692)
Debt To Income                          -0.0906 (0.0186)
Images                                  0.0599 (0.0389)
Home Owner Status                       -0.3199 (0.0381)
Lender Interest Rate                    -5.2556 (0.3148)
Bank Draft Fee Annual Rate              -33.9126 (19.509)
Prior Listings                          -0.0236 (0.0058)
Number of words in Description (×10^4)  -3.494 (1.96)
Number of spelling mistakes             -0.0124 (0.0068)
SMOG                                    -0.0252 (0.0209)
Words with 6 letters or more            0.4455 (0.5716)
Number of words in the title            -0.0062 (0.6035)
Intercept                               3.6557 (0.6035)

LIWC dictionary:
Swear words                             35.5112 (35.275)
Perception words                        13.4328 (10.839)
Filler words                            13.3939 (6.224)
Friend words                            9.7894 (7.0217)
Relative words                          9.1729 (2.3748)
Anxiety words                           8.7494 (8.9305)
Negate words                            6.0709 (3.3228)
Insight words                           5.0732 (2.8214)
We words                                4.1277 (8.3628)
Pronoun words                           3.7935 (9.9981)
Exclusion words                         3.1073 (2.7497)
Sad words                               2.9955 (6.4981)
Quantitative words                      2.7495 (1.9363)
Articles                                2.457 (2.0896)
Numbers words                           2.2907 (2.7328)
Preposition words                       2.1719 (1.8415)
Conjoint words                          1.8673 (1.8392)
Auxiliary verbs words                   1.7732 (2.3818)
Affect words                            1.2929 (1.5234)
Discrepancy words                       1.2769 (2.686)
Cognitive mechanism words               0.7625 (1.8828)
Negative emotion words                  0.7453 (4.7407)
Leisure words                           0.5548 (2.6577)
Work words                              0.5175 (0.9333)
Person pronoun words                    0.4119 (6.6093)
Positive emotion words                  0.2869 (2.0477)
Money words                             0.2085 (0.7944)
Ingest words                            -0.0434 (5.279)
Verbs words                             -0.1936 (1.3174)
Adverbs words                           -0.3578 (1.8814)
Functional words                        -0.9427 (1.8725)
Bios words                              -1.3376 (2.6575)
Assent words                            -1.3651 (14.463)
Family words                            -1.4804 (2.8298)
I pronoun words                         -1.72 (10.026)
Past words                              -2.1032 (1.9895)
Inhibition words                        -2.3047 (3.4172)
Home words                              -2.3822 (1.7643)
Hear words                              -2.4191 (14.038)
I words                                 -2.7392 (8.1836)
Tentative words                         -2.8712 (2.0522)
Non-fluency words                       -3.2295 (9.518)
Anger words                             -3.2911 (9.7405)
Achieve words                           -3.3204 (1.5601)
Incline words                           -3.5433 (2.3316)
She/he words                            -3.5689 (7.3598)
You words                               -3.714 (8.8219)
Cause words                             -3.7248 (2.407)
Social words                            -4.2882 (1.5697)
Health words                            -4.7602 (4.2679)
Certain words                           -5.2433 (2.7262)
Present words                           -6.223 (1.7067)
Human words                             -7.5781 (3.5803)
Space words                             -8.2317 (2.5648)
Future words                            -8.4576 (3.5391)
Motion words                            -9.1849 (2.8071)
Time words                              -9.4218 (2.3077)
See words                               -10.5021 (11.597)
Sexual words                            -10.5097 (10.828)
They words                              -15.491 (9.3357)
Death words                             -16.3445 (10.721)
Body words                              -19.1156 (5.8326)
Religion words                          -20.2865 (6.7741)
Feel words                              -24.617 (11.734)
* Bold face for P-value ≤ 0.05. For brevity, we do not report in this table the estimates of the demographic variables such as location, age, gender, and race.
Figure 1. Receiver operating characteristic (ROC) curves for models with text only, financial and demographic information only, and a combination of both
Figure 2. Words indicative of loan repayment
Note: The most common words appear in the middle cloud (cutoff = 1:1.5) and are then organized by themes. Clockwise from the top right, in green: relative words, time-related words, “I” words, words related to borrowing and debt, and words related to a brighter financial future.
Figure 3. Words indicative of loan default
Note: The most common words appear in the middle cloud (cutoff = 1:1.5) and are then organized by themes. Clockwise from the top, in black: words related to explanations, external influence and others, future tense, time, work, extremity, appealing to lenders, financial hardship, hardship, medical and family issues, and desperation and plea.
Figure 4. Naïve Bayes analysis for loan funding and loan default
WEB APPENDIX
When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications
Table of Contents
1. Additional Information about our Analyses
   A. Procedure for coding of the profile pictures for age, gender, and race
   B. Random Forest and Extra Trees
   C. L1 regularization regression - predictive results
   D. Latent Dirichlet allocation (LDA) - predictive results
   E. Linguistic Inquiry and Word Count (LIWC) - predictive results
2. Additional Tables and Figures
   Table A1: Correspondence between Prosper's credit grades and FICO scores
   Table A2: Area under the curve (AUC) of the underlying models of the ensemble
   Table A3: L1 regularization binary logistic regression (1 = repayment)
   Table A4a: Bi-grams that appeared frequently in repaid loans
   Table A4b: Bi-grams that appeared frequently in defaulted loans
   Table A5: Top 120 variables with the highest importance in the Random Forest analysis
   Table A6: Summary statistics of dataset of all loan requests (n = 122,479)
   Table A7a: Bi-grams that appeared frequently in funded loans
   Table A7b: Bi-grams that appeared frequently in unfunded loans
   Table A8: Lists of the top 30 words with the highest relevance measure for each LDA topic
   Figure A1: LDA analysis – selecting the number of topics based on perplexity
1. Additional Information about our Analyses
A. Procedure for coding of the profile pictures for age, gender, and race
About a third of the borrowers' profiles in our data (6,078 profiles) included at least one picture that was not a stock photo; however, many pictures were not of the borrower or included more than one person. To identify the borrower in the picture, we manually coded the borrowers' profile pictures using the following process. If the picture included a caption, we relied on it to identify the borrower (for example, “My lovely wife and I”). If the picture did not include a caption and there was one adult in the picture, we assumed that adult was the borrower (following the procedure in Pope and Sydnor 2011). Once borrowers were identified, we recorded their gender (Female, Male, “Cannot Tell”), age (in three brackets: Young, Middle-aged, Old), and race (Caucasian, African American, Asian, Hispanic, or “Cannot Tell”). If the picture included more than one adult and there was no caption, or if the picture did not include any adult (e.g., the picture included kids, pets, or a kitchen project), we could not identify the borrower and therefore coded the gender and race for that picture as “Cannot Tell”. We imputed the age in unidentified pictures with the average age of the identified pictures, with the three age categories coded as 1, 2, and 3, respectively. Each picture was evaluated by at least two different undergraduate student coders, who were unaware of the research objective. Cohen's kappas suggest fairly high levels of agreement across coders: gender = 0.89, race = 0.67, and age = 0.44.10 Disagreements were resolved by an additional coder who served as the final judge, observing the ratings of the previous coders.
10 Because agreement across coders for age was lower, we also tested a model without this variable. Excluding the age variable did not qualitatively affect our results.
B. Random Forest and Extra Trees
Random Forest and Extremely Randomized Trees (Extra Trees) are tree ensembles: both combine the predictions of a large number of decision trees, where each tree is fit on a random subset of the data and features and the trees' predictions are aggregated. The Random Forest randomly draws, with replacement, subsets of the calibration data to fit each tree, and a random subset of features (variables) is used in each tree. In the Variance Selection Random Forest, features are chosen based on a variance threshold determined by cross-validation. The idea behind the variance selection threshold is to remove features whose variance does not meet a certain threshold; by definition, features that have zero variance (the same value in all samples) are removed (for further details see http://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold). We tested variance thresholds in the range of 0.001-0.0650 by increments of 0.00025, and find the selected threshold to be in the range of 0.00175-0.00375 across folds. In the Best Feature Selection Random Forest, features are selected based on a χ² test. That is, we select the K features with the highest χ² score (other approaches include F-values or mutual information criteria; for further details see http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest). We use cross-validation to determine the “optimal” value of K. We allowed K to vary between a minimum of 10 features and a maximum of half of the training features (over 500 features in our case), by increments of 50 features. We find the number of features to be in the range of 60-260 across folds. The Extra Trees model is an extension of the Random Forest in which the cut-off point (the split) for each feature in the tree is also chosen at random (from a uniform distribution) and the best split among them is selected (for further details see http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html). We use the maximal and minimal value of each feature observed in the data to set the boundaries of the uniform distribution for each feature. Due to the size of the feature space, we first apply K-Best Feature Selection, as described above, to select the features to be included in the Extra Trees. We find the number of features to be in the range of 60-460 across folds. For all tree-based methods, to limit over-fitting of the trees, we used randomized parameter optimization (Bergstra and Bengio 2012) with a 3-fold cross-validation on the calibration data to determine the structure of the tree (e.g., number of leaves, number of splits, depth of the tree, and split criteria). We use randomized parameter optimization rather than an exhaustive (grid) search due to the large number of variables in our model. The parameters are sampled from a uniform distribution over all possible parameter values. We set the ranges for the parameters that dictate the structure of the trees as follows:
• Number of leaves [1-11]
• Depth of the tree [3 - max number of features]
• Minimum sample split [1-11]
• Minimum sample leaf [1-11]
• Criteria for splits [Gini or Entropy]
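The randomized draws over these ranges can be sketched as follows. The parameter names mirror scikit-learn's decision-tree arguments (our assumption), and the lower bounds are lightly raised where needed so that every draw is valid; in the actual analysis each candidate would be scored with 3-fold cross-validation and the best retained:

```python
import random

def sample_tree_params(max_features, rng):
    # One uniform random draw over the search ranges listed above.
    return {
        "max_leaf_nodes": rng.randint(2, 11),        # "number of leaves" [1-11]
        "max_depth": rng.randint(3, max_features),   # depth [3 - max # features]
        "min_samples_split": rng.randint(2, 11),     # [1-11], floor raised to 2
        "min_samples_leaf": rng.randint(1, 11),      # [1-11]
        "criterion": rng.choice(["gini", "entropy"]),
    }

rng = random.Random(0)
candidates = [sample_tree_params(1032, rng) for _ in range(25)]
```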
Reference
Bergstra, James and Yoshua Bengio (2012), “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, 13 (Feb), 281-305.
C. L1 regularization regression - predictive results
To test whether the naïve Bayes findings are sensitive to the inclusion of demographic and financial information and to the interdependence among words, we employ a logistic regression with an L1 penalization with the same 1,032 bi-grams used in the ensemble learning and naïve Bayes analyses, as well as the demographic and financial information. This analysis, while less easily interpretable than the naïve Bayes, provided very similar qualitative results (see Tables A3 and A4). The correlation between the results of the naïve Bayes and the L1 regression is 0.582 (P < 0.01). The L1 regression results confirm that the writing styles and intentions we identified through the naïve Bayes analysis are not merely a proxy for the demographic and financial information.
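A toy illustration of how the L1 penalty drives uninformative coefficients to exactly zero, using proximal gradient descent on the logistic loss (a pedagogical sketch on made-up data, not the estimator used in the paper):

```python
import math

def soft_threshold(v, t):
    # Proximal operator of the L1 penalty.
    if v > t:
        return v - t
    if v < -t:
        return v + t
    return 0.0

def l1_logistic(X, y, lam=0.1, lr=0.5, iters=3000):
    # Proximal gradient descent on logistic loss + lam * ||w||_1 (no intercept).
    n, d = len(X), len(X[0])
    w = [0.0] * d
    for _ in range(iters):
        p = [1 / (1 + math.exp(-sum(wj * xj for wj, xj in zip(w, x)))) for x in X]
        grad = [sum((p[i] - y[i]) * X[i][j] for i in range(n)) / n for j in range(d)]
        w = [soft_threshold(w[j] - lr * grad[j], lr * lam) for j in range(d)]
    return w

# Feature 0 tracks the label; feature 1 is pure noise.
X = [[1, 1], [1, -1], [-1, 1], [-1, -1]]
y = [1, 1, 0, 0]
w = l1_logistic(X, y)
```

On this tiny separable example, the noise coefficient stays at exactly zero while the informative one grows only until its gradient balances the penalty.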
D. Latent Dirichlet allocation (LDA) - predictive results
Although the purpose of the LDA analysis was to learn about the topics discussed in loan requests rather than to predict default, we nevertheless tested the predictive ability of the uncovered topics. We find that the model that includes the LDA topics fits the data better than a model that does not include the textual information in terms of the Akaike information criterion (AIC(LDA) = 20,872 and AIC(no text) = 21,078). Furthermore, the likelihood ratio test significantly supports the model with the textual information relative to the model without it (LR(df = 17) = 227.34, p < 0.001). We ran a 10-fold cross-validation similar to the one conducted for the ensemble learning model. We find that the model with the LDA topics and the other textual variables (e.g., number of characters in the loan request) predicts defaults better than a baseline model that includes all the financial and demographic information but no textual information (AUC(LDA) = 71.4% vs. AUC(no LDA) = 70.1%). The model with the LDA variables provided a higher AUC relative to the model without the textual information in all 10 folds.11
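The two model-comparison criteria used above follow the standard formulas; a minimal sketch (the log-likelihood arguments are hypothetical, not the paper's values):

```python
def aic(log_lik, k):
    # Akaike information criterion, 2k - 2 ln L; smaller is better.
    return 2 * k - 2 * log_lik

def likelihood_ratio_stat(log_lik_full, log_lik_restricted):
    # LR statistic for nested models; asymptotically chi-squared with
    # df equal to the number of extra parameters in the full model.
    return 2 * (log_lik_full - log_lik_restricted)
```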
E. Linguistic Inquiry and Word Count (LIWC) - predictive results
We find that the model that includes the LIWC dictionaries fits the data better than a model that does not include the textual information in terms of the Akaike information criterion (AIC(text) = 20,900 and AIC(no text) = 21,078) and the likelihood ratio test (LR(df = 69) = 319.94, p < 0.001). To test the predictive ability of this model, we ran a 10-fold cross-validation similar to the one conducted for the ensemble learning model. We find that the model with LIWC predicts defaults better than a baseline model that includes all the financial and demographic information but no textual information in all 10 folds (average AUC(LIWC) = 70.9% vs. AUC(no LIWC) = 70.1%).
11 We also estimated an ensemble predictive model with the LDA topics (in addition to the 1,032 bi-grams). The AUC of the ensemble with the LDA topics (72.54%) is very similar to that of the ensemble without them (72.56%).
2. Additional Tables and Figures
Table A1: Correspondence between Prosper’s credit grades and FICO scores
Grade   AA     A        B        C        D        E        HR
Score   760+   720-759  680-719  640-679  600-639  560-599  520-559
Table A2: Area under the curve (AUC) of the underlying models of the ensemble
                                          Text only   Financial/demog.   Text & financial/demog.
                                          (Model 1)   (Model 2)          (Model 3)
Logistic L1                               65.41%      70.55%             71.86%***
Logistic L2                               66.82%      68.98%             71.62%***
Random Forest (Variance Selection)        64.56%      69.68%             71.67%***
Random Forest (Best Features Selection)   62.60%      70.48%             71.33%***
Extremely Randomized Trees (Extra Trees)  65.01%      70.35%             71.23%***
The Stacking Ensemble Model               66.68%      70.52%             72.56%***
*** The difference between Model 2 and Model 3 is statistically significant at the 0.01 level based on the binomial test.
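The AUCs reported throughout equal the probability that a randomly chosen repaid loan receives a higher predicted repayment score than a randomly chosen defaulted loan; a minimal pairwise sketch of that equivalence (function name ours):

```python
def auc(scores, labels):
    # AUC as the Mann-Whitney probability: the share of (positive, negative)
    # pairs where the positive case is scored higher (ties count one half).
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```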
Table A3: L1 regularization binary logistic regression (1 = repayment).
Results for variables with β ≠ 0 Variable Beta Variable Beta Variable Beta Amount Requested (x 1000) -0.0684 health 2.0569 step 0.9874 Credit Grade HR -0.7297 year ago 1.8547 ani question 0.9852 Credit Grade E -0.3821 side 1.7264 payment and 0.9688 Credit Grade D -0.2980 lower interest 1.6991 minimum 0.9558 Credit Grade C -0.1592 august 1.6706 earli 0.9527 Credit Grade A 0.8027 borrow 1.6181 unfortun 0.9439
Credit Grade AA 0.3022 prosper lender 1.5487 last year 0.9366 Debt To Income Ratio -0.0840 com 1.4588 your 0.9358 Images 0.0155 than 1.4532 your consider 0.9038 Is Borrower Homeowner -0.2856 averag 1.4005 while 0.9003 Lender Rate -5.1851 lend 1.3887 fall 0.8904 Bank Draft Fee Annual Rate 0.0000 card debt 1.3826 loan payment 0.8880 New England 0.0754 bonu 1.3187 student 0.8749 Middle East 0.2965 car payment 1.3083 the cost 0.8748 Great Lakes 0.0708 card with 1.3026 run 0.8726 Plains Regions 0.0443 and current 1.2919 larg 0.8598 South West 0.0179 consult 1.2630 tax 0.8495 Rocky Mountain 0.2650 dure 1.2123 anoth 0.8416 Far West 0.0726 few month 1.2043 low 0.8403 Military 1.3085 pay for 1.1987 contribut 0.8180 # number of words in description -0.0006 graduat 1.1670 and had 0.8006 Spelling Mistakes -0.0033 and plan 1.1565 and our 0.7959 SMOG -0.0237 wed 1.1529 thi debt 0.7940 % of words greater 6 -0.1602 but have 1.1525 cover the 0.7819 Gender Male 0.1377 goe 1.1193 colleg 0.7779 Gender Female 0.0000 pay thi 1.1160 electr 0.7696 Age -0.1769 car insur 1.1069 coupl 0.7643 Race White 0.1454 the credit 1.1045 the first 0.7629 Race African American -0.2214 purchas 1.0846 inform 0.7600 Race Asian 0.2814 payment for 1.0775 order 0.7490 Race Hispanics 0.0000 the debt 1.0718 the other 0.7487 # Prior Loan Listings -0.0209 student loan 1.0715 appli 0.7459 Race Unknown 0.0000 reflect 1.0680 off the 0.7414 Gender Unknown 0.0565 last 1.0514 the minimum 0.7373 Attractive -0.0016 and get 1.0505 sinc 0.7351 Trustworthy 0.0729 off thi 1.0500 bank 0.7313 reinvest 2.8241 save 1.0237 longer 0.7272 the balanc 2.1761 incom ratio 1.0193 activ 0.7182 invest 2.0809 even 1.0133 rental 0.7074
almost 0.7036 past year 0.5703 both 0.3572 job with 0.6949 entir 0.5637 off with 0.3567 the high 0.6933 manag 0.5606 balanc 0.3567 the process 0.6901 file 0.5563 along 0.3470 ever 0.6884 futur 0.5561 water 0.3456 never 0.6850 into one 0.5422 comput 0.3425 earn 0.6843 posit 0.5286 year have 0.3397 good job 0.6731 quickli 0.5212 travel 0.3376 you will 0.6654 use credit 0.5162 been pay 0.3375 those 0.6636 turn 0.5133 have steadi 0.3373 june 0.6634 learn 0.5127 chang 0.3373 car loan 0.6623 and are 0.5058 paid off 0.3364 free 0.6619 understand 0.5048 payment thi 0.3298 get out 0.6595 each 0.4864 ga 0.3286 over the 0.6592 big 0.4853 singl 0.3233 see 0.6582 togeth 0.4812 five 0.3202 decid 0.6573 rebuild 0.4773 toward 0.3172 the payment 0.6539 too 0.4693 clear 0.3144 that can 0.6518 detail 0.4652 happen 0.3057 but the 0.6460 share 0.4626 rather 0.3040 although 0.6349 salari 0.4526 becaus have 0.2977 prosper and 0.6347 instead 0.4518 interest credit 0.2947 creat 0.6346 everi month 0.4427 payment the 0.2932 teacher 0.6333 though 0.4399 been with 0.2887 could 0.6283 improv credit 0.4349 stabl 0.2859 less 0.6274 way 0.4283 addit 0.2842 least 0.6246 life 0.4254 for your 0.2774 they are 0.6234 avail 0.4249 small 0.2750 return 0.6229 year monthli 0.4133 engin 0.2715 have two 0.6161 have not 0.4033 interest rate 0.2639 with prosper 0.6154 reliabl 0.4013 the past 0.2581 debt free 0.6124 under 0.3927 part 0.2568 mistak 0.6069 max 0.3927 fee 0.2567 realiz 0.5979 for over 0.3883 mean 0.2546 grow 0.5967 number 0.3883 owner 0.2525 problem 0.5948 should 0.3846 thi will 0.2508 account 0.5821 close 0.3829 through 0.2472 have great 0.5812 again for 0.3711 myself 0.2466 abov 0.5743 risk 0.3683 there 0.2465 elimin 0.5716 experi 0.3679 half 0.2408 everi 0.5711 it 0.3669 go 0.2362
financ 0.5705 loan from 0.3623 thank for 0.2361 teach 0.2320 truck 0.1161 never miss 0.0108 build 0.2260 recent 0.1140 ask 0.0083 prior 0.2205 offer 0.1118 hous and 0.0082 sure 0.2114 car and 0.1099 live 0.0007 our 0.2083 degre 0.1077 and want 0.0005 extra 0.2052 expect 0.1050 can pay -0.0015 appreci 0.2037 remain 0.1020 respons -0.0019 employe 0.2019 into 0.1006 title -0.0020 our credit 0.1998 have veri 0.1000 oblig -0.0043 promot 0.1981 veri good 0.0940 rent -0.0074 well 0.1887 right now 0.0939 loan need -0.0109 improv 0.1859 system 0.0917 equiti -0.0118 husband and 0.1858 than the 0.0873 room -0.0123 point 0.1830 retir 0.0871 finish -0.0134 the bank 0.1789 loan thank 0.0827 book -0.0134 continu 0.1787 excel credit 0.0787 public -0.0207 made 0.1786 provid 0.0765 profit -0.0209 make payment 0.1780 apart 0.0736 look for -0.0216 extrem 0.1686 solid 0.0715 high interest -0.0217 more than 0.1682 until 0.0706 prioriti -0.0280 own home 0.1663 after 0.0692 the interest -0.0394 major 0.1657 soon 0.0654 includ -0.0449 budget 0.1645 down 0.0650 perfect -0.0501 miss payment 0.1592 show 0.0641 month for -0.0505 marri 0.1557 incur 0.0633 what -0.0510 debt that 0.1554 fix 0.0622 compani -0.0512 cover 0.1520 final 0.0581 and that -0.0517 bankruptci 0.1488 off credit 0.0566 stay -0.0530 help with 0.1478 onli 0.0564 school -0.0549 default 0.1474 expens are 0.0552 and can -0.0568 payoff 0.1466 possibl 0.0500 plan -0.0576 time have 0.1441 anyth 0.0467 state -0.0590 loan that 0.1437 collect 0.0442 and make -0.0601 schedul 0.1414 still 0.0429 home and -0.0603 credit score 0.1386 guarante 0.0322 veri respons -0.0668 cost 0.1365 current have 0.0290 own -0.0674 client 0.1287 here 0.0289 score and -0.0678 except 0.1283 next year 0.0262 have good -0.0686 given 0.1250 littl 0.0228 will allow -0.0725
oper 0.1200 and not 0.0197 are not -0.0747 establish 0.1168 month that 0.0167 histori -0.0777 loan which -0.0815 will also -0.2218 deduct -0.3361 strong -0.0848 stabl job -0.2237 loan have -0.3397 mother -0.0890 dream -0.2249 becaus the -0.3398 and would -0.0926 onc -0.2263 support -0.3432 year with -0.0978 father -0.2268 thi prosper -0.3438 save for -0.0981 between -0.2290 time job -0.3450 thi year -0.0990 the end -0.2320 doe -0.3550 veri hard -0.1002 expand -0.2336 and will -0.3597 may -0.1049 payment other -0.2350 husband -0.3667 bought -0.1081 mani -0.2414 age -0.3679 and credit -0.1087 the same -0.2479 seem -0.3682 leav -0.1095 equip -0.2535 three -0.3761 pleas help -0.1106 care -0.2591 rate and -0.3794 clean -0.1137 new -0.2669 develop -0.3945 relist -0.1153 day -0.2692 give -0.3978 credit rate -0.1189 year old -0.2722 again -0.4005 hope -0.1201 the loan -0.2723 tri -0.4014 repair -0.1223 find -0.2733 be -0.4063 work and -0.1231 do -0.2833 assist -0.4070 also have -0.1257 univers -0.2856 loan off -0.4094 top -0.1259 credit and -0.2858 surgeri -0.4115 for loan -0.1266 higher -0.2860 attend -0.4144 budget mortgag -0.1372 debt and -0.2875 difficult -0.4155 yr -0.1407 will make -0.2890 mine -0.4170 know -0.1408 list and -0.2894 charg -0.4228 dont -0.1483 score -0.3015 store -0.4231 plu -0.1489 and the -0.3016 expens car -0.4311 much -0.1546 wife -0.3033 educ -0.4327 need thi -0.1562 not have -0.3054 from the -0.4361 use thi -0.1639 abl -0.3069 mortgag -0.4368 what you -0.1782 love -0.3119 child -0.4401 help pay -0.1890 receiv -0.3119 incom and -0.4422 off all -0.1944 answer -0.3181 citi -0.4423 forward -0.2088 record -0.3208 repay thi -0.4484 thing -0.2097 time monthli -0.3215 locat -0.4554 juli -0.2101 call -0.3221 capit -0.4591 and just -0.2101 will pay -0.3234 afford -0.4605 product -0.2134 for our -0.3252 licens -0.4606
consolid -0.2138 item -0.3268 their -0.4619 request -0.2193 obtain -0.3278 prosper loan -0.4637 the money -0.2194 alway -0.3298 medic -0.4727 loan becausei -0.2216 card that -0.3342 properti -0.4818 come -0.4900 kid -0.6742 will have -1.1321 for and -0.4994 advanc -0.6784 person -1.1359 where -0.4999 ani -0.6847 took -1.1447 back thi -0.5013 just need -0.6878 sale -1.1680 around -0.5046 have alway -0.6892 industri -1.1713 loan pay -0.5072 dti -0.7008 maintain -1.1923 ago and -0.5108 with the -0.7013 gener -1.1978 pay back -0.5298 stress -0.7231 daughter -1.2414 and need -0.5309 per -0.7372 local -1.2763 within the -0.5328 divorc -0.7406 promis -1.2867 these -0.5471 monthli payment -0.7425 son -1.3027 abil -0.5481 the opportun -0.7440 get back -1.4633 and help -0.5500 off some -0.7455 god -1.5715 date -0.5503 refin -0.7578 estat -1.9481 field -0.5609 they -0.7642 lost -1.9870 verifi -0.5642 late -0.7703 thank you -2.6051 she -0.5651 need help -0.7771 report -0.5670 long -0.7893 Intercept 1.9431 the time -0.5673 dollar -0.7904 explain -0.5723 behind -0.7926 famili -0.5743 becausei -0.7933 valu -0.5941 bid -0.7949 have work -0.5974 been the -0.7964 everyth -0.6013 need the -0.8029 left over -0.6025 project -0.8136 loan with -0.6121 loan and -0.8236 the year -0.6137 monthli incom -0.8270 someon -0.6157 bit -0.8356 interest loan -0.6170 taken -0.8561 price -0.6324 and veri -0.8645 sever -0.6410 sourc -0.8747 then -0.6437 websit -0.9171 off high -0.6450 busi -0.9218 the compani -0.6452 them -0.9231 were -0.6502 total monthli -0.9243
The table above reports the variables in the regression that were not set to zero. Below we list the variables that were set to zero. Note that while one can use a bootstrap approach to obtain standard errors for the L1 regularization binary logistic regression parameter estimates, because the parameters of the L1 regularization model are biased, standard errors in a regularized regression are not meaningful (Park and Casella 2008). Accordingly, we do not report standard errors in Table A3. Variables with β = 0:
Financial and demographic variables:
Number of words in the title, Credit Grade = C, Images, Lender Rate, Bank Draft Fee Annual Rate, New England, Mid East, Great Lakes, Plains Regions, South West, Rocky Mountain, Far West, Military, Gender = Male, Gender = Female, Gender Unknown, Age, Race Unknown, Race = White, Race = African American, Race = Asian, Race = Hispanics, Words with 6 or more letters. Bi-grams (listed here alphabetically):
abl pay, about month, about year, account and, actual, ad, add, after tax, ago, ahead, all bill, all credit, all debt, all our, all the, allow, almost year, alreadi, alway paid, alway pay, america, amount, and also, and for, and ha, and hope, and i'm, and now, and pay, and start, and take, and thank, and then, and they, and thi, and wa, and worl, annual, approx, approxim, are good, are paid, area, ask for, auto, automat, away, back the, back track, bad, base, becom, been employ, been late, been work, befor, begin, believ, below, benefit, best, better, bill time, bless, bring, busi and, buy, came, can get, can see, can't, card balanc, card financi, card have, career, case, cash, cash flow, catch, caus, cell, cell phone, chanc, check, child support, children, class, combin, commit, commun, compani and, compani for, complet, consid, consider, consolid credit, contact, contract, cours, credit histori, credit report, current employ, current work, custom, cut, deal, debt financi, debt have, debt incom, decis, delinqu, depend, deposit, did, did not, didn't, differ, direct, doe not, done, don't, don't have, down the, due, due the, dure the, each month, easili, emerg, employ, employ for, end, enjoy, enough, etc, everyon, excel, exist, expens and, expens for, expens ga, expens total, explain what, explain whi, far, feel, feel free, few, few year, figur, firm, first, flow, for almost, for consid, for financi, for month, for pay, for prosper, for take, for view, for year, found, four, friend, from prosper, full, full time, fulli, fund, further, ga util, get rid, get the, get thi, goal, god bless, gone, good credit, got, grade, great, greatli, groceri, gross, group, ha been, happi, hard work, have alreadi, have ani, have credit, have excel, have had, have learn, have made, have never, have one, have over, have paid, have problem, have some, have stabl, have the, hello, help get, help out, her, hi, higher interest, him, hold, honest, hospit, hour, how, howev, i'd, i'll, i'm, i'm not, immedi, import, incom after, incom from, increas, individu, intend, into the, issu, it', i'v, i'v been, job and, job for, keep, know that, late payment, law, leas, left, lender, less than, lesson, let, level, like pay, limit, list, live with, loan back, loan consolid, loan credit, loan explain, loan financi, loan for, loan help, loan monthli, loan request, loan the, loan would, look, lot, lower, make the, market, medic bill, meet, member, minimum payment, misc, miss, mom, money and, money for, money pay, month ago, month and, month have, month monthli, monthli budget, mortgag rent, most, move, name, need pay, never been, next, normal, not includ, not onli, note,
now and, now have, off and, off debt, offic, old, one payment, one the, open, opportun, origin, our home, out the, outstand, over year, overtim, owe, paid for, paid full, parent, part time, pass, past, pay all, pay bill, pay down, pay the, pay them, paycheck, payday, payment have, payment prosper, payment time, payment will, peopl, per month, period, person loan, pictur, place, plan pay, pleas, post, present, pretti, previou, process, profession, profil, program, prosper payment, prosper will, prove, purpos thi, put, question, quit, rais, ratio, read, readi, real, real estat, realli, reason, rebuild credit, reduc, remov, rent, insur, repay, replac, requir, rest, result, review, revolv, rid, right, same, save and, say, school and, second, secur, see have, seek, self, sell, servic, set, short, site, situat explain, situat have, six, sold, some credit, someth, spend, stand, start, steadi, success, such, summer, take care, take the, term, that are, that ha, that have, that the, that thi, that time, that wa, that will, that would, that you, the amount, the best, the bill, the busi, the follow, the futur, the hous, the last, the monthli, the mortgag, the new, the next, the onli, the prosper, the purpos, the reason, the remain, the rest, there are, thi money, thi time, think, thought, three year, time and, time everi, time for, time the, top prioriti, total, total expens, track, tri get, tuition, two, two year, unexpect, use consolid, use for, use help, use pay, use the, usual, vehicl, view, view list, wa not, wait, want, want pay, week, went, when, when wa, whi, whi you, who, wife', wife and, will abl, will help, will not, will paid, with credit, with thi, within, without, wonder, won't, work for, work full, work hard, work the, work with, worth, would have, would like, would use, www, year and, year now, year the, yet, you can, you for, you have, young, your help, your time
Reference
Park, Trevor and George Casella (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103 (482), 681-686.
Table A4a: Bi-grams that appeared more frequently in repaid loans
p(word|repaid)/p(word|defaulted) ≥ 1.1
Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio reinvest 4.78 low 1.44 the interest 1.30 lend 2.25 com 1.43 guarantee 1.30 i'd 2.02 cover the 1.43 usual 1.29 lower interest 2.01 miss payment 1.41 pay for 1.29 side 1.97 share 1.41 like pay 1.29 prosper lender 1.87 earli 1.41 good credit 1.29 borrow 1.80 than 1.40 miss 1.28 invest 1.80 the cost 1.40 incur 1.28 wedding 1.79 own home 1.40 current have 1.28 student loan 1.75 car and 1.39 your consider 1.28 excel credit 1.74 off this 1.39 and current 1.28 than the 1.70 income ratio 1.38 earn 1.28 card with 1.67 use credit 1.38 schedule 1.28 graduate 1.66 consult 1.38 less 1.28 rather 1.66 apart 1.36 payment for 1.28 the balance 1.66 worth 1.36 ever 1.28 student 1.65 figure 1.36 didn't 1.28 the minimum 1.65 don't 1.36 car insurance 1.27 contribute 1.62 interest rate 1.36 mean 1.27 it' 1.57 job with 1.35 health 1.27 entire 1.56 activ 1.35 debt that 1.27 never miss 1.54 return 1.35 save 1.27 thi debt 1.54 i'm 1.35 paid for 1.27 the bank 1.54 expect 1.35 all bill 1.27 i'v been 1.53 easily 1.35 down the 1.27 card debt 1.52 quickly 1.33 debt free 1.27 risk 1.52 travel 1.33 every month 1.27 engin 1.51 more than 1.33 retire 1.27 i'v 1.49 expens are 1.32 have already 1.26 i'll 1.49 balance 1.32 excel 1.26 prosper and 1.49 pretty 1.32 have great 1.26 minimum 1.49 college 1.32 default 1.26 and plan 1.49 firm 1.32 promote 1.26 goe 1.48 big 1.32 higher 1.26 after tax 1.48 rate and 1.31 averag 1.26 thank for 1.47 august 1.31 possible 1.25 and i'm 1.47 spend 1.31 have very 1.25 minimum payment 1.47 less than 1.30 good job 1.25 have excel 1.47 instead 1.30 one the 1.25 summer 1.46 debt income 1.30 have never 1.25 bonus 1.44 the debt 1.30 off credit 1.25
debt have 1.25 year monthly 1.20 term 1.16 and want 1.25 toward 1.20 past year 1.16 because have 1.25 card financi 1.20 fully 1.16 above 1.25 the remain 1.19 the other 1.16 except 1.25 time for 1.19 off debt 1.16 lower 1.25 way 1.19 credit history 1.16 lender 1.24 free 1.19 profile 1.16 about month 1.24 pay this 1.19 use pay 1.15 solid 1.24 very good 1.19 addit 1.15 been pay 1.24 payment this 1.19 and would 1.15 even 1.24 revolve 1.19 gas 1.15 have good 1.24 tax 1.19 into one 1.15 debt financi 1.24 cost 1.19 picture 1.15 while 1.23 with credit 1.19 degree 1.15 the credit 1.23 under 1.18 law 1.15 higher interest 1.23 think 1.18 comput 1.15 situation have 1.23 major 1.18 last year 1.15 stable job 1.23 book 1.18 extra 1.15 too 1.23 order 1.18 credit rate 1.15 salary 1.23 plan 1.18 out the 1.15 you have 1.23 payment the 1.18 car loan 1.15 though 1.23 feel free 1.17 replace 1.15 interest credit 1.23 case 1.17 how 1.15 cover 1.23 max 1.17 look 1.14 dure the 1.23 card have 1.17 outstand 1.14 save for 1.23 ratio 1.17 through 1.14 the high 1.23 five 1.17 profession 1.14 card balance 1.23 thought 1.17 offer 1.14 step 1.22 plan pay 1.17 and are 1.14 understand 1.22 have two 1.17 although 1.14 payment and 1.22 don't have 1.17 paid full 1.14 least 1.22 half 1.17 large 1.14 always pay 1.22 year now 1.17 consolid credit 1.14 have steady 1.22 teach 1.17 experience 1.14 expense gas 1.22 dure 1.16 financ 1.14 teacher 1.21 payment have 1.16 annual 1.14 but the 1.21 have credit 1.16 require 1.14 stable 1.21 future 1.16 loan the 1.14 any question 1.21 ad 1.16 cash flow 1.14 pay down 1.21 question 1.16 paid off 1.14 the first 1.21 reflect 1.16 reason 1.13 gas util 1.21 decide 1.16 current employ 1.13 late payment 1.21 note 1.16 young 1.13 next year 1.21 fix 1.16 univers 1.13 purchase 1.20 use consolid 1.16 bit 1.13 three year 1.20 apply 1.16 add 1.13 below 1.20 buy 1.16 year ago 1.13 site 1.20 couple 1.16 loan from 1.13 happy 1.20 far 1.16 fall 1.13 income after 1.20 anything 1.16 both 
1.13 already 1.20 longer 1.16 thi money 1.13 few month 1.20 together 1.16 would like 1.13
have stable 1.13 few year 1.11 never 1.10 best 1.13 group 1.11 extreme 1.10 should 1.13 the last 1.11 approximat 1.10 card that 1.13 next 1.11 strong 1.10 are paid 1.12 course 1.11 post 1.10 reduce 1.12 over the 1.11 rental 1.10 limit 1.12 reliable 1.11 point 1.10 flow 1.12 consider 1.11 intend 1.10 live 1.12 clear 1.11 year have 1.10 off high 1.12 feel 1.11 when was 1.10 number 1.12 remain 1.11 and pay 1.10 house and 1.12 have any 1.11 avail 1.10 june 1.12 www 1.11 never been 1.10 system 1.12 stand 1.10 small 1.10 base 1.12 something 1.10 full time 1.10 history 1.12 create 1.10 for your 1.10 have not 1.12 charge 1.10 actual 1.10 month have 1.11
Table A4b: Bi-grams that appeared more frequently in defaulted loans
p(word|defaulted)/p(word|repaid) ≥ 1.1
Bi-gram (defaulted) Ratio Bi-gram (defaulted) Ratio Bi-gram (defaulted) Ratio
payday loan 2.13 are good 1.46 father 1.31
payday 2.06 daughter 1.45 real 1.31
view list 2.03 see have 1.44 fact 1.31
god 2.02 cause 1.43 child 1.31
god bless 1.99 time every 1.43 the business 1.31
need help 1.88 worker 1.43 mother 1.31
for view 1.87 you are 1.42 surgery 1.31
top priority 1.86 and credit 1.42 hard work 1.30
bless 1.75 real estate 1.42 source 1.30
lost 1.74 follow 1.42 rebuild 1.30
the follow 1.74 estate 1.41 normal 1.30
view 1.72 left over 1.41 hard 1.30
get back 1.69 back this 1.41 expand 1.29
for prosper 1.69 location 1.40 automat 1.29
priority 1.68 pleas help 1.40 mom 1.29
promise 1.68 not online 1.40 month that 1.29
list and 1.67 prove 1.40 lesson 1.29
payment prosper 1.66 just need 1.40 tri get 1.28
prosper will 1.65 divorce 1.39 local 1.28
back track 1.64 medic bill 1.38 she 1.28
stress 1.64 for pay 1.37 know that 1.28
behind 1.64 again 1.36 this prosper 1.28
would use 1.60 very hard 1.35 bill and 1.28
loan explain 1.58 call 1.35 pay back 1.28
year 1.57 top 1.35 track 1.28
situation explain 1.56 been the 1.35 payment will 1.27
help get 1.55 hello 1.34 day 1.27
explain what 1.54 the opportunity 1.34 off some 1.27
catch 1.54 total monthly 1.34 sale 1.27
prosper payment 1.54 refin 1.34 capital 1.27
again for 1.53 have learn 1.33 advance 1.27
son 1.53 everyone 1.33 industry 1.26
what you 1.51 have over 1.32 support 1.26
explain why 1.51 rebuild credit 1.32 business and 1.26
child support 1.51 thank you 1.32 direct 1.26
someone 1.51 store 1.32 city 1.26
explain 1.50 project 1.32 husband 1.26
chance 1.50 will able 1.32 attend 1.25
and thank 1.50 equipment 1.32 for take 1.25
why you 1.49 children 1.31 relist 1.25
assist 1.25 pass 1.21 maintain 1.17
bad 1.25 monthly budget 1.21 budget mortgage 1.17
year old 1.25 product 1.20 loan request 1.16
business 1.25 you will 1.20 place 1.16
because the 1.24 verify 1.20 person 1.16
difficult 1.24 will have 1.20 loan and 1.16
left 1.24 time monthly 1.20 the company 1.16
name 1.24 the fund 1.20 got 1.16
take care 1.24 all our 1.20 repair 1.16
took 1.24 the prosper 1.19 deduct 1.16
medic 1.24 and help 1.19 save and 1.16
were 1.24 came 1.19 improve credit 1.16
family 1.24 mistake 1.19 with the 1.16
work hard 1.24 opportunity 1.19 clean 1.15
have always 1.24 general 1.19 need pay 1.15
develop 1.23 need the 1.19 them 1.15
that you 1.23 the bill 1.19 loan for 1.15
license 1.23 taken 1.19 help pay 1.15
credit report 1.23 person loan 1.19 severe 1.15
hospital 1.23 and just 1.18 age 1.15
work with 1.23 old 1.18 and start 1.15
honest 1.23 care 1.18 list 1.15
ask for 1.23 oper 1.18 file 1.15
our home 1.23 present 1.18 the mortgage 1.15
item 1.23 and can 1.18 issue 1.15
let 1.22 pleas 1.18 america 1.15
which will 1.22 the new 1.18 loan monthly 1.15
you for 1.22 payment other 1.18 the time 1.15
custom 1.22 overtime 1.17 begin 1.15
loan pay 1.22 prosper loan 1.17 can't 1.15
that need 1.22 interest loan 1.17 can get 1.14
everything 1.22 i'm not 1.17 check 1.14
kid 1.22 off all 1.17 obtain 1.14
and was 1.21 ha been 1.17 july 1.14
report 1.21 her 1.17 review 1.14
hi 1.21 why 1.17 dollar 1.14
and need 1.21 this time 1.17 total expense 1.14
for financial 1.21 him 1.17 went 1.14
greatly 1.21 total 1.17 website 1.14
need this 1.21 mortgage rent 1.17 did not 1.14
who 1.13 expense total 1.12 money pay 1.10
know 1.13 loan need 1.12 score 1.10
the reason 1.13 where 1.12 bring 1.10
remove 1.13 for year 1.12 and ha 1.10
date 1.13 will allow 1.12 service 1.10
been with 1.13 once 1.12 have work 1.10
these 1.13 can see 1.12 found 1.10
contract 1.13 and get 1.12 stay 1.10
and that 1.13 loan which 1.12 oblig 1.10
self 1.13 leave 1.12 get out 1.10
what 1.13 owner 1.12 open 1.10
our credit 1.13 loan help 1.12 all the 1.10
back the 1.13 their 1.11 profit 1.10
they 1.13 turn 1.11 from the 1.10
for month 1.12 not have 1.11 leas 1.11
payment time 1.12 month for 1.11 year the 1.10
request 1.12 wonder 1.11 time the 1.10
due the 1.12 that wa 1.11 tri 1.10
deposit 1.12 gone 1.11 mine 1.10
due 1.12 did 1.11 take the 1.10
each month 1.12 give 1.11 abl pay 1.10
perfect 1.12 our 1.10 love 1.10
afford 1.12 mortgage 1.10 and now 1.10
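The ratios in Tables A4a and A4b compare each n-gram's relative frequency across loan outcomes. A minimal sketch of computing p(word|defaulted)/p(word|repaid), assuming toy corpora and add-one (Laplace) smoothing, which the tables do not specify:

```python
# Sketch with toy corpora: the frequency ratio behind Tables A4a/A4b.
from collections import Counter

defaulted = ["need help pay bill", "god bless please help", "payday loan need help"]
repaid = ["lower interest rate", "pay off credit card", "reinvest lend interest"]

def word_stats(docs):
    counts = Counter(w for d in docs for w in d.split())
    return counts, sum(counts.values())

d_counts, d_total = word_stats(defaulted)
r_counts, r_total = word_stats(repaid)
vocab = set(d_counts) | set(r_counts)

def ratio(word):
    # Add-one smoothing keeps the ratio finite for words unseen in one class.
    p_d = (d_counts[word] + 1) / (d_total + len(vocab))
    p_r = (r_counts[word] + 1) / (r_total + len(vocab))
    return p_d / p_r

# Words over-represented in defaulted requests (ratio >= 1.1), as in Table A4b
flagged = sorted((w for w in vocab if ratio(w) >= 1.1), key=ratio, reverse=True)
print(flagged[:5])
```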
Table A5: Top 120 variables with the highest importance in the Random Forest analysis
Variables Importance Variables Importance Variables Importance Variables Importance
LenderRate 0.056502 SMOG 0.003126 whi you 0.002529 look 0.002311
Credit Grade A 0.041125 get back 0.003051 abl 0.002518 Gender Female 0.002301
Credit Grade HR 0.027996 daughter 0.003031 graduat 0.002515 total 0.002292
Amount Requested 0.018964 our 0.003023 never 0.002509 low 0.002280
Credit Grade E 0.012061 student loan 0.003009 famili 0.002508 bill and 0.002247
Credit Grade AA 0.010833 explain what 0.002992 will abl 0.002507 prosper loan 0.002226
Borrower Homeownership 0.009590 start 0.002965 colleg 0.002505 you for 0.002223
Credit Grade D 0.007955 behind 0.002953 for year 0.002503 son 0.002221
Prior Listings 0.007312 due 0.002948 gas 0.002469 pay thi 0.002218
payday loan 0.005989 interest rate 0.002919 last 0.002446 report 0.002217
Credit Grade C 0.005112 view list 0.002909 medic 0.002440 paid off 0.002210
Far West 0.005093 lend 0.002902 fund 0.002436 old 0.002208
busi 0.005073 Spell Checker 0.002831 what 0.002431 list 0.002200
Middle East 0.005069 card debt 0.002821 tri 0.002412 Rocky Mountain 0.002183
borrow 0.004620 again 0.002792 loan and 0.002404 even 0.002171
invest 0.004574 Age 0.002774 live 0.002397 use thi 0.002169
than 0.004494 whi 0.002758 they 0.002394 god 0.002164
Debt To Income 0.004327 them 0.002754 mortgag 0.002374 compani 0.002163
# of words in title 0.004297 balanc 0.002734 pay for 0.002370 ha been 0.002161
% words with 6 or more letters 0.004277 with the 0.002706 reinvest 0.002370 have never 0.002150
hard 0.004270 estat 0.002705 who 0.002368 the balanc 0.002148
Race - White 0.004196 what you 0.002687 and will 0.002366 these 0.002148
thank you 0.004131 you are 0.002673 real estat 0.002349 see 0.002116
person 0.004131 pay back 0.002654 into 0.002346 the time 0.002115
payday 0.004053 pleas 0.002644 more than 0.002341 know 0.002113
# of words in Description 0.003848 back thi 0.002618 just need 0.002339 rather 0.002109
explain 0.003794 Race - African American 0.002569 purchas 0.002335 promis 0.002103
Gender - Male 0.003756 score 0.002556 total monthli 0.002334 hi 0.002093
save 0.003476 Plains Regions 0.002548 plan 0.002328 support 0.002090
student 0.003193 and the 0.002537 husband 0.002315 give 0.002085
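The importance scores in Table A5 are standard impurity-based Random Forest variable importances. A sketch on synthetic data showing how such scores are obtained; the variable names mimic the paper's mix of financial variables and text-derived counts but are illustrative only, not the real dataset:

```python
# Sketch with synthetic data: impurity-based feature importances of the
# kind reported in Table A5 (hypothetical variables, not the Prosper data).
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
n = 500
lender_rate = rng.uniform(0.05, 0.35, n)      # toy interest rates
amount = rng.uniform(1_000, 25_000, n)        # toy requested amounts
word_payday = rng.poisson(0.3, n)             # toy count of "payday" mentions
X = np.column_stack([lender_rate, amount, word_payday])

# In this toy setup, default is driven mainly by the interest rate.
y = (lender_rate + 0.1 * rng.standard_normal(n) > 0.2).astype(int)

rf = RandomForestClassifier(n_estimators=200, random_state=0).fit(X, y)
for name, imp in zip(["LenderRate", "AmountRequested", "payday"],
                     rf.feature_importances_):
    print(f"{name}: {imp:.3f}")
```

Importances sum to one across features, so each score is a relative share of the forest's total impurity reduction.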
Table A6: Summary statistics of dataset of all loan requests (n = 122,479)
Variables                               Min     Max     Mean     SD
Amount requested                        1,000   25,000  7,411.1  6,189.4
Debt-to-income ratio                    0       10.01   .54      1.33
Lender interest rate                    0       .350    .196     .092
Number of words in description          1       782     171.6    122.96
# Prior Listings                        0       67      0.90     2.06

Variables                               Freq.
Credit grade: AA                        0.026
Credit grade: A                         0.034
Credit grade: B                         0.055
Credit grade: C                         0.105
Credit grade: D                         0.160
Credit grade: E                         0.181
Credit grade: HR                        0.436
Loan status (1 = Funded, 0 = Expired)   0.150
Loan image dummy                        0.498
Home owner dummy                        0.357
Table A7a: Bi-grams that appeared more frequently in funded loans
Bi-gram (funded) Ratio Bi-gram (funded) Ratio Bi-gram (funded) Ratio reinvest 4.70 com 1.73 averag 1.53 relist 3.36 below 1.71 off with 1.53 prosper lender 3.36 for prosper 1.71 add 1.53 excel credit 2.54 lender 1.71 review 1.53 prosper and 2.51 consult 1.70 remain 1.53 total expens 2.47 post 1.70 properti 1.53 bid 2.46 loan thank 1.69 list and 1.52 thi prosper 2.36 loan request 1.69 entir 1.52 group 2.27 miss payment 1.68 expect 1.52 feel free 2.24 fund 1.67 cover 1.52 invest 2.22 fulli 1.67 than the 1.51 card balanc 2.21 earli 1.66 i'll 1.51 revolv 2.19 previou 1.66 solid 1.51 question 2.17 www 1.65 addit 1.50 dti 2.12 have over 1.64 prosper will 1.50 have excel 2.11 public 1.64 spend 1.50 after tax 2.10 grade 1.64 share 1.49 verifi 2.09 intend 1.64 the minimum 1.49 ani question 2.08 and plan 1.64 annual 1.49 lend 2.07 can see 1.63 base 1.48 america 2.04 prosper payment 1.62 your consider 1.48 from prosper 2.03 have never 1.62 the cost 1.48 for consid 2.03 develop 1.62 never been 1.48 cash flow 2.02 late payment 1.62 tax 1.48 prosper loan 2.01 line 1.62 price 1.48 i'd 1.96 excel 1.61 comput 1.47 flow 1.95 the balanc 1.61 request 1.47 rental 1.94 plan pay 1.61 capit 1.47 equiti 1.92 monthli incom 1.61 paid full 1.47 cover the 1.89 expens are 1.60 site 1.47 bonu 1.89 you have 1.60 you can 1.47 origin 1.87 see have 1.60 avail 1.47 list 1.87 not includ 1.59 summer 1.46 incom after 1.86 lower interest 1.58 wed 1.46 misc 1.82 delinqu 1.58 system 1.46 answer 1.81 the remain 1.58 experi 1.46 record 1.81 firm 1.58 travel 1.45 contribut 1.81 easili 1.58 and thank 1.45 thank for 1.80 figur 1.57 cost 1.45 balanc 1.80 approxim 1.57 payment thi 1.45 borrow 1.79 side 1.56 plan 1.45 the prosper 1.79 usual 1.56 for your 1.45 card with 1.77 cash 1.56 worth 1.45 project 1.77 commun 1.56 minimum payment 1.44 with prosper 1.76 profession 1.56 schedul 1.44 engin 1.75 left over 1.55 earn 1.44 default 1.75 univers 1.55 for view 1.44 gross 1.74 case 1.54 student loan 1.44 never miss 1.74 
minimum 1.54 cell 1.43 rather 1.74 valu 1.53 term 1.43
again for 1.43 follow 1.34 the amount 1.28 replac 1.43 account and 1.34 incom from 1.28 ratio 1.42 salari 1.34 toward 1.28 activ 1.42 decid 1.34 sale 1.27 profil 1.41 consider 1.34 abov 1.27 save 1.41 down the 1.34 paid off 1.27 requir 1.41 detail 1.34 payment the 1.27 loan becausei 1.41 amount 1.33 thank you 1.27 loan from 1.41 june 1.33 such 1.27 gener 1.41 the bank 1.33 deduct 1.27 enjoy 1.41 graduat 1.33 client 1.27 hello 1.40 for take 1.33 from the 1.27 rate and 1.40 low 1.33 free 1.27 the first 1.40 websit 1.33 between 1.27 lower 1.40 loan credit 1.33 take the 1.27 view list 1.40 save and 1.33 the next 1.26 number 1.40 alreadi 1.32 higher interest 1.26 debt incom 1.40 loan with 1.32 ga 1.26 have ani 1.40 top prioriti 1.32 each 1.26 been late 1.39 off thi 1.32 teacher 1.26 guarante 1.39 etc 1.32 short 1.26 first 1.39 quickli 1.32 less 1.26 see 1.39 your 1.31 higher 1.26 view 1.39 than 1.31 member 1.26 you for 1.39 paid for 1.31 have veri 1.26 immedi 1.38 each month 1.31 more than 1.26 bank 1.38 book 1.31 extrem 1.25 profit 1.38 combin 1.31 account 1.25 credit histori 1.38 includ 1.31 use credit 1.25 student 1.38 juli 1.31 limit 1.25 will paid 1.38 total 1.30 histori 1.25 have credit 1.37 contact 1.30 emerg 1.25 inform 1.37 interest rate 1.30 offic 1.25 groceri 1.37 never 1.30 level 1.25 car insur 1.37 custom 1.29 last 1.25 have alreadi 1.37 over year 1.29 ani 1.25 year the 1.36 goe 1.29 thought 1.25 one the 1.36 pretti 1.29 card have 1.25 wife' 1.36 the other 1.29 expand 1.25 payment prosper 1.36 electr 1.29 strong 1.24 less than 1.36 three year 1.29 the end 1.24 here 1.35 larg 1.29 charg 1.24 alway pay 1.35 the interest 1.29 i'm not 1.24 citi 1.35 automat 1.29 teach 1.24 incom ratio 1.35 should 1.29 becausei 1.24 pay down 1.35 servic 1.29 real estat 1.24 market 1.35 pictur 1.28 the follow 1.24 note 1.35 miss 1.28 complet 1.24 cell phone 1.34 increas 1.28 oper 1.24 bought 1.34 risk 1.28 payment will 1.24 reduc 1.34 sourc 1.28 late 1.24 current have 1.34 the 
busi 1.28 month that 1.24
dure the 1.24 top 1.18 four 1.14 close 1.24 contract 1.18 reflect 1.14 with thi 1.24 around 1.18 fact 1.14 locat 1.24 everyon 1.18 long 1.14 estat 1.24 area 1.18 rest 1.14 local 1.23 manag 1.17 remov 1.14 offer 1.23 cours 1.17 while 1.14 save for 1.23 far 1.17 wife and 1.14 room 1.23 over the 1.17 the rest 1.14 actual 1.23 the fund 1.17 the credit 1.14 sold 1.23 industri 1.17 process 1.14 bit 1.22 promot 1.17 may 1.14 half 1.22 field 1.16 class 1.14 success 1.22 deal 1.16 month ago 1.14 auto 1.22 after 1.16 well 1.13 consid 1.22 purchas 1.16 oblig 1.13 return 1.22 month have 1.16 show 1.13 use for 1.22 those 1.16 loan which 1.13 about month 1.22 almost year 1.16 sinc 1.13 payment have 1.22 the compani 1.16 doe not 1.13 within 1.21 use the 1.16 approx 1.13 next 1.21 the year 1.16 apart 1.13 i'v 1.21 current employ 1.15 that the 1.13 i'v been 1.21 your time 1.15 month for 1.13 the monthli 1.21 year now 1.15 most 1.12 mean 1.21 quit 1.15 read 1.12 prioriti 1.21 tuition 1.15 stabl job 1.12 friend 1.21 wonder 1.15 lesson 1.12 card debt 1.21 period 1.15 same 1.12 next year 1.21 continu 1.15 appli 1.12 time everi 1.21 licens 1.15 time the 1.12 owner 1.21 sell 1.15 year have 1.12 are paid 1.21 compani for 1.15 further 1.12 about year 1.20 except 1.15 debt free 1.12 the new 1.20 compani 1.15 the last 1.12 employe 1.20 drive 1.15 the time 1.12 extra 1.20 major 1.15 expens and 1.12 leas 1.20 benefit 1.15 into the 1.12 per month 1.20 been employ 1.14 interest credit 1.12 year with 1.20 reason 1.14 and i'm 1.12 per 1.19 ad 1.14 normal 1.12 happi 1.19 build 1.14 finish 1.12 for over 1.19 card that 1.14 three 1.12 thi year 1.19 car payment 1.14 that thi 1.11 water 1.19 big 1.14 seek 1.11 real 1.19 think 1.14 dure 1.11 work the 1.19 perfect 1.14 posit 1.11 commit 1.19 thi debt 1.14 retir 1.11 veri respons 1.19 both 1.14 young 1.11 loan off 1.19 off the 1.14 everi month 1.11 grow 1.19 two year 1.14 august 1.11 refin 1.18 payment for 1.14 time have 1.11
the high 1.11 doe 1.10 their 1.10 own 1.11 elimin 1.10 store 1.10 unexpect 1.11 bill time 1.10 two 1.10 few month 1.11 product 1.10 for our 1.10 feel 1.11 good credit 1.10 prior 1.10 down 1.11 incom and 1.10 mortgag 1.10 within the 1.11 last year 1.10 coupl 1.10 result 1.11 almost 1.10 attend 1.10 the same 1.11 own home 1.10 the mortgag 1.10 possibl 1.11
Table A7b: Bi-grams that appeared more frequently in unfunded loans
Bi-gram (unfunded) Ratio Bi-gram (unfunded) Ratio Bi-gram (unfunded) Ratio situat explain 4.18 debt financi 1.70 and help 1.41 loan explain 4.12 budget mortgag 1.68 god 1.41 explain what 3.90 off all 1.68 stress 1.40 explain whi 3.88 help get 1.67 like pay 1.40 whi you 3.84 pleas help 1.67 that can 1.40 what you 3.74 payment other 1.63 get thi 1.40 for financi 3.69 veri hard 1.63 son 1.38 for pay 3.55 take care 1.62 into one 1.38 are good 3.50 off debt 1.62 loan pay 1.38 explain 3.49 hard 1.62 afford 1.38 you are 3.26 divorc 1.62 back the 1.38 back thi 3.09 mother 1.60 money pay 1.38 catch 3.06 need thi 1.60 get out 1.38 you will 2.58 medic bill 1.60 off credit 1.37 get back 2.54 honest 1.60 medic 1.37 back track 2.53 can't 1.59 payday 1.37 pay back 2.44 tri 1.58 job and 1.37 loan monthli 2.35 good job 1.57 loan would 1.36 someon 2.29 and just 1.56 the bill 1.36 loan for 2.28 rent insur 1.55 loan consolid 1.36 use thi 2.27 need pay 1.51 expens total 1.36 behind 2.26 and need 1.51 child support 1.36 chanc 2.24 clean 1.5 help pay 1.35 just need 2.21 can get 1.5 got 1.35 whi 2.21 ahead 1.49 have had 1.35 off some 2.01 children 1.48 famili 1.35 tri get 1.97 thing 1.48 better 1.34 worker 1.97 prove 1.48 have learn 1.34 one payment 1.97 surgeri 1.48 rebuild credit 1.34 loan need 1.90 everyth 1.47 mistak 1.34 track 1.89 mom 1.47 given 1.33 bill and 1.86 kid 1.46 singl 1.33 need help 1.84 mortgag rent 1.45 our credit 1.33 what 1.79 daughter 1.45 and want 1.32 lost 1.78 and had 1.44 debt that 1.32 bad 1.77 rebuild 1.44 clear 1.32 dont 1.73 want pay 1.43 went 1.32 and start 1.73 work hard 1.43 child 1.32 payday loan 1.73 can pay 1.43 debt and 1.32 and get 1.71 hard work 1.41 abl pay 1.32
when wa 1.31 and wa 1.23 car and 1.17 life 1.30 loan back 1.23 not have 1.16 give 1.30 realli 1.23 issu 1.16 help out 1.29 ask for 1.23 them 1.16 will abl 1.29 work and 1.23 could 1.16 care 1.29 all credit 1.22 the past 1.16 all debt 1.29 stand 1.22 have steadi 1.16 and now 1.29 use help 1.22 will make 1.16 and can 1.28 situat have 1.22 consolid credit 1.16 difficult 1.28 and make 1.22 year 1.15 paycheck 1.28 monthli budget 1.21 off high 1.15 will help 1.28 have made 1.21 month monthli 1.15 pay them 1.28 truck 1.21 would like 1.15 consolid 1.28 year monthli 1.21 she 1.15 turn 1.27 husband and 1.20 loan financi 1.14 hospit 1.27 improv credit 1.20 the best 1.14 start 1.27 him 1.20 sever 1.14 help with 1.27 pass 1.20 assist 1.14 seem 1.27 happen 1.20 job for 1.14 past 1.27 due 1.20 file 1.14 know that 1.26 father 1.20 and would 1.14 husband 1.26 myself 1.20 will pay 1.13 outstand 1.26 time job 1.20 and credit 1.13 caus 1.26 problem 1.19 her 1.13 want 1.26 have some 1.19 ga util 1.13 dream 1.25 all the 1.19 that have 1.13 right now 1.25 now and 1.19 say 1.13 but have 1.25 the opportun 1.19 loan payment 1.13 off and 1.24 and that 1.18 school and 1.13 know 1.24 becaus the 1.18 look for 1.13 job with 1.24 and work 1.18 find 1.13 decis 1.24 now have 1.18 all our 1.13 have one 1.24 old 1.18 and also 1.13 the purpos 1.24 repair 1.17 meet 1.13 some credit 1.24 live with 1.17 someth 1.12 work full 1.24 and pay 1.17 god bless 1.12 that are 1.24 bring 1.17 pay bill 1.12 that need 1.23 need the 1.17 card financi 1.12 all bill 1.23 gone 1.17 hi 1.12 bless 1.23 greatli 1.17 payoff 1.12
credit score 1.12 would have 1.11 let 1.11 order 1.12 onc 1.11 money and 1.10 work with 1.12 becom 1.11 and not 1.10 money for 1.12 obtain 1.11 loan have 1.10 direct 1.12 until 1.11 steadi 1.10 unfortun 1.12 taken 1.11 and take 1.10 pleas 1.11 and then 1.11 vehicl 1.10 that would 1.11 won't 1.11 loan the 1.10 incur 1.11 credit score 1.12 week 1.10 pay all 1.11 score 1.11
Table A8: Top 30 words with the highest relevance measure for each LDA topic

Topic: Employment and School (Relevance)   Topic: Interest Rate Reduction (Relevance)   Topic: Expense Explanation (Relevance)
work -0.42678 debt 1.12548 expens 0.97497 job -0.42930 interest 0.98818 explain 0.79611 full -0.86215 rate 0.95097 cloth 0.77201 school -0.89078 high 0.93643 entertain 0.75424 year -1.07493 consolid 0.87551 cabl 0.75197 colleg -1.09435 score 0.86722 whi 0.70915 incom -1.14446 improv 0.74883 util 0.70817 employ -1.14622 lower 0.69804 insur 0.60867 student -1.21860 balanc 0.66615 monthli 0.39539 part -1.30739 card 0.64182 cardsmi 0.18691 financi -1.33920 histori 0.62272 purpos 0.17408 steadi -1.44914 higher 0.61019 billsmi 0.15935 stabl -1.46063 payoff 0.59608 hous 0.15739 graduat -1.48860 reduc 0.53922 debtmi 0.10787 loan -1.50922 elimin 0.52508 monthmonthli 0.06338 secur -1.53972 minimum 0.51484 timemonthli -0.00063 degre -1.56564 outstand 0.51241 situat -0.00922 monthli -1.58314 low 0.50942 incomemonthli -0.02425 educ -1.59005 rid 0.50788 iam -0.02954 hour -1.60727 ratio 0.50312 loansmi -0.04908 retir -1.62923 goal 0.46530 card -0.06722 finish -1.63329 revolv 0.44719 loanmi -0.06818 veri -1.63949 refin 0.43334 paymentmi -0.08156 start -1.65941 incur 0.39694 businessmi -0.09741 repair -1.68777 oblig 0.37474 consolidationmi -0.10197 thi -1.70717 default 0.36558 loanmonthli -0.11064 wed -1.73917 faster 0.34038 debtsmi -0.12716 summer -1.76588 sooner 0.32769 incom -0.15812 career -1.78935 miss 0.32551 yearsmonthli -0.16602 buy -1.79148 apr 0.31495 cardmi -0.16639
Topic: Business and Real Estate (Relevance)   Topic: Family (Relevance)   Topic: Loan Details and Explanations (Relevance)   Topic: Monthly Payment (Relevance)
busi -0.72468 bill -0.80362 loan 0.02223 month -0.46588 purchas -1.22119 tri -0.97478 thi -0.23886 payment -0.62683 compani -1.24243 famili -1.06365 becaus -0.39797 paid -0.82288 invest -1.45524 life -1.13109 candid -0.47236 total -1.03183 fund -1.49916 husband -1.19515 situat -0.91190 account -1.03609 addit -1.51557 medic -1.21212 financi -0.92870 rent -1.04262 properti -1.55064 thing -1.29017 purpos -0.93624 mortgag -1.06368 market -1.56694 littl -1.34994 hous -0.98446 save -1.14949 build -1.57312 realli -1.35012 expens -1.11843 list -1.23411 cost -1.61591 care -1.37252 monthli -1.12610 everi -1.27680 sell -1.62173 give -1.38193 incom -1.55843 payday -1.29964 servic -1.62783 children -1.41594 entertain -1.84658 budget -1.32913 sale -1.64196 hard -1.43166 cloth -1.90259 report -1.33987 real -1.71844 daughter -1.44836 cabl -1.92371 bank -1.36469 alreadi -1.74378 chanc -1.45170 util -1.95598 tax -1.39207 success -1.77930 son -1.46681 insur -2.02035 includ -1.42280 open -1.77941 money -1.50048 bill -2.17176 current -1.43029 provid -1.78311 divorc -1.52915 card -2.20032 amount -1.48126 base -1.79599 move -1.54901 canid -2.73682 check -1.51827 home -1.82519 track -1.55675 ontim -3.59224 delinqu -1.55218 estat -1.83556 everyth -1.55791 honest -3.98226 amp -1.55378 profit -1.85000 abl -1.57058 thanksmonthli -4.09312 onli -1.55612 grow -1.86056 bad -1.57478 alway -4.30764 day -1.60811 offic -1.87716 mother -1.58898 vacat -4.30948 left -1.60841 run -1.88103 home -1.63274 buy -4.83280 sinc -1.69568 product -1.88769 child -1.64554 payback -4.97132 bankruptci -1.69810 store -1.89849 kid -1.67359 trustworthi -5.35212 fee -1.72871 rental -1.90456 put -1.67912 fix -5.43237 auto -1.74107 industri -1.91004 pleas -1.68211 catch -6.89838 owe -1.75320 area -1.92412 live -1.68560 track -7.09870 rebuild -1.76605
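Table A8 ranks words within each topic by a relevance measure. Assuming it follows the relevance definition of Sievert and Shirley (2014), relevance(w, t | λ) = λ·log p(w|t) + (1 − λ)·log[p(w|t)/p(w)], a toy computation looks like this (vocabulary and topic-word probabilities are invented for illustration):

```python
# Sketch: term relevance in the style of Sievert and Shirley (2014),
# assumed to underlie Table A8. All probabilities below are toy values.
import numpy as np

vocab = ["debt", "interest", "work", "school", "family"]
# Toy topic-word distributions p(w|t) for two hypothetical topics.
phi = np.array([[0.40, 0.35, 0.05, 0.05, 0.15],   # "debt consolidation" topic
                [0.05, 0.05, 0.45, 0.35, 0.10]])  # "employment/school" topic
p_w = phi.mean(axis=0)  # marginal word probability p(w)

def relevance(topic, lam=0.6):
    # lam trades off raw topic probability against lift p(w|t)/p(w).
    return lam * np.log(phi[topic]) + (1 - lam) * np.log(phi[topic] / p_w)

top = [vocab[i] for i in np.argsort(relevance(0))[::-1]]
print(top)  # vocabulary ranked by relevance for the first topic
```

Relevance values are log-scale quantities and can be negative, which is consistent with the signs in Table A8.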
Figure A1: LDA analysis – selecting the number of topics based on perplexity
Note: we measure perplexity as exp(−L(w)/N), where L(w) is the log-likelihood of the test data documents and N is the total number of words in those documents. Thus, perplexity is decreasing in likelihood (lower perplexity means better fit).
[Figure: line plot of perplexity (y-axis, approximately 285 to 320) against the number of topics (x-axis, 2 to 30).]
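The perplexity-based model selection in Figure A1 can be sketched with scikit-learn on toy documents; the paper's corpus and LDA implementation may differ, so this only illustrates the procedure of fitting LDA for several topic counts and comparing held-in perplexity:

```python
# Sketch: choosing the number of LDA topics by perplexity, exp(-L(w)/N),
# on toy documents (not the paper's Prosper corpus).
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

docs = ["pay off credit card debt", "consolidate debt lower rate",
        "need help pay medical bill", "family medical bill help",
        "work full time steady job", "student loan school tuition"]
X = CountVectorizer().fit_transform(docs)

perp = {}
for k in (2, 3, 4):
    lda = LatentDirichletAllocation(n_components=k, random_state=0).fit(X)
    # perplexity() returns exp(-log-likelihood per word) on the given data.
    perp[k] = lda.perplexity(X)
    print(k, round(perp[k], 1))
```

In practice the perplexity would be computed on held-out documents, as the figure's note describes, and the topic count with the lowest value chosen.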