When Words Sweat:

Identifying Signals for Loan Default in the Text of Loan Applications

Oded Netzer, Associate Professor of Marketing, Columbia Business School, Columbia University

Alain Lemaire, Doctoral Student, Columbia Business School, Columbia University

Michal Herzenstein, Associate Professor of Marketing, Lerner College of Business and Economics, University of Delaware

March 2018

Equal authorship. We thank Columbia Business School and the Lerner College at the University of Delaware for financial support.

Abstract

The authors present empirical evidence that borrowers, consciously or not, leave traces of their

intentions, circumstances, and personality traits in the text they write when applying for a loan.

This textual information has a substantial and significant ability to predict whether borrowers

will pay back the loan over and beyond the financial and demographic variables commonly used

in models predicting default. The authors use text-mining and machine-learning tools to

automatically process and analyze the raw text in over 120,000 loan requests from Prosper.com,

an online crowdfunding platform. The authors find that loan requests written by defaulting

borrowers are more likely to include words related to their family, mentions of God, the

borrower’s financial and general hardship, pleading lenders for help, and short-term focused

words. The authors further observe that defaulting loan requests are written in a manner

consistent with the writing style of extroverts and liars. Using a counterfactual analysis, the

authors demonstrate that applying their finding can yield a 9.7% additional return on investment.

Keywords: text mining, consumer finance, loan default

Imagine you consider lending $2,000 to one of two borrowers on a crowdfunding

website. The borrowers are identical in terms of their demographic and financial characteristics,

the amount of money they wish to borrow, and the reason for borrowing the money. However,

the text they provided when applying for a loan differs: Borrower #1 writes “I am a hard working

person, married for 25 years, and have two wonderful boys. Please let me explain why I need

help. I would use the $2,000 loan to fix our roof. Thank you, god bless you, and I promise to pay

you back.” while borrower #2 writes “While the past year in our new place has been more than

great, the roof is now leaking and I need to borrow $2,000 to cover the cost of the repair. I pay

all bills (e.g., car loans, cable, utilities) on time.” Who is more likely to default on her loan? This

question is at the center of our research, as we investigate the power of words in predicting loan

default. We claim and show that the text borrowers write at loan origination provides valuable

information that cannot be otherwise extracted from the typical data lenders have on borrowers

(which mostly include financial and demographic data), and that additional information is crucial

to predictions of default. The idea that text written by borrowers can predict their loan default

builds on recent research showing that text is indicative of people’s psychological states, traits,

opinions, and situations (e.g., Humphreys and Wang 2018; Matz and Netzer 2017), and that it

provides information on consumers’ and markets’ behaviors, such as consumers’ top-of-mind

associations (Netzer et al. 2012) and stock prices (Tirunillai and Tellis 2012).

In essence, the decision whether or not to grant a loan depends on the lender’s assessment

of the borrower’s ability to repay the loan. But this assessment is often difficult because loans are

repaid over a lengthy period of time, during which unforeseen circumstances may arise. For that

reason, traditional lenders (such as banks) and researchers collect and process as many pieces of

information as possible within this tightly regulated industry1. These pieces of information can

be classified into four categories: (1) The borrower’s financial strength, which is reflected by

one’s credit history, FICO score, income, and debt (Mayer, Pence, and Sherlund 2009), and is

most telling of the borrower’s likelihood of default (Avery et al. 2000); (2) Demographics, such as

gender or geographic location (Rugh and Massey 2010); (3) Information related to the loan

itself—the amount that is being borrowed and the interest rate (Gross et al. 2009); (4) Everything

else that can be learned from human interactions between borrowers and people at the loan

granting institutions. Indeed, Agarwal and Hauswald (2010) found that supplementing the loan

application process with the human touch of loan officers significantly decreases the default rate due

to better screening and higher interpersonal commitment from borrowers. However, these

interactions are often laborious and expensive.

Indeed, in recent years, human interactions between borrowers and lenders have been

largely replaced by online lending platforms operated by banks, other lending institutions, or

crowdfunding platforms. In such environments, the role of both hard and soft pieces of

information becomes crucial. Accordingly, our main proposition in this paper is that the text

potential borrowers write when requesting an online crowdfunded loan provides additional

important information on the borrower, such as their intentions, personality, and true

circumstances—information that cannot be deduced from the financial and demographic data

alone and in a sense analogous to body language (or unspoken language) detected by loan

officers. Furthermore, because the evaluation of the borrower by a loan officer has been shown

to provide additional information over and beyond the hard financial information (Agarwal and

Hauswald 2010) we contend that in a similar vein, the text borrowers write when applying for a

1 Indeed, our personal discussions with data scientists at banks and other lending institutions suggest that they analyze as many as several thousand data points.

crowdfunded loan is predictive of loan default above and beyond all other available information.

This hypothesis extends the idea that our demeanor can be a manifestation of our true intentions

(DePaulo et al. 2003) into the text we write.

Going back to the opening example, to answer the question “who is more likely to

default?” we apply text-mining and machine-learning tools to a dataset of over 120,000 loan

requests from the crowdfunding platform Prosper. Using an ensemble stacking approach that

includes tree-based methods and regularized logistic regressions, we find that the textual

information significantly improves predictions of default, increasing lenders’ ROI over an

approach that uses only financial and demographic information by as much as 9.7%. To learn

which words, writing styles, and general ideas conveyed by the text are more likely to be

associated with defaulted loan requests, we further analyzed the data using a multi-method

approach including a naïve Bayes model, an L1-regularized binary logistic model, a latent Dirichlet

allocation (LDA) analysis, and the Linguistic Inquiry and Word Count dictionary (LIWC;

Tausczik and Pennebaker 2010). Results across the first three analyses consistently show that

loan requests written by defaulting borrowers are more likely to include words (or themes)

related to the borrower’s family, financial and general hardship, mentions of God, and the near

future, as well as pleading lenders for help, and using verbs in present and future tenses.

Therefore, the text and writing style of borrower #1 in our opening example suggest this person

is more likely to default. In fact, all else equal, our analysis shows that based on the loan request

text, borrower #1 is approximately eight times more likely to default relative to borrower #2.

These analyses demonstrate the successful use of machine learning tools to go beyond mere

prediction, with the objective of inferring the words, topics, and writing styles that are most

associated with a behavioral outcome. Additionally, we applied the LIWC dictionary to our data,

to allow a deeper exploration into the potential traits and states of borrowers. Our results suggest

that defaulting loan requests are written in a manner consistent with the writing styles of

extroverts and liars. We do not claim that defaulting borrowers were intentionally deceptive

when they wrote the loan request; rather, we believe their writing style may have reflected,

intentionally or not, doubts in their ability to repay the loan.

Our examination into the manifestation of consumers’ personalities and states in the text

they write during loan origination contributes to the fast-growing literature in consumer financial

decision making. Recently, consumer researchers have started to investigate factors that encourage or

inhibit consumer saving (Dholakia et al. 2016; Soman and Cheema 2011), debt acquisition

(Herzenstein, Sonenshein, and Dholakia 2011), and repayment (Amar et al. 2011). Most of these

investigations have been done on a smaller scale, such as with experimental participants or

smaller datasets. We add to this literature by showing, on a large scale and with archival data,

how participants in the crowdfunding industry can better assess the risk of default by interpreting

and incorporating soft unverifiable data (the words borrowers write) into their analysis.

The rest of this paper is organized as follows. In the next section we discuss the

limitations of structured financial and demographic data in the context of consumer financial

decisions and the opportunity offered by leveraging textual information. Specifically, drawing on

extant literature we argue that credit scores and demographics miss important information about

borrowers that lenders can learn from mining the text these borrowers write. We then delineate

the data, our text-mining and modeling approaches, and results. We conclude by interpreting our

results and generalizing our approach to other contexts.

WHAT DO HARD DATA MISS? THE OPPORTUNITY IN TEXT

Financial data such as credit scores have been used extensively to predict consumers’

credit riskiness, which is the likelihood they will become seriously delinquent on their credit

obligations over the next two years. FICO, the dominant credit scoring model, is used by 90% of

all financial institutions in the U.S. in their decision-making process (according to

myFICO.com). In calculating consumers’ scores, FICO takes into account a multitude of data

points including past and current loans, and credit cards—their allowance and utilization as well

as any delinquencies that were reported to the credit bureau (by companies such as cable TV and cell

phone providers). However, while the usage of credit scores and financial information clearly

has its benefits in predicting consumer risk, these measures have been found to be insufficient

and even biased. For example, Avery et al. (2000) argue that the reliance of credit scores on

available consumers’ credit history misses important factors such as health status and length of

employment which are more forward looking in nature. Credit scores are a snapshot of the past

and do not reflect any (positive or negative) change that may occur in the future. The authors

find that those with low credit scores often benefit most from such additional data. Agarwal,

Skiba, and Tobacman (2009) provide further evidence for this deficiency of credit scores by

showing that the more flexible credit score systems, which weigh the different data sources that

comprise the score differently for each person, are eight times better at predicting loan default

than the more rigid systems such as the FICO score. Put differently, FICO scores might predict

future financial behavior well for some but not others. For example, if a person is facing some

unexpected health related expenditures that will ultimately drain their reserves and lower their

credit score, FICO will only know about it after it happens. A credit scoring system that

incorporates health status will predict the financial situation of that person better.

Analyzing the subprime mortgage default during 2006-7, Palmer (2015) and Sengupta

and Bhardwaj (2015) show that current contextual information such as loan characteristics (e.g.,

loan amount) and economic characteristics (e.g., housing prices) are often more predictive of

mortgage default than traditional financial measures such as FICO scores and debt-to-income ratio.

Thus, credit scores are missing important information that may be predictive of consumers’

financial health, and the heavy reliance on such scores might blind lending organizations

to additional sources of information.

Recognizing the limitation of financial measures, financial institutions and researchers

added other variables to their predictive models, mostly related to the individuals’ demographic,

geographic, and psychographic characteristics (e.g., Barber and Odean 2001; Pompian and Longo

2004). However, other characteristics, such as those related to emotional and mental states, as

well as personalities, have been found to be closely tied to financial behaviors and outcomes

(Norvilitis et al. 2006; Rustichini et al. 2016), yet are still missing from those predictive models.

The problem of reliance on purely structured financial and demographic information is even more

severe in the growing realm of crowdfunded unsecured loans, because human interaction between

lenders and borrowers is scarce. Therein lies the difficulty: personalities and mental states are

difficult, if not impossible, to infer from demographic or financial data alone; hence the pressing

need to go beyond such traditional data when attempting to predict behavior. We suggest and

demonstrate that the text borrowers write at loan origination can provide the much needed

supplemental information. Specifically, this textual information can be useful in understanding

not only the past behavior of the borrower, but also the present context of the loan and the

borrower’s future intentions.

WORDS, INDIVIDUAL CHARACTERISTICS, AND FINANCIAL BEHAVIORS

Our proposition that text can predict default builds on research in marketing, psychology,

linguistics, and finance that establishes two connections: (1) word usage and writing styles are

indicative of some stable inner traits as well as more transient states; and (2) these traits and

states affect people’s financial behaviors. In this section we provide evidence for these two

connections.

The premise that the text may be indicative of deeper traits and emotional states is

predicated on the idea that there is a systematic relationship between the words people use and

their personality traits (Fast and Funder 2008; Hirsh and Peterson 2009), identities (McAdams

2001), and emotional states (Tausczik and Pennebaker 2010). The relationship between word

usage and personality traits has been found across multiple textual media such as essays about

the self (Hirsh and Peterson 2009), blogs (Yarkoni 2010), social media (Schwartz et al. 2013),

and daily speeches (Mehl, Gosling, and Pennebaker 2006). This relationship stems from the

human tendency to tell stories and express internal thoughts and emotions through these stories,

which are essentially made possible by language. Therefore, even if the content might be similar

across different individuals, the manner in which they convey that content differs.

Research over the last two decades has established the association of word usage with

the text originator’s personality, focusing on the big five personality traits—extraversion,

agreeableness, conscientiousness, neuroticism, and openness (Kosinski, Stillwell, and Graepel

2013; Pennebaker and Graybeal 2001; Pennebaker and King 1999), physical and mental health

(Preotiuc-Pietro et al. 2015), age and gender (Pennebaker and Stone 2003; Schwartz et al. 2013),

emotional state (Pennebaker, Mayne, and Francis 1997), impulsivity (Arya, Eckel, and Wichman

2013), and deception (Newman et al. 2003; Ott, Cardie, and Hancock 2012).

The aforementioned literature conveys the potential in analyzing text—the ability to

extract a wealth of information otherwise unobtainable in many settings. However, without

understanding how this information can shed light on financial behavior, it has little to no value

to our purpose. Indeed, many researchers examined people’s financial decision making from a

psychological or behavioral lens. For example, Anderson et al. (2011) show that credit scores are

correlated with the big five personality traits. Specifically, they find that both extroversion and

conscientiousness are negatively related to credit scores. Extroverts wish to have exciting lifestyles

and hence sometimes spend beyond their means. The result for conscientiousness is a bit

surprising because these are diligent and responsible people; however, their need for achievement

is high, which might induce a pattern of spending that exceeds their means. Berneth, Taylor,

and Walker (2011) found that agreeableness is negatively correlated with FICO scores because

these people aim to please and are less likely to say no to unnecessary expenses. The extent to

which such traits are predictive of financial behavior over and beyond credit scores is an

empirical question, which we investigate in this research.

Other individual characteristics have also been found to be related to financial behavior.

Arya, Eckel, and Wichman (2013) find that FICO scores are correlated with personality, time

and risk preference, trustworthiness, and impulsiveness; Nyhus and Webley (2001) show that

emotional stability, autonomy, and extroversion are robust predictors of saving and borrowing

behaviors; and Norvilitis et al. (2006) show that debt is related to delay of gratification. Financial

behaviors such as saving, taking loans, and credit card usage, as well as overall scores such as

FICO have also been found to be correlated with education (Nyhus and Webley 2001), financial

literacy (Fernandes, Lynch, and Netemeyer 2014), number of hours people work (Berneth,

Taylor, and Walker 2011), stress and physical wellbeing (Netemeyer et al. 2018), self-regulation

(Freitas et al. 2002), and even the number of social media connections (Wei et al. 2015).

In sum, we postulate that many of the behavioral characteristics and situational factors that

have been found to be related to financial behaviors, such as personality traits and future

intentions, can be extracted from the text borrowers write in their loan request. Therefore, that text

is an invaluable addition to the hard financial and demographic data when predicting loan default.

SETTINGS AND DATA

We examine the value of text in predicting default using data from Prosper.com, the first

online crowdfunding platform and currently the second largest in the United States, with over 2

million members and $10 billion in funded unsecured loans. On Prosper, potential borrowers

submit their request for a loan for a specific amount with a specific maximum interest rate they

are willing to pay, and lenders then bid in a Dutch-like auction on the lender rate for the loan. We

downloaded all loan requests posted between April 2007 and October 2008, a total of 137,952

listings. In October 2008, the Securities and Exchange Commission required Prosper to register

as a seller of investment, and when Prosper re-launched in July 2009 it made significant changes

to the platform. We chose data from the earlier days because it is richer and more diverse

particularly with respect to the textual information in the loan requests.

When posting a loan request on Prosper potential borrowers have to specify the loan

amount they wish to borrow (between $1,000 and $25,000 in our data), the maximum interest

rate they are willing to pay, and other personal information, such as debt to income ratio and

whether they are home owners. Prosper verifies all financial information including the potential

borrower’s credit score from Experian, and assigns each borrower a credit grade that reflects all

of this information. The possible credit grades are AA (lowest risk for lenders), A, B, C, D, E,

and HR (highest risk to lenders). Table A1 in the Web Appendix presents correspondence

between Prosper’s credit grades and FICO score. In addition, borrowers can upload as many

pictures as they wish, and use an open textbox to write any information they wish, with no length

restriction. The words borrowers write in that textbox are at the center of our research.

Because we are interested in predicting default, we focus on those loan requests that were

funded (19,446 requests), and of them only on those that included text (18,312 loans). The

default rate in our sample of loans is 33.1% (35% in all funded loans, n = 19,446).

We automatically text-mined the raw text in each loan application using the tm package

in R. Our textual unit is a loan application. For each loan application, we first tokenize each

word, a process that breaks down each loan application into the distinct words it contains. We

then use Porter’s stemming algorithm to collapse variations of words into one. For example,

“borrower,” “borrowed,” “borrowing,” and “borrowers” become “borrow”. In total, the loan

requests in our dataset have over 3.5 million words, corresponding to 30,920 unique words that

are at least 3 letters long (we excluded from our analysis numbers and symbols).2 In addition to

words/stems we also look at two-word combinations (an approach often referred to as n-gram, in

which for n = 2, we get bi-grams). To reduce the dimensionality of the textual data and avoid

more obscure or infrequent words, we focus our analyses on the most frequent stemmed words

and bi-grams that appeared in at least 400 loan requests. We are left with 1,032 bi-grams.3
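
The preprocessing steps described above (tokenizing, stemming, forming bi-grams, and keeping only terms that appear in enough loan requests) can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the authors' pipeline, which used the tm package in R: `crude_stem` is a hypothetical toy suffix stripper standing in for Porter's algorithm, the three example requests are invented, and a threshold of 2 requests replaces the threshold of 400 used in the paper.

```python
import re
from collections import Counter

def tokenize(text):
    """Lowercase the text and keep alphabetic tokens only (numbers and symbols are dropped)."""
    return re.findall(r"[a-z]+", text.lower())

def crude_stem(word):
    """Toy suffix stripper (assumption: a stand-in for Porter's algorithm, NOT the real stemmer)."""
    for suffix in ("ers", "ing", "ed", "er", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

def terms(text):
    """Return the stemmed words plus the bi-grams of one loan request."""
    stems = [crude_stem(w) for w in tokenize(text)]
    bigrams = [f"{a} {b}" for a, b in zip(stems, stems[1:])]
    return stems + bigrams

def frequent_terms(requests, min_requests):
    """Keep terms that appear in at least `min_requests` distinct loan requests."""
    doc_freq = Counter()
    for text in requests:
        doc_freq.update(set(terms(text)))  # count each term once per request
    return {t for t, n in doc_freq.items() if n >= min_requests}

# Invented example requests (not from the Prosper data).
requests = [
    "I need to borrow money to fix the roof.",
    "Borrowing money helped other borrowers fix a roof.",
    "The borrower borrowed money last year.",
]
vocab = frequent_terms(requests, min_requests=2)
```

Counting each term once per request, rather than summing raw occurrences, matches the selection rule stated in the text (terms that appeared in at least a given number of loan requests).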

Textual, Financial, and Demographic Variables

Our dependent variable is loan repayment/default as reported by Prosper 4 (binary: 1 =

paid in full, 0 = defaulted). Our data horizon ends in 2008, and all Prosper loans at the time were

to be repaid over three years or less, therefore we know whether each loan in our database was

repaid or defaulted—there are no other options. The set of independent variables we use includes

textual, financial, and demographic variables. We elaborate on each group next.

2 Because of stemming, words with fewer than 3 letters, such as “I,” may be kept due to longer stems (e.g., I’ve).

3 We checked the robustness of our analyses to increasing the number of words and bi-grams included in the analysis. Our results did not change qualitatively when we increased the number of bi-grams.

4 We classified a loan as “defaulted” if the loan status in Prosper is “Charge-off,” “Defaulted (Bankruptcy),” or “Defaulted (Delinquency).” We classified a loan as “paid” if it is labeled “Paid in full,” “Settled in full,” or “Paid.”

Textual variables. These variables include: (1) The number of characters in the title and

the textbox in the loan request. The length of the text has been associated with deception;

however, the evidence is inconclusive. Hancock et al. (2007) showed that liars wrote much more

when communicating via text messages than non-liars. Similarly, Ott, Cardie, and Hancock

(2012) demonstrated that fake hospitality reviews are wordier though less descriptive. However,

in the context of online dating websites, Toma and Hancock (2012) showed that shorter profiles

indicate the person is lying, because they wished to avoid certain topics. (2) The percent of

words with six or more letters. This metric is commonly used to measure complex language,

education level, and social status (Tausczik and Pennebaker 2010). More educated people are

likely to have higher income and higher levels of financial literacy and hence are less likely to

default on their loan, relative to less educated people (Nyhus and Webley 2001). But the use of

complex language can also be risky if readers of the text perceive it to be artificially or

frivolously complex. Indeed, Oppenheimer (2006) demonstrated, in the context of admission

essays, that if complex vocabulary is used superfluously, the author may face a detrimental

outcome, suggesting the higher language was likely used deceptively. (3) The Simple Measure of

Gobbledygook (SMOG; McLaughlin, 1969), which measures writing quality by mapping it to

number of years of formal education needed to easily understand the text in first reading. (4) A

count of spelling mistakes based on the Enchant spell checker, using the PyEnchant 1.6.6 package

in Python. Harkness (2016) shows that spelling mistakes are associated with a lower likelihood

of granting a loan in traditional channels because they serve as a proxy for characteristics

correlated with lower income. (5) The 1,032 bi-grams from the open textbox in each loan

application following the text mining process described earlier.
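
To make the first three textual variables concrete, the following Python sketch computes the character count, the percent of words with six or more letters, and a SMOG grade for a single text. It is a rough illustration rather than the authors' implementation: the syllable counter is a simple vowel-group heuristic (an assumption), the SMOG grade uses the commonly cited regression form of McLaughlin's (1969) formula, and the spelling-mistake count is omitted because it requires an external dictionary.

```python
import math
import re

def count_syllables(word):
    """Rough syllable estimate: number of vowel groups (heuristic assumption)."""
    return max(1, len(re.findall(r"[aeiouy]+", word.lower())))

def text_features(text):
    """Return length, percent of long words, and SMOG grade for one text."""
    words = re.findall(r"[A-Za-z']+", text)
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    n_words = max(1, len(words))          # guard against empty text
    n_sentences = max(1, len(sentences))  # guard against missing punctuation
    long_words = sum(1 for w in words if len(w) >= 6)
    polysyllables = sum(1 for w in words if count_syllables(w) >= 3)
    # SMOG grade in its commonly cited form (McLaughlin 1969).
    smog = 1.0430 * math.sqrt(polysyllables * 30 / n_sentences) + 3.1291
    return {
        "n_chars": len(text),
        "pct_long_words": 100 * long_words / n_words,
        "smog": smog,
    }

# Invented sample text, loosely based on the opening example.
sample = "I pay all bills on time. The roof is leaking."
example = text_features(sample)
```

In this sample only "leaking" has six or more letters, so one word in ten counts as long, and no word has three or more estimated syllables, so the SMOG grade equals the formula's intercept.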

Because loan requests differ in length, and words differ in the frequency of appearance in

our corpus, we normalize the frequency of a word appearance in a loan request to its appearance

in the corpus and the number of words in the loan request using the term frequency–inverse

document frequency, tf-idf, measure commonly used in information retrieval. The term

frequency for word j in loan request m is defined by $tf_{jm} = X_{jm} / N_m$, where $X_{jm}$ is the number

of times word j appears in loan request m, and $N_m$ is the number of words in loan request m. This

component controls for the length of the document. The inverse document frequency is defined

by $idf_j = \log(D / M_j)$, where D is the number of loan requests and $M_j$ is the number of loan

requests in which word j appears. This term controls for how often a word appears across

documents. Tf-idf is given by $\text{tf-idf}_{jm} = tf_{jm} \times (idf_j + 1)$. Taken together, the tf-idf statistic

provides a measure of how likely a word is to appear in a document over and beyond chance.
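
The tf-idf weighting defined above can be computed directly from token lists. The Python sketch below follows the stated definitions term by term: term frequency is the count divided by the request length, inverse document frequency is the log of the number of requests over the number of requests containing the term, and the final weight multiplies the term frequency by the inverse document frequency plus one. The natural logarithm is an assumption, since the text does not state the base, and the two token lists are invented.

```python
import math
from collections import Counter

def tf_idf(docs):
    """docs: list of token lists. Returns one {term: tf-idf weight} dict per document."""
    D = len(docs)                      # number of loan requests
    doc_freq = Counter()               # M_j: number of requests containing term j
    for tokens in docs:
        doc_freq.update(set(tokens))
    weights = []
    for tokens in docs:
        counts = Counter(tokens)       # X_jm: occurrences of term j in request m
        n = len(tokens)                # N_m: number of words in request m
        weights.append({
            term: (x / n) * (math.log(D / doc_freq[term]) + 1)
            for term, x in counts.items()
        })
    return weights

# Invented token lists standing in for two stemmed loan requests.
docs = [
    ["need", "help", "pay", "help"],
    ["pay", "bill", "time"],
]
w = tf_idf(docs)
```

Because "pay" appears in every request, its inverse document frequency is zero and its weight reduces to its term frequency, whereas "help" appears in only one request and is weighted up.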

Financial and Demographic Variables. The second type of variables we consider are

financial and demographic information, commonly used in traditional risk models. These include

all information available to lenders on Prosper—loan amount, borrower’s credit grade (modeled

as a categorical variable AA-HR), debt to income ratio, whether the borrower is a home owner,

the bank fee for payment transfers, whether the loan is a relisting of a previous unsuccessful loan

request, and whether the borrower included a picture with the loan. In order to fully account for

all the information lenders have when viewing a loan request, we extracted information included

in the borrower’s profile pictures, such as gender (Male, Female and “Cannot Tell”), age

brackets (Young, Middle-aged, Old), and race (Caucasian, African American, Asian, Hispanic,

or “Cannot Tell”) using human coders. See Web Appendix for details of the coding procedure.5

Additionally, we controlled for the geographical location of the borrower to account for

5 In addition to coding the demographics, we asked our judges to provide a score for attractiveness and trustworthiness for each borrower based on the picture (similarly to Pope and Sydnor 2011). However, given the high degree of disagreements across raters we decided not to use these measures in our analyses.

differences in the economic environment that might have affected the borrower. We grouped the

borrowers’ states of residency into eight groups based on the Bureau of Economic Analysis

classification, and added a special armed forces group for Military personnel serving overseas.

Lastly, we included the final interest rate for each loan as a predictor in our model.6 Arguably, in

a financially efficient world, this final interest rate, which was determined using a bidding

process, should reflect all the information available to lenders (including the textual

information). However, it is possible that Prosper’s bidding mechanism allows for some strategic

behavior by sophisticated lenders, thus not fully reflecting a market efficient behavior (Chen,

Ghose, and Lambert 2014). Nevertheless, our models test whether the text is predictive over and

beyond the final interest rate. Table 1 presents summary statistics for the variables in our model.

*** Insert Table 1 about here ***

PREDICTING DEFAULT

Predictive Model (Stacking Ensemble)

Our objective in this section is to evaluate whether the text borrowers write in their loan

request is predictive of their loan default up to three years post origination. In order to do so, we

need to first build a strong benchmark—a powerful predictive model that includes the

traditionally used financial and demographics information and maximizes the chances of

predicting default using these variables. Second, we need to account for the fact that our model

may include a very large number of predictors (over one thousand bi-grams). In evaluating a

predictive model, it is common to compare alternative predictive models and choose the model

that best predicts the desired outcome—loan repayment in our case. From a purely predictive

6 An alternative measure would be the maximum interest rate proposed by the borrower. However, because this measure is highly correlated with the final lender rate, we include only the latter in the model.


point of view, a better approach, commonly used in machine learning, is to train several

predictive models and, rather than choose the best model, create an ensemble, or stack, of the

different models. An ensemble of models benefits from the strength of each individual model

and, at the same time, reduces the variance of the prediction. Accordingly, for the purpose of

leveraging the textual information to predict default, we apply that approach.

The stacking ensemble algorithm includes two steps. In the first step, we train each model

on the calibration data. Because of the large number of textual variables in our model, we

employ simultaneous variable selection and model estimation in the first step. In the second

step, we build a weighting model to optimally combine the models calibrated in the first step.

We consider five types of models in the first step. The models vary in terms of the

classifier used and the approach to model variable selection. The five models include two logistic

regressions and three versions of decision tree classifiers.7

Regularized Logistic Regressions (L1 and L2 Regularization). The two logistic

regressions are L1- and L2-regularized logistic regressions. The penalized logistic regression

likelihood is:

$$L(Y \mid \beta, \lambda) = \sum_{t=1}^{n} \Big( y_t \log\big[p(X_t \mid \beta)\big] + (1 - y_t) \log\big[1 - p(X_t \mid \beta)\big] \Big) - \lambda J(\beta),$$

where $Y = \{y_1, \ldots, y_n\}$ is the set of binary outcome variables for $n$ loans (loan repayment),

$p(X_t \mid \beta)$ is the probability of repayment based on the logit model, where $X_t$ is a vector of textual,

financial and demographic predictors for loan $t$, $\beta$ are a set of predictors’ coefficients, $\lambda$ is a

tuning penalization parameter to be estimated using cross-validation on the calibration sample,

and $J(\beta)$ is the penalization term. The L1 and L2 models differ with respect to the functional

7 We also considered a fourth type of decision tree (AdaBoost) as well as a Support Vector Machine classifier but dropped them due to poor performance on our data.


form of the penalization term, $J(\beta)$. In L1, $J(\beta) = \sum_{i=1}^{k} |\beta_i|$, while in L2, $J(\beta) = \sum_{i=1}^{k} \beta_i^2$, where

k is the number of predictors. Therefore, L1 is the Lasso regression penalty and L2 is the ridge

regression penalty. Whereas L1 tends to shrink many of the regression parameters to exactly zero

and leave other parameters with no shrinkage, L2 tends to shrink many parameters to small but

non-zero values. Before entering the variables into the L1 and L2 regression we standardize all

variables (Tibshirani 1997).
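
To make the penalized likelihood concrete, the following is a minimal sketch in plain Python of the objective for the two penalties. It is an illustration under stated assumptions, not the estimation code used in the paper: the function names, the toy data, and the value of the penalization parameter are all hypothetical, and the sketch only evaluates the objective rather than maximizing it.

```python
import math

def repayment_probability(x, beta):
    """Logit probability p(X | beta) for one loan; x and beta have equal length."""
    z = sum(b * v for b, v in zip(beta, x))
    return 1.0 / (1.0 + math.exp(-z))

def penalty(beta, kind):
    """J(beta): sum of |beta_i| for L1 (lasso) or sum of beta_i^2 for L2 (ridge)."""
    if kind == "l1":
        return sum(abs(b) for b in beta)
    if kind == "l2":
        return sum(b * b for b in beta)
    raise ValueError("kind must be 'l1' or 'l2'")

def penalized_log_likelihood(X, y, beta, lam, kind):
    """Bernoulli log-likelihood of the outcomes minus lam * J(beta)."""
    loglik = 0.0
    for x, outcome in zip(X, y):
        p = repayment_probability(x, beta)
        loglik += outcome * math.log(p) + (1 - outcome) * math.log(1 - p)
    return loglik - lam * penalty(beta, kind)

# Hypothetical toy data: two standardized predictors for three loans.
X = [[0.5, -1.0], [-0.3, 0.8], [1.2, 0.1]]
y = [1, 0, 1]          # 1 = repaid, 0 = defaulted
beta = [0.4, -0.2]     # made-up coefficients

print(penalized_log_likelihood(X, y, beta, lam=0.1, kind="l1"))
print(penalized_log_likelihood(X, y, beta, lam=0.1, kind="l2"))
```

For the same coefficients, the two objectives differ only through the penalty term, which is why the choice between L1 and L2 changes which coefficients are shrunk and by how much.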

Tree-based Methods (Random Forest and Extra Trees). There are three tree-based

methods in the ensemble. We estimate two different Random Forest models, one with variance

selection and the second with best feature selection as well as Extremely Randomized Trees

(Extra Trees). Both types of models combine many decision trees; thus, each of these tree-based methods

is an ensemble in and of itself, and each tree is chosen to resolve misclassification of previously

included trees. The Random Forest randomly draws, with replacement, subsets of the calibration

data to fit each tree, and a random subset of features (variables) is used in each tree. In the

Variance Selection Random Forest, features are chosen based on a variance threshold determined

by cross validation (80/20 split). In the K-Best Feature Selection Random Forest, features are

selected based on a $\chi^2$ test. That is, we select the K features with the highest $\chi^2$ score. We use

cross-validation (80/20 split) to determine the value of K. The Random Forest approach

mitigates the problem of over-fitting in traditional decision trees. The Extra Trees is an extension

of the Random Forest in which the cut-off points (the splits) for each feature in the tree are also

chosen at random (from a uniform distribution) and the best split among them is chosen. Due to

the size of the feature space, we first apply a K-Best Feature Selection as described above to

select the features to be included in the Extra Trees (see Web Appendix for further details).
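
The K-Best selection step can be illustrated with a short sketch in plain Python. This is one common way to compute a chi-square score for count features and is an assumption on our part; the paper's exact computation is described in the Web Appendix. The function names, the toy counts, and the fixed value of K are hypothetical (in the paper, K is chosen by cross-validation).

```python
def chi2_score(feature, labels):
    """Chi-square statistic comparing a non-negative feature's per-class totals
    with the totals expected if the feature were independent of the class."""
    n = len(labels)
    total = sum(feature)
    score = 0.0
    for c in set(labels):
        n_c = sum(1 for lab in labels if lab == c)
        observed = sum(v for v, lab in zip(feature, labels) if lab == c)
        expected = total * n_c / n
        if expected > 0:
            score += (observed - expected) ** 2 / expected
    return score

def select_k_best(X, labels, k):
    """Return the (sorted) column indices of the k features with the highest score."""
    n_features = len(X[0])
    scores = [chi2_score([row[f] for row in X], labels) for f in range(n_features)]
    ranked = sorted(range(n_features), key=lambda f: scores[f], reverse=True)
    return sorted(ranked[:k])

# Hypothetical bi-gram counts: rows are loans, columns are three bi-grams.
X = [[2, 0, 1],
     [3, 0, 1],
     [0, 2, 1],
     [0, 3, 1]]
labels = [1, 1, 0, 0]  # 1 = repaid, 0 = defaulted

print(select_k_best(X, labels, k=2))
```

In this toy example the third bi-gram appears equally often in both classes, so its score is zero and it is dropped when K = 2.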

We used the scikit-learn package in Python (http://scikit-learn.org/) to implement the five


classifiers on a random sample of 80% of the calibration data. For the logistic regressions, we

estimated the 𝜆 penalization parameter by grid search using a 3-fold cross validation on the

calibration sample. For the tree-based methods, to limit over-fitting of the trees, we randomized

the parameter optimization (Bergstra and Bengio 2012) using a 3-fold cross validation on the

calibration data to determine the structure of the tree (e.g., number of leaves, number of splits,

depth of the tree, and criteria). We use a randomized parameter optimization rather than an exhaustive

search (or a grid search) due to the large number of variables in our model. The parameters are

sampled from a distribution (uniform) over all possible parameter values.

Model Stacking and Predictions. In the second step, we estimate the weights for each

model to combine the ensemble of models using the remaining 20% of the calibration data. We

use a simple binary logistic model to combine the different predictive models. Though other

classifiers may be used, a logistic binary regression meta-classifier helps avoid overfitting

and often results in superior performance (Whalen and Gaurav 2013). In our binary logistic

regression model, repayment is the dependent variable, and the predictors are the probabilities of repayment for each

loan from each of the five models in the ensemble from step 1 (the two regularized logistic

regressions and the three tree-based methods). The estimated parameters of the

logistic regression provide the weights of each individual model in the ensemble. Specifically, the

ensemble repayment probability for loan j can be written as:

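$$p(\mathrm{repayment}_j) = \frac{\exp(\mathbf{x}_j' \mathbf{w})}{1 + \exp(\mathbf{x}_j' \mathbf{w})},$$

where $\mathbf{x}_j$ is the vector of repayment probabilities for each model $s$, $p(\mathrm{repayment}_j \mid \mathrm{model}_s)$,

from step 1, and $\mathbf{w}$ are the estimated weights of each model in the logistic regression classifier.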
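
The second-step combination can be sketched in a few lines of plain Python. The sketch assumes, as in the expression above, a logistic link with no intercept; the function name, the first-step probabilities, and the weights are hypothetical and are not the estimates reported in the paper.

```python
import math

def ensemble_repayment_probability(x_j, w):
    """Logistic combination exp(x'w) / (1 + exp(x'w)) of first-step probabilities."""
    score = sum(x * weight for x, weight in zip(x_j, w))
    return 1.0 / (1.0 + math.exp(-score))

# Hypothetical first-step repayment probabilities for one loan, ordered as
# (L1, L2, Random Forest K Best, Random Forest Variance Select, Extra Trees).
x_j = [0.80, 0.70, 0.90, 0.60, 0.75]
# Hypothetical second-step weights (illustration only).
w = [1.0, 2.0, 0.5, 0.5, 0.5]

print(ensemble_repayment_probability(x_j, w))
```

Because the weights enter through a single linear score, a first-step model with a larger positive weight moves the ensemble prediction more for the same change in its predicted probability.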

We estimated an ensemble of the aforementioned five models. We find the following

weights for the different models: L1 = 0.040, L2 = 0.560, Random Forest K Best = 0.218, Random


Forest Variance Select = 0.116, and Extra Trees = 0.066. Thus, for our application we find that

the L2 model contributes most to the ensemble. We also estimated five predictive models, each

based on one specific model in the ensemble (L1 and L2 regularized logistic regressions, the

two Random Forest models and the Extra Trees model). We find that for each of the individual

predictive models the textual information significantly improves predictions in the validation

sample over and beyond the financial and demographics. However, the stacking ensemble model

had the best predictive ability (see Table A2 in the Web Appendix for predictive ability of each

individual model). Accordingly, we conclude that it is the textual information rather than the

specific machine learning predictive model used that is responsible for the improved predictive

ability of the model that includes the textual information (Banko and Brill 2001). Next, we

describe the predictions based on the ensemble.

To test whether the text borrowers wrote in their loan requests is predictive of future

default, we use a 10-fold cross validation. We randomly split the loans into 10 equally sized

groups, calibrate the ensemble algorithm on nine groups and predict the remaining group. To

evaluate statistical significance, we repeated the 10-fold cross validation 10 times, using different

random seeds at each iteration. By cycling through the 10 groups and averaging the prediction

results across the 10 cycles and 10 replications we get a robust measure of 100 predictions.

Because there is no obvious cut-off for a probability from which one should consider the loan as

defaulted, we use the “area under the curve” (AUC) of the Receiver Operating Characteristic

(ROC) curve, a commonly used measure for prediction accuracy of binary outcomes. We further

report the Jaccard index (e.g., Netzer et al. 2012) of loan default, which is defined as the number

of correctly predicted defaulting loans divided by the total number of loans that defaulted

but were predicted to be repaid, loans that were predicted to default but were repaid, and correctly


predicted defaulted loans. This gives us an intuitive measure of hit rates of defaulting loans

penalized for both type I and type II errors. Finally, building on research

that shows FICO scores have a lower predictive power of financial behavior for people with low

scores (Avery et al. 2000), we report predictions for high (AA, A), medium (B, C), and low (D,

E, HR) credit grades (while controlling for credit grades within each group).
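
As an illustration of the two prediction measures, the sketch below computes the AUC (as the share of defaulted/repaid loan pairs that the model ranks correctly, with ties counted as one half) and the Jaccard index of default in plain Python. The function names, the toy predictions, and the 0.45 cutoff used to turn probabilities into default predictions for the Jaccard index are hypothetical assumptions made for this example.

```python
def auc(actual_default, default_probability):
    """Share of (defaulted, repaid) pairs in which the defaulted loan receives
    the higher predicted default probability; ties count as one half."""
    pos = [s for a, s in zip(actual_default, default_probability) if a == 1]
    neg = [s for a, s in zip(actual_default, default_probability) if a == 0]
    wins = 0.0
    for p in pos:
        for q in neg:
            if p > q:
                wins += 1.0
            elif p == q:
                wins += 0.5
    return wins / (len(pos) * len(neg))

def jaccard_default(actual_default, predicted_default):
    """Correctly predicted defaults divided by the sum of correctly predicted
    defaults, missed defaults, and repaid loans predicted to default."""
    tp = fn = fp = 0
    for a, p in zip(actual_default, predicted_default):
        if a == 1 and p == 1:
            tp += 1
        elif a == 1 and p == 0:
            fn += 1
        elif a == 0 and p == 1:
            fp += 1
    denom = tp + fn + fp
    return tp / denom if denom else 0.0

# Hypothetical outcomes (1 = defaulted) and predicted default probabilities.
actual = [1, 1, 0, 0, 0]
prob = [0.9, 0.4, 0.5, 0.2, 0.1]
predicted = [1 if p >= 0.45 else 0 for p in prob]  # illustrative cutoff

print(auc(actual, prob))
print(jaccard_default(actual, predicted))
```

The AUC uses the full ranking of predicted probabilities and therefore needs no cutoff, whereas the Jaccard index depends on the cutoff chosen.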

We compare three versions of the ensemble: (1) a model calibrated only on the financial

and demographic data; (2) a model that includes just the textual information and ignores the

financial and demographic information, and (3) a model that includes financial and demographic

information together with the textual data. Comparing models (1) and (3) provides the

incremental predictive power of the textual information over predictors commonly used in the

financial industry. Comparing models (1) and (2) informs the degree of predictive information

contained in the textual information relative to the financial and demographic information.

Prediction Results

Table 2 details the average results of the 10-fold cross validation across 10 random

shufflings of the observations. The results we present are clean out-of-sample validation because

in each fold we calibrate feature selection, model estimates, and the ensemble weights on 90% of

the data and leave the remaining 10% of the data for validation. Table 2 presents the Area Under

the ROC Curve (or AUC) and the Jaccard Index prediction measures and Figure 1 shows the

AUC graphically. Specifically, Figure 1 depicts the average ROC curve with and without the

textual data for one randomly chosen 10-fold cross validation. A better predictive model is the

one with a ROC curve that is closer to the upper left corner of the graph. The AUC of the model

with textual, financial, and demographics information is 2.89% better than the AUC of the model

with only financial and demographics information. This difference is statistically significant. In


fact, the model with both textual and financial and demographics information has higher AUC in

all 100 replications of the cross-validation exercise. Breaking the sample by credit grade, we note

that the textual information significantly improves predictions across all credit grade levels.

However, the textual information is particularly useful in improving default predictions for

borrowers with low credit levels. This result is consistent with Avery et al. (2000) who find

FICO scores to be least telling of people’s true financial situation for those with low scores.

Interestingly, if we were to ignore the financial and demographic information and use

only the borrower textual information, we obtain an AUC of 66.68% compared to an AUC of

70.52% for the model with only financial and demographic information. That is, the brief,

unverifiable “cheap talk” (Farrell and Rabin 1996) text provided by borrowers is

nearly as predictive as the traditional financial and demographic information. This result is

particularly impressive given the tremendous effort and expenditure involved in collecting the

financial information relative to the simple method used to collect the textual information. This

result may also suggest that textual information may be particularly useful in “thin file”

situations, where the financial information about consumers is sparse (Coyle 2017).

*** Insert Table 2 and Figure 1 about here ***

We conducted a back-of-the-envelope calculation to quantify the managerial relevance and

financial implications of the improvement in predictive ability offered by the textual data. For each

of the 18,312 granted loans we calculated the expected profit from investing $1,000 in each loan

based on the models with and without text. In calculating the expected profit, we assume that

borrowers who default repay on average 25% of the loan before defaulting (based on estimates

published by crowdfunding consulting agencies). The expected profit for loan j is:

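$$E(\mathrm{profit}_j) = \big[1 - \mathrm{Prob}(\mathrm{repayment}_j)\big] \times \big[-0.75 \times \mathrm{amount\_granted}_j + 0.25 \times \mathrm{Interest\_earned}_j\big] + \mathrm{Prob}(\mathrm{repayment}_j) \times \mathrm{Interest\_earned}_j,$$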


where $\mathrm{Prob}(\mathrm{repayment}_j)$ is the probability of repayment of loan $j$ based on the corresponding

model (with or without the text), $\mathrm{amount\_granted}_j$ is the principal amount the lender grants for

loan $j$ (\$1,000 in our case), and $\mathrm{Interest\_earned}_j$ is the interest paid to the lender based on

loan $j$’s final interest rate, over three years for repaid loans and over three quarters of a year for

defaulted loans (for simplicity we do not time-discount payments in years 2 and 3). For each of

the two policies (based on the model with and without text) we sort the loans based on their

expected profit and select the top 1,000 loans with the highest expected return for each policy.

Finally, we calculate the actual profitability of each lending policy based on the actual default of

each loan in the data to calculate the return on the investment on the million dollars (1,000 loans

times $1,000 per loan). We find that the investment policy based on the model with the textual data

returns $96,528 more than the policy based on the financial and demographics information only.

This is an increase of 9.65% in the return on investment (ROI) on the million dollars invested.

Thus, while the improvement in default prediction for the model with the textual information

might seem modest (nearly 3%) even though it is statistically significant, the improvement in ROI

based on the textual information is substantial and economically meaningful.
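
The back-of-the-envelope calculation can be sketched in plain Python as follows. This is an illustration under stated assumptions rather than the exact computation behind the reported figures: interest is treated as simple interest over three years with no discounting, and the function names, loan identifiers, probabilities, and rates are hypothetical.

```python
def expected_profit(p_repay, amount, annual_rate, years=3, repaid_share=0.25):
    """Expected profit of one loan: a defaulting borrower is assumed to repay
    25% of the loan (and 25% of the interest) before defaulting."""
    interest_earned = amount * annual_rate * years  # simple interest (assumption)
    profit_if_default = -(1 - repaid_share) * amount + repaid_share * interest_earned
    return (1 - p_repay) * profit_if_default + p_repay * interest_earned

def select_top_loans(loans, k, amount=1000.0):
    """Rank loans by expected profit and return the ids of the top k.
    Each loan is a tuple (loan_id, p_repay, annual_rate)."""
    ranked = sorted(
        loans,
        key=lambda loan: expected_profit(loan[1], amount, loan[2]),
        reverse=True,
    )
    return [loan[0] for loan in ranked[:k]]

# Hypothetical loans: (id, predicted repayment probability, final interest rate).
loans = [("a", 0.95, 0.10), ("b", 0.60, 0.25), ("c", 0.90, 0.15)]

for loan_id, p, rate in loans:
    print(loan_id, round(expected_profit(p, 1000.0, rate), 2))
print(select_top_loans(loans, k=2))
```

Running the same selection once with repayment probabilities from the model with text and once with probabilities from the model without text, and then scoring each selected portfolio on realized defaults, gives the comparison of lending policies described above.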

To summarize, the text borrowers write in their loan request can significantly help predict

loan default even when accounting for financial and demographic measures. The ensemble-based

predictive model was chosen to maximize predictive ability, but it provides little to no

interpretation of the parameter estimates, words, and topics that predict default. Therefore, in the

second part of this paper we present a series of analyses that shed light on the words and writing

styles that were most likely to appear in defaulting loans. These analyses also demonstrate how

machine learning approaches can be used beyond predictions and towards understanding of what

factors affect the improvement in prediction (Hofman, Sharma, and Watts 2017).


WORDS, TOPICS, AND WRITING STYLES THAT ARE ASSOCIATED WITH DEFAULT

The result that text has a predictive ability similar in magnitude to the predictive ability of

financial and demographic information is perhaps surprising. However, this result is consistent

with the idea that people who differ in the way they think and feel also differ in what they say and

write about those thoughts and feelings (Fast and Funder 2008; Hirsh and Peterson 2009). We

employed four approaches to uncover whether words, topics, and writing styles of defaulters

differ from those who repaid their loan. (1) We use a naïve Bayes classifier to identify the words

or bi-grams that most distinguish defaulted from fully-paid loans. The advantage of the naïve

Bayes is in providing intuitive interpretation of the words that are most discriminative between

defaulted and repaid loans; however, its disadvantage is that it assumes independence across

predictors and therefore cannot control for the financial and demographics variables (or for the

dependence among the textual variables). (2) To alleviate this concern, we use a logistic

regression with L1 penalization to uncover the words and bi-grams that are associated with

default after controlling for the financial and demographic information. The L1 regression results

corroborate the naïve Bayes findings (the correlation between the results of the two analyses is 0.582, p

< 0.01; see Table A3 in the Web Appendix for details of the L1 regression). (3) To look beyond

specific words or bi-grams and into the topics discussed in each loan and writing styles employed,

we use a latent Dirichlet allocation (LDA) analysis. (4) Finally, relying on a well-known

dictionary, the Linguistic Inquiry and Word Count (LIWC; Tausczik and Pennebaker 2010), we

identify the writing styles that are most correlated with defaulting or repaying the loan.

Words that Distinguish between Loan Requests of Paying and Defaulting Borrowers

To investigate which words in the loan application most discriminate between borrowers

who default and borrowers who repay the loan in full, we ran a multinomial naïve Bayes classifier


using the Python scikit-learn 3.0 package on bi-grams (all possible words and pairs of words) that

appeared in at least 400 loans (1,032 bi-grams). The classifier uses Bayes rule and the assumption

of independence among words to estimate each word’s likelihood of appearing in defaulted and

paid loans. We then calculate the most “informative” bi-grams in terms of discriminating between

defaulted and repaid loans by calculating the bi-grams with the highest ratio of P(bi-

gram|defaulted)/P(bi-gram|repaid) and the highest ratio of P(bi-gram|repaid)/P(bi-gram|defaulted).
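
The likelihood ratios can be illustrated with a small sketch in plain Python. It assumes a multinomial estimate of P(bi-gram | class) with add-one smoothing, which is a common default but an assumption here; the function names and the toy documents are hypothetical and do not come from the data.

```python
from collections import Counter

def bigram_likelihoods(documents, vocabulary, alpha=1.0):
    """Multinomial estimate of P(term | class) with add-alpha smoothing.
    Each document is a list of terms (words or bi-grams)."""
    counts = Counter()
    for doc in documents:
        counts.update(term for term in doc if term in vocabulary)
    total = sum(counts.values()) + alpha * len(vocabulary)
    return {term: (counts[term] + alpha) / total for term in vocabulary}

def likelihood_ratios(defaulted_docs, repaid_docs, vocabulary):
    """P(term | defaulted) / P(term | repaid) for every term in the vocabulary."""
    p_def = bigram_likelihoods(defaulted_docs, vocabulary)
    p_rep = bigram_likelihoods(repaid_docs, vocabulary)
    return {term: p_def[term] / p_rep[term] for term in vocabulary}

# Hypothetical tokenized loan requests (counts are made up for illustration).
vocabulary = {"god bless", "reinvest", "loan"}
defaulted_docs = [["god bless", "loan"], ["god bless", "loan", "loan"]]
repaid_docs = [["reinvest", "loan"], ["reinvest", "reinvest", "loan"]]

ratios = likelihood_ratios(defaulted_docs, repaid_docs, vocabulary)
for term, ratio in sorted(ratios.items(), key=lambda item: item[1], reverse=True):
    print(term, round(ratio, 2))
```

Terms with a ratio well above one are more characteristic of defaulted requests, and terms with a ratio well below one (equivalently, a high inverse ratio) are more characteristic of repaid requests.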

Table A4 in the Web Appendix presents the lists of words and their likelihood of

appearing in repaid versus defaulted loan requests (Table A4a) or defaulted versus repaid loan

requests (Table A4b). Figures 2 and 3 present word clouds of the naïve Bayes analysis of bi-

grams in their stemmed form. The size of each bi-gram in Figures 2 and 3 corresponds to the

likelihood that the bi-gram will be included in a repaid loan request versus a defaulted loan

request and in a defaulted loan request versus a repaid loan request, respectively. For example,

the word “reinvest” in Figure 2 is 4.8 times more likely to appear in a fully paid than a defaulted

loan request, while the word “god” is 2.0 times more likely to appear in a defaulted than a fully

paid loan request. The central cloud in each figure presents the most discriminant bi-grams

(cutoff ratio = 1.5) and the satellite clouds represent emerging themes based on our grouping of

these words.

*** Insert Figures 2, 3 around here ***

Several insights can be gained from this analysis. Relative to defaulters, borrowers who

paid in full were more likely to include in their loan application: (i) Words associated with their

financial situation such as “reinvest,” “interest,” and “tax;” (ii) Words that may be a sign of

projected improvement in financial ability: “graduate,” “wedding,” and “promote;” (iii) Relative

words such as “side,” “rather,” and “more than;” (iv) Long-term time related words such as


“future,” “every month,” and “few years;” (v) “I” words such as “I’d,” “I’ll,” “I’m.” The above

indicates that borrowers who paid in full may have nothing to hide, have a brighter future ahead

of them, and generally are truthful. The latter insight is based on research showing the use of

relative and time words as well as first person “I” words is associated with greater candor

because honest stories are usually more complex and personal (Newman et al. 2003). Dishonest

stories, on the other hand, are simpler, allowing the lying storyteller to conserve cognitive

resources in order to focus on the lie more easily (Tausczik and Pennebaker 2010). Borrowers

who repaid their loan also used words that indicate their financial literacy (e.g., “reinvest,” “after

tax,” and “minimum payment”). Indeed, higher financial literacy has been associated with lower

debt (Brown et al. 2015).

Turning to Figure 3, not surprisingly, borrowers who defaulted were more likely to

mention words related to (i) Financial hardships (“payday loan,” “child support,” and

“refinance”) and general hardship (“stress,” “divorce,” and “very hard”). This result is in line

with Herzenstein, Sonenshein, and Dholakia (2011) who found that discussing personal hardship

in the loan application is associated with borrowers who are late on their loan payments. (ii)

Explaining their situation (“loan explain,” “explain why”) and discussing their work state (“hard

work,” “worker”). Providing explanations is often connected to past deviant behavior

(Sonenshein, Herzenstein, and Dholakia 2011). (iii) Appreciative and good-manner words toward

lenders (“god bless,” “hello”) and pleading with lenders for help (“need help,” “please help”). Why is

polite language more likely to appear in defaulted loan requests? One possibility is that it is not

authentic. Indeed, a recent study shows that rude people are more trustworthy because their

reactions seem more authentic (Feldman et al. 2017). (iv) Referring to external sources such as

“god,” “son,” “someone.” The strong reference to others has been shown to exist in deceptive


language style. Liars tend to avoid mentioning themselves, perhaps to distance themselves from

the lie (Hancock et al. 2007; Newman et al. 2003). With respect to the frequent mention of God in

defaulting loans, Kupor, Laurin, and Levav (2015) find that reminders of God can increase the

likelihood of people engaging in riskier behaviors, alluding to the possibility these borrowers took

a loan they were unable to repay. (v) Time-related words (“total monthly,” “day”) and future

tense words (“would use,” “will able”). While both paying and defaulting borrowers use time

related words, defaulters seem to focus on the shorter term (a month) while repayers on the longer

term (a year). This result is consistent with Lynch et al. (2010), who showed that long-term

planning (as opposed to short-term planning) was associated with lower procrastination and a higher

degree of assignment completion. Further, the degree of long-term planning was associated with

FICO scores. The mention of shorter horizon time words by defaulters is also consistent with

Shah, Mullainathan, and Shafir (2012) who find that financial resource scarcity leads people to

shift their attention to the near future, neglecting the distant future, hence leading to over-

borrowing. The above words suggest that defaulting borrowers attempted to garner empathy,

seem forthcoming and appreciative, but when the time to repay the loan came they were unable to

escape their reality. We note that the themes most prevalent in defaulting loan requests construct a

narrative that is very similar to the “Nigerian email scam”, as described on the Federal Trade

Commission’s website: “Nigerian email scams are characterized by convincing sob stories,

unfailingly polite language, and promises of a big payoff.”8

While the naïve Bayes analysis is informative with respect to identifying words that are

associated with loan default, for practical use, we may wish to uncover the words and financial

or demographic variables that are most predictive of default. To that end, we analyzed the

8 We thank Professor Gil Appel for this neat observation.


variables with the highest importance in predicting repayment based on the Random Forest

model used in the ensemble learning model (see Table A5 in the Web Appendix for the list of

120 variables that are most predictive of default). Not surprisingly, we find that the financial

variables such as lender rate and credit score are most predictive of default. In terms of the words

that are most predictive of default in the Random Forest, we find a high degree of agreement

between this analysis and the naïve Bayes analysis. For example, the words “payday loan,”

“invest,” “hard,” “thank you,” “explain,” “student,” were informative in both analyses.

The Relationship between Words that are Associated with Loan Default and Words that are

Associated with Loan Funding

In assessing which words are associated with default we focused only on funded loans.

However, it is possible that some borrowers strategically use some of these words to convince

lenders to fund their loans. The purpose of the following analysis is to examine whether lenders

are aware of such strategic behavior, and if not, which words slipped through lenders’ defenses

and got them to fund overly-risky loans that eventually defaulted. For example, is the common

use of polite language such as “thank you,” or “hello” in defaulted loans intended to convince

lenders to fund them? To investigate the relationship between words that were associated with

loan default and words that were related to loan funding, we ran a naïve Bayes analysis on the

entire set of loan requests (funded and unfunded requests, 122,479 loan requests in total),

assessing the bi-grams with highest ratio of P(bi-gram|funded)/P(bi-gram|unfunded) and the

highest ratio of P(bi-gram|unfunded)/P(bi-gram|funded). See Table A6 in the appendix for

summary statistics of this dataset and Table A7 for the naïve Bayes analysis results.

Figure 4 depicts for each bi-gram its value on the ratio P(bi-gram|defaulted)/P(bi-

gram|repaid) versus its value on the ratio P(bi-gram|unfunded)/P(bi-gram|funded). We named a


few representative bi-grams. A high correlation between the two ratios (P(bi-gram|defaulted)/P(bi-gram|repaid)

and P(bi-gram|unfunded)/P(bi-gram|funded)) means that

lenders are largely aware of the words that are associated with default. Results show a fairly

strong correlation between the two ratios (r = 0.446, p < 0.01), suggesting that lenders are at least

somewhat rational when interpreting and incorporating the text in their funding decisions.

Examining the results more carefully, we find roughly three types of words: (1) words that have

high likelihood of default and low likelihood of funding or low likelihood of default and high

likelihood of funding (e.g., “chance,” and “bonus;”) (2) words that have a stronger impact on

loan funding than on loan default (e.g., words related to explanation, such as “loan explain,” and

“situation explain,”) and (3) words that “tricked” lenders into funding loans that eventually

defaulted (words that were related to default more than to loan funding). The most typical words

in this category are “god” and “god bless”.
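
The comparison of the two ratios amounts to a correlation across bi-grams, which the following plain-Python sketch illustrates. The function name and the ratio values are hypothetical and are not taken from Table A4 or Table A7.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation between two equal-length lists of numbers."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Hypothetical ratios for four bi-grams (same order in both lists):
# P(bi-gram | defaulted) / P(bi-gram | repaid)
default_ratio = [2.0, 1.5, 0.5, 0.8]
# P(bi-gram | unfunded) / P(bi-gram | funded)
unfunded_ratio = [1.2, 1.4, 0.7, 0.9]

print(round(pearson_r(default_ratio, unfunded_ratio), 3))
```

A bi-gram whose default ratio is high while its unfunded ratio is close to or below one would fall into the third group of words described above.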

*** Insert Figure 4 around here ***

Analyzing the Topics Discussed in Each Loan Request and Their Relationship to Default

The Naïve Bayes analysis allowed meaningful insights into the discriminative power of

specific words or bi-grams between repaid and defaulted loans. In Figures 2 and 3 we grouped

the individual words into topics based on our interpretation and judgment. However, several

machine learning methods have been proposed to statistically combine words into topics based

on their common co-occurrence in documents. Probably the most commonly used topic modeling

approach is the latent Dirichlet allocation analysis (LDA; Blei, Ng, and Jordan 2003). We apply

the LDA approach on the complete dataset of all loan requests (122,479).

We use the online variational inference algorithm for the LDA training (Hoffman, Bach,

and Blei 2010), following the settings and priors of Griffiths and Steyvers (2004). We used the 5,000


word stems that appeared most frequently across loan requests. Eliminating infrequent words

mitigates the risk of rare-word occurrences and co-occurrences confounding the topics. Because

the LDA analysis requires the researcher to determine the number of topics to be analyzed, we

varied the number of topics between two and 30 and used model fit (the perplexity measure) to

determine the final number of topics. We find that the model with seven topics had the best fit

(lowest perplexity). Table 3 presents the seven topics and the most representative words for each

topic based on the relevance score of 0.5 (see Sievert and Shirley (2014) for details of the

relevance score). Table A8 in the Web Appendix provides a more detailed list of the top 30 words for

each topic, and Figure A1 in the Web Appendix presents the perplexity analysis.
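
As a rough illustration of the two quantities used in this paragraph, the sketch below computes per-word perplexity from point estimates of the document-topic and topic-word distributions, and the Sievert and Shirley (2014) relevance score with the weight set to 0.5. The function names and the toy numbers are assumptions made for illustration only. The perplexity values reported in the paper come from the online variational algorithm, which this plug-in calculation does not reproduce.

```python
import math


def perplexity(docs, doc_topic, topic_word):
    """Per-word perplexity of a topic model on tokenized documents.

    docs:       list of documents, each a list of word ids
    doc_topic:  doc_topic[d][k] = probability of topic k in document d
    topic_word: topic_word[k][w] = probability of word w under topic k
    """
    log_lik, n_tokens = 0.0, 0
    for d, words in enumerate(docs):
        for w in words:
            # Probability of the word under the document's mixture of topics.
            p = sum(theta * topic_word[k][w] for k, theta in enumerate(doc_topic[d]))
            log_lik += math.log(p)
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)


def relevance(p_word_given_topic, p_word, lam=0.5):
    """Relevance of a word to a topic (Sievert and Shirley 2014).

    lam weights the within-topic probability against the lift
    p(word | topic) / p(word); the paper uses lam = 0.5.
    """
    lift = p_word_given_topic / p_word
    return lam * math.log(p_word_given_topic) + (1 - lam) * math.log(lift)


# Toy corpus with a two-word vocabulary (hypothetical data).
docs = [[0, 0, 1], [1, 1, 0]]

# Candidate 1: a single topic that treats both words as equally likely.
one_topic_mix = [[1.0], [1.0]]
one_topic_words = [[0.5, 0.5]]

# Candidate 2: two topics, each document loading fully on one of them.
two_topic_mix = [[1.0, 0.0], [0.0, 1.0]]
two_topic_words = [[2 / 3, 1 / 3], [1 / 3, 2 / 3]]

# The candidate with the lower perplexity fits this toy corpus better.
print(perplexity(docs, one_topic_mix, one_topic_words))
print(perplexity(docs, two_topic_mix, two_topic_words))
```

In practice the comparison would be repeated for each candidate number of topics (two through 30 in the paper), keeping the model with the lowest perplexity.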

The topics we identify relate to the reason to request the loan, life circumstances, or

writing style. We find three loan purpose topics: Employment and School, Interest Rate

Consolidation, and Business and Real Estate. The other four topics are related to life

circumstances and writing style: Expenses Explanation, Family, Loan Details Explanation, and

Monthly Expenses. The monthly expenses topics are most likely related to a set of expenses

Prosper recommended borrowers mention as part of their loan request during our data period.

To relate the topics mentioned in each loan request’s text to the likelihood of default, we

ran a binary logistic regression with loan repayment = 1 and default = 0 as the dependent

variable, the probability of each topic appearing in the loan based on our LDA analysis (the topic

loan details explanation serves as benchmark), and the same set of textual information metrics as

well as the financial and demographic variables used in the ensemble learning model described

earlier. Table 4 presents the results of the binary logistic regression with the LDA topics. Before

we discuss the results related to the LDA topics, we note that our controls, the financial and

demographic variables, are significant and in the expected direction. That is, repayment

likelihood increases as credit grades improve but decreases with debt-to-income ratio, home

ownership, loan amount, and lender rate. The strong relationship between lender rate and

repayment suggests some level of efficiency among Prosper lenders.
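
The role of the benchmark topic can be sketched as follows. Because the seven topic probabilities of a loan request sum to one, including all of them alongside an intercept would make the regressors perfectly collinear, so one topic is omitted and the remaining coefficients are read relative to it. The helper names, the example probabilities, and the two control values below are hypothetical; this is a minimal sketch of how a design row could be assembled, not the authors' estimation code.

```python
import math


def design_row(topic_probs, benchmark, controls):
    """Build one regression row: intercept, topics (benchmark omitted), controls."""
    if abs(sum(topic_probs.values()) - 1.0) > 1e-6:
        raise ValueError("topic probabilities should sum to 1")
    # Sort by topic name so that every loan uses the same column order.
    topics = [p for name, p in sorted(topic_probs.items()) if name != benchmark]
    return [1.0] + topics + list(controls)


def repay_probability(beta, row):
    """Logistic link: probability that the loan is repaid (dependent variable = 1)."""
    z = sum(b * x for b, x in zip(beta, row))
    return 1.0 / (1.0 + math.exp(-z))


# Hypothetical topic probabilities for a single loan request.
topic_probs = {
    "Employment and School": 0.30,
    "Interest Rate Consolidation": 0.10,
    "Business and Real Estate": 0.05,
    "Expenses Explanation": 0.15,
    "Family": 0.10,
    "Loan Details Explanation": 0.20,  # benchmark topic, left out of the row
    "Monthly Expenses": 0.10,
}
controls = [0.25, 1.0]  # hypothetical controls, e.g. debt-to-income ratio and a homeowner flag

row = design_row(topic_probs, "Loan Details Explanation", controls)
beta = [0.0] * len(row)  # placeholder coefficients; real values would be estimated from data
print(len(row), repay_probability(beta, row))
```

Under this setup, a positive coefficient on a topic means that shifting probability mass from the benchmark topic to that topic is associated with a higher likelihood of repayment.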

*** Insert Tables 3, 4 around here ***

Relative to the topic Loan Detail Explanation, we find that topics of Employment and

School, Interest Rate Reduction, and Monthly Payment are more likely to appear in repaid loan

requests. The Family topic, on the other hand, was less likely to appear in repaid loans.

Consistent with the naïve Bayes analysis, we find that the topics of explaining one’s financials and

loan motives and of referring to family are associated with lower repayment likelihood. We also

find that the tendency to detail the monthly expenses and financials is associated with higher

likelihood of repayment, perhaps because providing such information is indeed truthful and

forthcoming (having nothing to hide). Finally, although not the purpose of the LDA analysis, we

find that the binary logistic model that includes the textual information via LDA in addition to

the demographic and financial information fits the data and predicts default better than a model

that does not include the LDA topic probabilities (see Web Appendix for details).

In sum, multiple methods converged to uncover themes and words that differentiate

defaulted from paid loan requests. Next, using a well-researched dictionary we explore whether

borrowers’ writing styles can shed light on the traits and states of defaulted borrowers.

Circumstances and Personalities of Those Who Defaulted

In this section we rely on one of the more researched and established text analysis

methods, the Linguistic Inquiry and Word Count (LIWC) dictionary. This dictionary groups

almost 4,500 words into 64 linguistic and psychologically meaningful categories such as tenses

(past, present, future), forms (I, we, you, she or he), social, positive, and negative emotions.

Since its release in 2001, many researchers have examined and employed it in their research (see

Tausczik and Pennebaker (2010) for a comprehensive overview). The result of almost two

decades of research is lists of word categories that represent the writing style of people with

different personalities (Kosinski, Stillwell, and Graepel 2013; Pennebaker and King 1999;

Schwartz et al. 2013; Yarkoni 2010), mental health states (Preoţiuc-Pietro et al. 2015), emotional

states (Pennebaker, Mayne, and Francis 1997), as well as many other traits and states.

We chose LIWC over other dictionaries because it is context-free. Other dictionaries

(e.g., DICTION and Loughran and McDonald (2011)) have been used to interpret finance-related

text, such as quarterly and annual financial reports. However, because we are interested

in individuals, rather than firms, describing their finances, these dictionaries are less appropriate

for our data. Another recent dictionary, developed by Schwartz et al. (2013), is also inadequate

for our purposes because it is based on Facebook (and later used successfully on Twitter;

Preoţiuc-Pietro et al. 2015) and thus includes many items related to partying, drinking, and

swearing—words that rarely appear in the text we analyze.

LIWC is composed of sub-dictionaries. The same word can appear in several sub-dictionaries.

We first calculated the proportion of stemmed words in each loan request that belong

to each of the 64 dictionaries.9 We then estimated a binary logit model (loan repaid = 1 and loan

defaulted = 0) to relate the proportions of words in each loan that appear in each dictionary to

whether the loan was repaid, controlling for all financial and demographic variables used in our

previous analyses. Results are presented in Table 5.
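
The proportion calculation described above can be sketched as follows. The three toy categories are stand-ins invented for illustration and do not reproduce the actual LIWC word lists; the example words for the social and motion categories are the ones quoted in this paper, and the small "family" category is hypothetical, added only to show that one word can count toward several sub-dictionaries. The sketch also skips stemming and assumes simple lowercase tokenization.

```python
import re

# Toy stand-ins for LIWC sub-dictionaries (hypothetical, not the real word lists).
TOY_DICTIONARIES = {
    "social": {"mother", "father", "he", "she", "we", "they"},
    "family": {"mother", "father"},
    "motion": {"drive", "go", "run"},
}


def category_proportions(text, dictionaries):
    """Share of the words in a text that fall into each category.

    All tokens are kept, including very short ones, because words are
    matched against predefined lists rather than filtered by frequency.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return {name: 0.0 for name in dictionaries}
    return {
        name: sum(tok in words for tok in tokens) / len(tokens)
        for name, words in dictionaries.items()
    }


request = "We drive my mother to work and they go home"
proportions = category_proportions(request, TOY_DICTIONARIES)
print(proportions)
```

Each loan request then contributes one proportion per sub-dictionary, and these proportions enter the logit model as regressors next to the financial and demographic controls.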

*** Insert Table 5 around here ***

We begin by noting that all financial and demographic control variables are in the

9 For this analysis we did not remove words with less than three characters and infrequent words as we are matching words to pre-defined dictionaries.

expected direction and are consistent with our LDA results. Fourteen of the sixty-four LIWC

dictionaries were significantly related to repayment behavior. Several of them corroborate our

results from the naïve Bayes and LDA analyses. To interpret our findings, we rely on previous

research that leveraged the LIWC dictionary, which allows us to conclude that defaulted loan

requests contain words that are associated with the writing style of liars and of extroverts.

We begin with deception. Table 5 shows that the following LIWC sub-dictionaries, which have

been shown to be associated with a greater likelihood of deception, are associated in our

analysis with a greater likelihood of default: (1) present and future tense words.

This result is similar to our naïve Bayes findings. Indeed, past research shows liars are more

likely to use present and future tense words because they represent unconcluded situations

(Pasupathi 2007); (2) higher use of motion words (e.g., “drive,” “go,” and “run”), which have

been associated with lower cognitive complexity (Newman et al. 2003), and lower use of relative

words (e.g., “closer,” “higher,” and “older”), which are associated with higher complexity

(Pennebaker and King 1999). Deceptive language has been shown to include more motion words

and fewer relative words because it is less complex in nature; (3) Similar to our finding from the

naïve Bayes analysis that defaulters tend to refer to others, we find that social words (e.g.,

“mother,” “father,” “he,” “she,” “we,” and “they”) are associated with higher likelihood of

default. Along these lines, Hancock et al. (2007) showed that the linguistic writing style of liars is

reflected in lower use of first person singular and higher use of first person plural such as “we”

(see also Bond and Lee 2005; Newman et al. 2003). We note that in the context of hotel reviews,

Ott, Cardie, and Hancock (2012) find higher use of “I” in fake reviews possibly because they did

not have much to write about the hotel itself (because they have never been there) so they

described their own activities. Our Naïve Bayes finding shows more “I” words in repaid loans;

(4) time words (e.g., “January,” “Sunday,” “morning,” and “never”) and space words (e.g.,

“above,” “inch,” and “north”) were associated with higher likelihood of default. These words

have been found to be prevalent in deceptive statements written by prisoners because the use of

such words seems to draw attention away from the self (Bond and Lee 2005).

Taken together, we find that several of the LIWC dictionaries that have been previously

found to be associated with deception are also negatively associated with loan repayment (positively

associated with loan default). But do borrowers intentionally lie to lenders? One possibility is

that people can predict, with some accuracy, future default months or even years ahead. If so,

this would support the idea that defaulters are writing either intentionally or unconsciously to

deceive lenders. However, there is another option: borrowers on Prosper.com might genuinely

believe that they will be able to pay the borrowed money in full. Indeed, past research suggests

that people are often optimistic about future outcomes (Weinstein 1980). What they may be hiding

from lenders is the degree of their difficult situations and circumstances.

Our second observation is that the sub-dictionaries associated with the writing style of

extroverts are also associated with greater likelihood to default. Extroverts have been shown to

use more religious and body related words (e.g., “mouth,” “rib,” “sweat,” and “naked;” Yarkoni

2010), social and human words (e.g., “adults,” “boy,” and “female;” Hirsh and Peterson 2009;

Pennebaker and King 1999; Schwartz et al. 2013; Yarkoni 2010), motion words (e.g., “drive,”

“go,” and “run;” Schwartz et al. 2013), achievement words (e.g., “able,” “accomplish,” and

“master,”) and fewer filler words (e.g., “blah” and “like;” Mairesse et al. 2007)—all of which are

significantly related to a greater likelihood of default in our analysis (see Table 5).

The finding that defaulters are more likely to exhibit the writing style of extroverts is

consistent with research showing that extroverts are more likely to take risks (Nicholson et al.

2005), engage in compulsive buying of lottery tickets (Balabanis 2002), and are less likely to

save (Brandstätter 2005; Nyhus and Webley 2001). Moreover, it is not a coincidence that the

fourteen LIWC dictionaries that were significantly correlated with default are correlated with

both extroversion and deception. Past literature has consistently documented that extroverts are

more likely to lie, and not only because they talk to more people but rather because these lies

help smooth their interactions with others (Weiss and Feldman 2006).

We did not find a consistent and conclusive relationship between the LIWC dictionaries

associated with each of the other big five personality traits and loan repayment. Similarly, results

from other research on the relationship between LIWC and gender, age, mental, and emotional

states did not consistently relate to default in our study. Finally, we acknowledge that there may

be variables that are confounded with both the observable text and unobservable personality traits

or states that are accountable for the repayment behavior. Nevertheless, from a predictive point of

view, we find that the model that includes the LIWC dictionaries fits the data and predicts default

better than a model that does not include the textual information (see Web Appendix for details).

GENERAL DISCUSSION

The words we write matter. Aggregated text has been shown to predict market trends

(Bollen, Mao, and Zeng 2011) and stock market behavior (Tirunillai and Tellis 2012), market

structure (Netzer et al. 2012), virality of news articles (Berger and Milkman 2012), prices of

services (Jurafsky et al. 2014), and political elections (Tumasjan et al. 2010). At the individual

text writer level, text has been used to evaluate the state of mind of writers (Ventrella 2011), to

identify liars (Newman et al. 2003) and fake reviews (Ott, Cardie, and Hancock 2012), to assess

personality traits (Schwartz et al. 2013; Yarkoni 2010), and the mental state of those who Tweet

(Preoţiuc-Pietro et al. 2015). In this paper, we show that text has the ability to predict financial

behavior of its writer in the distant future with significant accuracy.

Using data from an online crowdfunding platform we show that incorporating the text

borrowers write in their loan application into traditional models that predict loan default based on

financial and demographic information about the borrower significantly and substantially

increases their predictive ability. Using machine learning methods such as naïve Bayes analysis,

regularized regression, and LDA analysis, we uncover the words and topics borrowers often

include in their loan request. We find that at loan origination, defaulters used simple but wordier

language, wrote about hardship, further explained their situation and why they need the loan, and

tended to refer to other sources such as their family, God, and chance. Building on past research

and the commonly used LIWC dictionary we infer that defaulting borrowers write similarly to

people who are extroverts and to those who lie. These results were obtained after controlling for

the borrower’s credit grade, which should capture the financial implications of the borrower’s

life circumstances, and the interest rate given to the borrower, which should capture differences

in the risk of different types of loans and borrowers. Simply put, we show that borrowers,

consciously or not, leave traces of their intentions, circumstances, and personality in the text they

write when applying for a loan—a sort of online “involuntary sweat”.

Theoretical and Practical Contribution

Our research makes the following theoretical and practical contributions. First, our work

contributes to the recent but growing marketing literature on uncovering behavioral insights on

consumers from the text they write and the traces they leave on social media (Humphreys and Jen-

Hui Wang 2018; Matz and Netzer 2017). We demonstrate that text consumers write at loan

origination is indicative of their states and traits, and predictive of their future repayment behavior.

In an environment characterized by high uncertainty and high stakes, we find that verifiable and

unverifiable data have similar predictive ability. While borrowers can truly write whatever they

wish in the textbox of the loan application—supposedly “cheap talk”—their word usage is

predictive of future repayment behavior at a similar scale as their financial and demographic

information. This finding implies that whether it is intentional and conscious or not, borrowers’

writings seem to disclose their true nature, intentions, and circumstances. This finding contributes

to the literature on implication and meaning of word usage by showing that people with different

economic and financial situations use words differently.

Second, we make an additional contribution to the text analytics literature. The text-

mining literature has primarily concentrated on predicting behaviors that occur at the time of

writing the text, such as lying about past events (Newman et al. 2003) or writing fake reviews (Ott,

Cardie, and Hancock 2012), but not on predicting the future behavior of writers. In one exception, Slatcher

and Pennebaker (2006) show that couples who used more positive emotion words when texting

to each other were more likely to continue dating three months later. The behavior we are

documenting often occurs years after the text was written.

Third, our approach to predicting default relies on an automatic algorithm that mines

individual words, including those without much meaning (e.g., articles and fillers), in the entire

textual corpus. Past work on narratives used to facilitate economic transactions (e.g., Chen, Yao,

and Kotha 2009; Herzenstein, Sonenshein, and Dholakia 2011; Martens, Jennings, and Jennings

2007) employed human coders and is therefore prone to human error and not scalable, which

limits its predictive ability and practical use. It may not be surprising that people who describe

hardship in the text at loan origination are more likely to default—as past research has shown.

However, our work finds that defaulters and repayers also differ in their usage of pronouns and

tenses, and those seemingly harmless pronouns have the ability to predict future economic

behaviors. Research shows that unless agents are very mindful of their usage of

pronouns, it is unlikely that they notice (not to mention plan) their use of specific pronouns in

order to manipulate the reader, lenders in our case (Pennebaker 2011).

Fourth, we provide evidence that our method of automatically analyzing free text is an

effective way of supplementing traditional measures and even replacing some aspects of the

human interaction of traditional bank loans. Textual information, such as the one we analyze, can

not only shed light on past behavior (which may be captured by measures such as FICO scores)

and provide context for it, but also provide information about the future, which is unlikely to be

captured by FICO scores and credit history reports. These future events may be positive (e.g.,

graduating and accepting a new job) or negative (e.g., impending medical cost), and certainly

affect lending decisions.

Furthermore, because lending institutions place a great deal of emphasis on the

importance of models for credit risk measurement and management, they have developed their

own proprietary models, which often have a high price tag due to data acquisition (Bloomberg

02/2013). Collecting text may be an effective and low-cost supplement to the traditional financial

data and default models. Such endeavors may be particularly useful in “thin file” situations

where historical data about customers’ finances is sparse. That being said, although our objective

in this research is to explore the predictive and informative value of the text in loan applications,

using such information to decide which borrowers should be granted loans may carry both ethical

and legal considerations within the restrictions of this highly regulated industry.

Avenues for Future Research

Our research takes the first step in automatically analyzing text in order to predict default,

and therefore initiates multiple research opportunities. First, we focus on predicting default

because this is an interesting behavior that is less idiosyncratic to the crowdfunding platform

whose data we analyze (compared with lending decisions). Theoretically, many aspects of loan

repayment behavior, which are grounded in human behavior (e.g., extroversion; Nyhus and

Webley 2001), should be invariant to the type of loan and lending platform, whereas other

aspects may vary by context. Our research should be extended to different populations, other

types of unsecured loans (e.g., credit card debt), and secured loans (e.g., mortgages).

Second, our results should be validated in and extended to other types of media and

communication, such as phone calls or online chats. It would be interesting to test how an active

conversation—two-sided correspondence—versus only one-sided input as in our data may affect

the results. In our data, the content of the text is entirely up to borrowers—they disclose what

they wish in whichever writing style they choose. In a conversation, the borrower may be

prompted to provide certain information. Nevertheless, we predict that many of our findings will

endure because they are in the realm of pronouns whose usage is mostly subconscious.

Third, we document specific words and themes that might help lenders avoid defaulting

borrowers, and help borrowers better express themselves in requesting the loan. Based on the

market efficiency hypothesis, if both lenders and borrowers internalize the results we

documented, these results may change. In other words, the Lucas critique (Lucas 1976) that

historical data cannot predict a change in economic policy, because future behavior changes as

policy changes, may apply to our situation. We wish to emphasize two aspects. First, an

important objective of our study is to explore and uncover the behavioral signals or traces in the

text that are associated with loan default. This objective is descriptive rather than prescriptive

and hence not susceptible to the Lucas critique. Second, for the Lucas critique to manifest three

conditions need to occur: (1) the research findings need to fully disseminate, (2) the agent needs

to have sufficient incentive to change her behavior, and (3) the agent needs to be able to change

her behavior based on the proposed findings (Van Heerde, Dekimpe, and Putsis 2005). With

respect to the first point, evidence from deception detection mechanisms (Li et al. 2014) as well

as macroeconomics (Rudebusch 2005) suggests that dissemination of research results rarely fully

affects behavior. Moreover, recall that borrowers in our context vary substantially in terms of their

socio-economic and education level, which makes it unlikely that any published result will fully

change the behavior of all or even the majority of the agents. As for the second point, borrowers

indeed have a strong incentive to change their behavior due to the decision importance.

However, they might not necessarily be able to do that, which is related to the third point. Even

if our results are fully disseminated, people’s use of pronouns and tenses (the finding at the heart

of our research) has been found to be largely subconscious, and changing it imposes a heavy

cognitive load (Pennebaker 2011). Nevertheless, we encourage future research to examine how

the word usage of defaulters changes over time.

Finally, while we are studying the predictive ability of written text regarding a particular

future behavior, our approach can be easily extended to other behaviors and industries. For

example, universities might be able to predict students’ success based on the text in the

application (beyond people manually reading the essays). Similarly, human resource departments

and recruiters can use the words in the text applicants write to identify promising candidates.

To conclude, borrowers leave meaningful signals in the text of loan applications that help

predict default, sometimes years post loan origination. Our research adds to the literature utilizing

text mining to better understand consumer behavior (Humphreys and Jen-Hui Wang 2018), and

especially in the realm of consumer finance (Herzenstein, Sonenshein, and Dholakia 2011).

REFERENCES

Agarwal, Sumit and Robert Hauswald (2010), “Distance and Private Information in Lending,”

Review of Financial Studies, 23(7), 2757-2788.

Agarwal, Sumit, Paige M. Skiba, and Jeremy Tobacman (2009), “Payday Loans and Credit Cards:

New Liquidity and Credit Scoring Puzzles,” American Economic Review, 99 (2), 412-17.

Amar, Moty, Dan Ariely, Shahar Ayal, Cynthia E. Cryder, and Scott I. Rick (2011), “Winning the

Battle but Losing the War: The Psychology of Debt Management,” Journal of Marketing

Research, 48(SPL), S38-S50.

Anderson, Jon, Stephen Burks, Colin DeYoung, and Aldo Rustichini (2011), “Toward the

Integration of Personality Theory and Decision Theory in the Explanation of Economic

Behavior.” Presented at the IZA workshop: Cognitive and non-cognitive skills.

Arya, Shweta, Catherine Eckel, and Colin Wichman (2013), “Anatomy of the Credit Score,”

Journal of Economic Behavior and Organization 95, 175-185.

Avery, Robert B., Raphael W. Bostic, Paul S. Calem, and Glenn B. Canner (2000), “Credit

Scoring: Statistical Issues and Evidence from Credit-Bureau Files,” Real Estate Economics,

28(3), 523-547.

Banko, Michele and Eric Brill (2001), “Scaling to Very Very Large Corpora for Natural Language

Disambiguation." In Proceedings of the 39th annual meeting on association for computational

linguistics, Association for Computational Linguistics, 26-33.

Barber, Brad M. and Terrance Odean (2001), “Boys will be Boys: Gender, Overconfidence, and

Common Stock Investment,” The Quarterly Journal of Economics, 116 (1), 261-292.

Berger, Jonah and Katherine L. Milkman (2012), “What Makes Online Content Viral?” Journal of

Marketing Research, 49 (2), 192-205.

Bergstra, James and Yoshua Bengio (2012), "Random Search for Hyper-Parameter

Optimization." Journal of Machine Learning Research, 13 (Feb), 281-305.

Berneth, Jeremy, Shannon G. Taylor, and Harvell Jackson Walker (2011), “From Measure to

Construct: An Investigation of the Nomological Network of Credit Scores” In Academy of

Management Proceedings, 1-6.

Blei, David M., Andrew Y. Ng, and Michael I. Jordan (2003), “Latent Dirichlet

Allocation,” Journal of Machine Learning Research, 3(Jan), 993-1022.

Bollen, Johan, Huina Mao, and Xiaojun Zeng (2011), “Twitter Mood Predicts the Stock Market,”

Journal of Computational Science, 2 (1), 1-8.

Bond, Gary D. and Adrienne Y. Lee (2005), “Language of Lies in Prison: Linguistic Classification

of Prisoners’ Truthful and Deceptive Natural Language,” Applied Cognitive Psychology, 19 (3),

313-329.

Brandstätter, Hermann (2005), “The Personality Roots of Saving—Uncovered from German and

Dutch Surveys,” Consumers, Policy and the Environment, 65-87.

Brown, Meta, John Grigsby, Wilbert van der Klaauw, Jaya Wen, and Basit Zafar (2015), “Financial

Education and the Debt Behavior of the Young,” Federal Reserve Bank (New York), Report #634.

Chen, Ning, Arpita Ghosh, and Nicolas S. Lambert (2014), “Auctions for Social Lending: A

Theoretical Analysis,” Games and Economic Behavior, 86, 367-391.

Chen, Xiao-Ping, Xin Yao, and Suresh Kotha (2009), “Entrepreneur Passion and Preparedness in

Business Plan Presentations: A Persuasion Analysis of Venture Capitalists’ Funding Decisions,”

Academy of Management Journal, 52 (1), 199-214.

Coyle, Tim (2017), “The Thin Filed, Underbanked: An Untapped Source of Buyers,” Credit Union

Times Magazine, September 17 issue.

DePaulo, Bella M., James J. Lindsay, Brian E. Malone, Laura Muhlenbruck, Kelly Charlton, and

Harris Cooper (2003), “Cues to Deception,” Psychological Bulletin, 129 (1), 74-118.

Dholakia, Utpal, Leona Tam, Sunyee Yoon, and Nancy Wong (2016), "The Ant and the

Grasshopper: Understanding Personal Saving Orientation of Consumers," Journal of Consumer

Research, 43 (1), 134-155.

Farrell, Joseph and Matthew Rabin (1996), “Cheap Talk,” The Journal of Economic Perspectives,

10 (3), 103-118.

Fast, Lisa A. and David C. Funder (2008), “Personality as Manifest in Word Use: Correlations with

Self-Report, Acquaintance Report, and Behavior,” Journal of Personality and Social

Psychology, 94 (2), 334-346.

Feldman, Gilad, Huiwen Lian, Michal Kosinski, and David Stillwell (2017), “Frankly, We Do Give

A Damn: The Relationship Between Profanity and Honesty,” Social Psychological and

Personality Science, 8 (7), 816-826.

Fernandes, Daniel, John G. Lynch Jr, and Richard G. Netemeyer (2014), “Financial Literacy,

Financial Education, and Downstream Financial Behaviors," Management Science, 60 (8),

1861-1883.

Freitas, Antonio L., Nira Liberman, Peter Salovey, and E. Tory Higgins (2002), “When to Begin?

Regulatory Focus and Initiating Goal Pursuit,” Personality and Social Psychology Bulletin

28(1), 121-130.

Griffiths, Thomas L. and Mark Steyvers (2004), “Finding Scientific Topics,” Proceedings of the

National Academy of Sciences, 101 (1), 5228-5235.

Gross, Jacob P.K., Cekic Osman, Don Hossler, and Nick Hillman (2009), “What Matters in Student

Loan Default: A Review of the Research Literature," Journal of Student Financial Aid, 39 (1),

19-29.

Hancock, Jeffrey T., Lauren E. Curry, Saurabh Goorha, and Michael Woodworth (2007), “On

Lying and Being Lied to: A Linguistic Analysis of Deception in Computer-Mediated

Communication,” Discourse Processes, 45 (1), 1-23.

Hansen, Stephen, Michael McMahon, and Andrea Prat (2014), "Transparency and Deliberation within

the FOMC: A Computational Linguistics Approach,” Working Paper, Columbia University.

Harkness, Sarah K. (2016), “Discrimination in Lending Markets: Status and the Intersections of

Gender and Race,” Social Psychology Quarterly, 79 (1), 81-93.

Herzenstein, Michal, Scott Sonenshein, and Utpal M. Dholakia (2011), “Tell Me a Good Story and

I May Lend You My Money: The Role of Narratives in Peer-To-Peer Lending Decisions,”

Journal of Marketing Research, 48(SPL), S138-S149.

Hirsh, Jacob B. and Jordan B. Peterson (2009), “Personality and Language Use in Self-Narratives,”

Journal of Research in Personality, 43 (3), 524-527.

Hofman, Jake M., Amit Sharma, and Duncan J. Watts (2017), “Prediction and Explanation in Social

Systems,” Science, 355 (6324), 486-488.

Hoffman, Matthew, Francis R. Bach, and David M. Blei (2010) “Online Learning for Latent

Dirichlet Allocation,” Advances in Neural Information Processing Systems, 856-864.

Humphreys, Ashlee and Rebecca Jen-Hui Wang (2018), “Automated Text Analysis for Consumer

Research,” Journal of Consumer Research, forthcoming.

Jurafsky, Dan, Victor Chahuneau, Bryan R. Routledge, and Noah A. Smith (2014), “Narrative

Framing of Consumer Sentiment in Online Restaurant Reviews,” First Monday, 19 (4).

Kosinski, Michal, David Stillwell, and Thore Graepel (2013), “Private Traits and Attributes are

Predictable from Digital Records of Human Behavior,” Proceedings of the National Academy of

Sciences, 110 (15), 5802-5805.

Kupor, Daniella M., Kristin Laurin, and Jonathan Levav (2015), “Anticipating Divine Protection?

Reminders of God Can Increase Nonmoral Risk Taking,” Psychological Science, 26(4), 374-84.

Li, Jiwei, Myle Ott, Claire Cardie, and Eduard Hovy (2014), “Towards a General Rule for

Identifying Deceptive Opinion Spam,” Proceedings of the 52nd Annual Meeting of the

Association for Computational Linguistics, 1566-1576.

Loughran, Tim and Bill McDonald (2011), “When is a Liability Not a Liability? Textual Analysis,

Dictionaries, and 10-Ks,” The Journal of Finance, 66 (1), 35-65.

Lucas, Robert (1976), “Econometric Policy Evaluation: A Critique,” The Phillips Curve and Labor

Markets, Carnegie-Rochester Conference Series on Public Policy. New York: Elsevier, 19–46.

Lynch, John G., Richard G. Netemeyer, Stephen A. Spiller, and Alessandra Zammit (2010), “A

Generalizable Scale of Propensity to Plan: The Long and the Short of Planning for Time and for

Money,” Journal of Consumer Research 37(1), 108-128.

Mairesse, François, Marilyn A. Walker, Matthias R. Mehl, and Roger K. Moore (2007), “Using

Linguistic Cues for the Automatic Recognition of Personality in Conversation and Text,”

Journal of Artificial Intelligence Research, 30, 457-500.

Martens, Martin L., Jennifer E. Jennings, and P. Devereaux Jennings (2007), “Do the Stories They

Tell Get Them the Money They Need? The Role of Entrepreneurial Narratives in Resource

Acquisition,” Academy of Management Journal, 50 (5), 1107-1132.

Matz, Sandra and Oded Netzer (2017), “Using Big Data as a Window into Consumers’

Psychology,” Current Opinion in Behavioral Sciences, 18 (December), 7-12.

Mayer, Christopher, Karen Pence, and Shane M. Sherlund (2009), “The Rise in Mortgage

Defaults,” The Journal of Economic Perspectives, 23 (1), 27-50.

McLaughlin, G. Harry (1969), “SMOG Grading—A New Readability Formula,” Journal of

Reading, 12(8), 639-646.

McAdams, Dan P. (2001), “The Psychology of Life Stories,” Review of General Psychology, 5(2),

100-123.

Mehl, Matthias R., Samuel D. Gosling, and James W. Pennebaker (2006), “Personality in its

Natural Habitat: Manifestations and Implicit Folk Theories of Personality in Daily Life,”

Journal of Personality and Social Psychology, 90 (5), 862-877.

Netemeyer, Richard G., Dee Warmath, Daniel Fernandes, and John Lynch Jr. (2018), “How Am I

Doing? Perceived Financial Well-Being, Its Potential Antecedents, and Its Relation to Overall

Well-Being,” Journal of Consumer Research, Forthcoming.

Netzer, Oded, Ronen Feldman, Jacob Goldenberg, and Moshe Fresko (2012), “Mine Your Own

Business: Market-Structure Surveillance through Text Mining,” Marketing Science, 31(3), 521-43.

Newman, Matthew L., James W. Pennebaker, Diane S. Berry, and Jane M. Richards (2003), “Lying

Words: Predicting Deception from Linguistic Styles,” Personality and Social Psychology

Bulletin, 29 (5), 665-675.

Nicholson, Nigel, Emma Soane, Mark Fenton-O'Creevy, and Paul Willman (2005), “Personality

and Domain-Specific Risk Taking,” Journal of Risk Research, 8 (2), 157-176.

Norvilitis, Jill M., Michelle M. Merwin, Timothy M. Osberg, Patricia V. Roehling, Paul Young,

and Michele M. Kamas (2006), “Personality Factors, Money Attitudes, Financial Knowledge,

and Credit-card Debt in College Students,” Journal of Applied Social Psychology, 36(6), 1395.

Nyhus, Ellen K., and Paul Webley (2001), “The Role of Personality in Household Saving and

Borrowing Behaviour,” European Journal of Personality, 15 (S1), S85-S103.

Oppenheimer, Daniel M. (2006), “Consequences of Erudite Vernacular Utilized Irrespective of

Necessity: Problems with Using Long Words Needlessly,” Applied Cognitive Psychology, 20

(2), 139-156.

Ott, Myle, Claire Cardie, and Jeff Hancock (2012), “Estimating the Prevalence of Deception in

Online Review Communities,” In Proceedings of the 21st International Conference on World

Wide Web, 201-210.

Palmer, Christopher (2015), “Why did so many subprime borrowers default during the crisis: Loose

credit or plummeting prices?” Working Paper, University of California Berkeley.

Pasupathi, Monisha (2007), “Telling and the Remembered Self: Linguistic Differences in Memories

for Previously Disclosed and Previously Undisclosed Events,” Memory, 15 (3), 258-270.

Pennebaker, James W. (2011), “The Secret Life of Pronouns,” New Scientist, 211(2828), 42-45.

Pennebaker, James W. and Lori D. Stone (2003), “Words of Wisdom: Language Use over the Life

Span,” Journal of Personality and Social Psychology, 85 (2), 291-301.

Pennebaker, James W. and Anna Graybeal (2001), “Patterns of Natural Language Use: Disclosure,

Personality, and Social Integration,” Current Directions in Psychological Science, 10 (3), 90-3.

Pennebaker, James W. and Laura A. King (1999), “Linguistic Styles: Language Use as an

Individual Difference,” Journal of Personality and Social Psychology, 77 (6), 1296-1312.

Pennebaker, James W., Tracy J. Mayne, and Martha E. Francis (1997), “Linguistic Predictors of

Adaptive Bereavement,” Journal of Personality and Social Psychology, 72 (4), 863-871.

Pompian, Michael M. and John M. Longo (2004), “A New Paradigm for Practical Application of

Behavioral Finance: Creating Investment Programs Based on Personality Type and Gender to

Produce Better Investment Outcomes,” Journal of Wealth Management, 7, 9-15.

Pope, Devin G. and Justin R. Sydnor (2011), “What’s in a Picture? Evidence of Discrimination

from Prosper.com,” Journal of Human Resources, 46 (1), 53-92.

Preotiuc-Pietro, Daniel, et al. (2015), “The Role of Personality, Age and Gender in Tweeting about

Mental Illnesses,” NAACL HLT, 21-30.

Rudebusch, Glenn D. (2005), “Assessing the Lucas Critique in Monetary Policy Models,” Journal

of Money, Credit, and Banking, 37(2), 245-272.

Rugh, Jacob S., and Douglas S. Massey (2010), “Racial Segregation and the American Foreclosure

Crisis,” American Sociological Review, 75 (5), 629-651.

Rustichini, Aldo, Colin DeYoung, Jon E. Anderson, and Stephen Burks (2016), “Toward the Integration

of Personality Theory and Decision Theory in Explaining Economic Behavior: An

Experimental Investigation,” Journal of Behavioral and Experimental Economics, 64, 122-137.

Schwartz, H. Andrew, et al. (2013), “Personality, Gender, and Age in the Language of Social

Media: The Open-Vocabulary Approach,” Plos One, 8 (9), e73791.

Sengupta, Rajdeep, and Geetesh Bhardwaj (2015), “Credit Scoring and Loan Default,”

International Review of Finance, 15 (2), 139-167.

Shah, Anuj K., Sendhil Mullainathan, and Eldar Shafir (2012), “Some Consequences of Having

Too Little,” Science, 338 (6107), 682-685.

Sievert, Carson and Kenneth E. Shirley (2014), “LDAvis: A Method for Visualizing and

Interpreting Topics,” Proceedings of the Workshop on Interactive Language Learning,

Visualization, and Interfaces, 63-70.

Slatcher, Richard B. and James W. Pennebaker (2006), “How Do I Love Thee? Let Me Count The

Words,” Psychological Science, 17 (8), 660-664.

Soman, Dilip and Amar Cheema (2011), “Earmarking and Partitioning: Increasing Saving by Low-

Income Households,” Journal of Marketing Research, 48(SPL), S14-S22.

Sonenshein, Scott, Michal Herzenstein, and Utpal M. Dholakia (2011), “How Accounts Shape

Lending Decisions Through Fostering Perceived Trustworthiness,” Organizational Behavior

and Human Decision Processes, 115 (1), 69-84.

Tausczik, Yla R. and James W. Pennebaker (2010), “The Psychological Meaning of Words: LIWC

and Computerized Text Analysis Methods,” Journal of Language and Social Psychology, 29

(1), 24-54.

Tibshirani, Robert (1997), “The Lasso Method for Variable Selection in the Cox Model,” Statistics

in Medicine, 16(4), 385–395.

Tirunillai, Seshadri and Gerard J. Tellis (2012), “Does Chatter Really Matter? Dynamics of User-

Generated Content and Stock Performance,” Marketing Science, 31 (2), 198-215.

Toma, Catalina L. and Jeffrey T. Hancock (2012), “What Lies Beneath: The Linguistic Traces of

Deception in Online Dating Profiles,” Journal of Communication, 62 (1), 78-97.

Tumasjan, Andranik, Timm Oliver Sprenger, Philipp G. Sandner, and Isabell M. Welpe (2010),

“Predicting Elections with Twitter: What 140 Characters Reveal About Political Sentiment,”

ICWSM, 178-185.

Van Heerde, Harald J., Marnik G. Dekimpe, and William P. Putsis Jr. (2005), “Marketing Models

and the Lucas Critique,” Journal of Marketing Research, 42 (1), 15-21.

Ventrella, Jeffrey J. (2011), Virtual Body Language, ETC Press.

Wei, Yanhao, Pinar Yildirim, Christophe Van den Bulte, and Chrysanthos Dellarocas (2015),

“Credit Scoring with Social Network Data,” Marketing Science, 35 (2), 234-258.

Weiss, Brent and Robert S. Feldman (2006), “Looking Good and Lying to Do It: Deception as an

Impression Management Strategy in Job Interviews,” Journal of Applied Social Psychology, 36

(4), 1070-1086.

Weinstein, Neil D. (1980), “Unrealistic Optimism about Future Life Events,” Journal of Personality

and Social Psychology, 39(5), 806-820.

Whalen, Sean and Gaurav Pandey (2013), “A Comparative Analysis of Ensemble Classifiers: Case

Studies in Genomics,” In Data Mining (ICDM), 2013 IEEE 13th International Conference, 807-

816.

Yarkoni, Tal (2010), “Personality in 100,000 Words: A Large-Scale Analysis of Personality and

Word Use among Bloggers,” Journal of Research in Personality, 44 (3), 363-373.

Table 1. Descriptive statistics for the Prosper data (n = 18,312)

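Variables | Min | Max | Mean | SD | Freq.
Amount requested | 1,000 | 25,000 | 6,507.3 | 5,732.9 |
Debt-to-income ratio | 0 | 10.01 | .33 | .89 |
Lender interest rate | 0 | .350 | .180 | .077 |
Number of words in description | 1 | 766 | 207.9 | 137.4 |
Number of words in title | 0 | 13 | 4.593 | 2.015 |
% of long words (6+ letters) | 0% | 71.4% | 29.8% | 6.4% |
SMOG | 3.129 | 12 | 11.347 | 1.045 |
Enchant spellchecker | 0 | 56 | 2.986 | 3.074 |
# Prior listings | 0 | 67 | 2.016 | 3.097 |
Credit grade: AA | | | | | 0.086
Credit grade: A | | | | | 0.082
Credit grade: B | | | | | 0.186
Credit grade: C | | | | | 0.219
Credit grade: D | | | | | 0.170
Credit grade: E | | | | | 0.128
Credit grade: HR | | | | | 0.129
Loan repayment (1 = paid, 0 = defaulted) | | | | | 0.669
Loan image dummy | | | | | 0.670
Home owner dummy | | | | | 0.470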

Table 2. Area under the curve (AUC) for models with text only, financial and demographics information only, and a combination of both

 | (1) Text only | (2) Financial/demog. | (3) Text & financial/demog. | Improvement from (2) to (3)
Low credit grades: D, E, HR | 61.33% | 62.44% | 64.96% | 4.03%**
Medium credit grades: B, C | 62.37% | 65.72% | 68.13% | 3.67%**
High credit grades: AA, A | 71.31% | 76.05% | 78.09% | 2.68%**
Overall AUC | 66.68% | 70.52% | 72.56% | 2.89%**
Jaccard Index | 35.85% | 37.85% | 38.85% | 2.64%**

Notes: All AUCs are averaged across 10 replications of 10-fold cross-validation. See Figure 1 for a plot of the receiver operating characteristic (ROC) curve averaged across one randomly selected 10-fold replication. The Jaccard index is calculated as N00/(N01+N10+N00), where N00 is the number of correctly predicted defaults, and N01 and N10 are the numbers of mispredicted repayments and defaults, respectively. ** represents significant improvements at the 0.05 level.
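The Jaccard index defined in the note to Table 2 can be computed directly from actual and predicted outcomes. The following is a minimal sketch in Python under stated assumptions: the function name `jaccard_default` and the toy labels are hypothetical illustrations, and outcomes are coded as in Table 1 (1 = paid, 0 = defaulted).

```python
def jaccard_default(actual, predicted, default_label=0):
    """Jaccard index for the default class: N00 / (N01 + N10 + N00)."""
    n00 = n01 = n10 = 0
    for a, p in zip(actual, predicted):
        if a == default_label and p == default_label:
            n00 += 1  # default that was correctly predicted
        elif a == default_label:
            n01 += 1  # default that was predicted as a repayment
        elif p == default_label:
            n10 += 1  # repayment that was predicted as a default
    denom = n00 + n01 + n10
    return n00 / denom if denom else 0.0


# Hypothetical toy outcomes: 1 = paid, 0 = defaulted
actual = [0, 0, 0, 1, 1, 1, 1, 0]
predicted = [0, 0, 1, 1, 0, 1, 1, 0]
score = jaccard_default(actual, predicted)
```

Correctly predicted repayments do not enter the index, so it focuses on how well the default class is recovered.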

Table 3. The seven LDA topics and representative words with highest relevance

LDA topic | Words with highest relevance (λ = 0.5)
Employment and School | Work, Job, Full, School, Year, College, Income, Employ, Student
Interest Rate Reduction | Debt, Interest, Rate, High, Consolidate, Score, Improve, Lower
Expenses Explanation | Expense, Explain, Cloth, Entertainment, Cable, Why, Utility, Insurance, Monthly
Business and Real Estate | Business, Purchase, Company, Invest, Fund, Addition, Property, Market, Build, Cost, Sell
Family | Bill, Try, Family, Life, Husband, Medical, Reality, Care, Give, Children, Hard, Daughter, Chance, Son, Money, Divorce
Loan Details and Explanations | Loan, Because, Candidate, Situation, Financial, Purpose, House, Expense, Monthly, Income
Monthly Payment | Month, Payment, Paid, Total, Account, Rent, Mortgage, Save, List, Every, Payday, Budget

Note: The sample words are chosen based on the relevance measure with λ = 0.5. See Table A8 in the Web Appendix for the list of words with top relevance in each topic.

Table 4. Binary regression with the thirteen LDA topics (repayment = 1)*

Financial and loan related variables | Estimate (Std. E) | Textual variables | Estimate (Std. E)
Amount Requested (in $10^5) | -7.53 (0.36) | Number of words in Description (in 10^4) | -4.02 (2.04)
Credit Grade HR | -0.83 (0.08) | Number of spelling mistakes | 0.00 (0.01)
Credit Grade E | -0.47 (0.08) | SMOG (in 10^3) | -1.8 (20.9)
Credit Grade D | -0.34 (0.06) | Words with 6 letters or more | -0.51 (0.40)
Credit Grade C | -0.20 (0.06) | Number of words in the title (in 10^3) | -7.70 (9.01)
Credit Grade A | 0.81 (0.08) | Employment and school | 1.95 (0.43)
Credit Grade AA | 0.26 (0.07) | Interest rate reduction | 2.68 (0.42)
Debt To Income | -0.09 (0.02) | Expenses explanation | -0.54 (0.61)
Images | 0.06 (0.04) | Business and real estate loan | 0.57 (0.38)
Home Owner Status | -0.33 (0.04) | Family | -1.28 (0.43)
Lender Interest Rate | -5.20 (0.31) | Monthly payments | 0.65 (0.40)
Bank Draft Fee Annual Rate | -35.64 (19.52) | |
Prior Listings | -0.03 (0.01) | |
Intercept | 2.51 (0.40) | |

* Bold face for P-value ≤ 0.05. For brevity we do not report in this table the estimates of the demographics variables such as location, age, gender and race.

Table 5. Binary regression with LIWC (repayment = 1)*

The first column pair reports the financial and basic text variables; the remaining three column pairs report the LIWC dictionaries.

Variable | Beta (Std. E) | Variable | Beta (Std. E) | Variable | Beta (Std. E) | Variable | Beta (Std. E)
Amount Requested (x 10^5) | -7.163 (0.3668) | Swear words | 35.5112 (35.275) | Past words | -2.1032 (1.9895) | Person pronoun words | 0.4119 (6.6093)
Credit Grade HR | -0.8551 (0.0844) | Filler words | 13.3939 (6.224) | Inhibition words | -2.3047 (3.4172) | Work words | 0.5175 (0.9333)
Credit Grade E | -0.4642 (0.0817) | Perception words | 13.4328 (10.839) | Home words | -2.3822 (1.7643) | Sexual words | -10.5097 (10.828)
Credit Grade D | -0.3383 (0.0623) | Relative words | 9.1729 (2.3748) | Hear words | -2.4191 (14.038) | They words | -15.491 (9.3357)
Credit Grade C | -0.1959 (0.0559) | Friend words | 9.7894 (7.0217) | I words | -2.7392 (8.1836) | Positive emotion words | 0.2869 (2.0477)
Credit Grade A | 0.7837 (0.0802) | Anxiety words | 8.7494 (8.9305) | Tentative words | -2.8712 (2.0522) | Money words | 0.2085 (0.7944)
Credit Grade AA | 0.2838 (0.0692) | Negate words | 6.0709 (3.3228) | Non-fluency words | -3.2295 (9.518) | Ingest words | -0.0434 (5.279)
Debt To Income | -0.0906 (0.0186) | Insight words | 5.0732 (2.8214) | Anger words | -3.2911 (9.7405) | Verbs words | -0.1936 (1.3174)
Images | 0.0599 (0.0389) | We words | 4.1277 (8.3628) | Achieve words | -3.3204 (1.5601) | Adverbs words | -0.3578 (1.8814)
Home Owner Status | -0.3199 (0.0381) | Pronoun words | 3.7935 (9.9981) | Incline words | -3.5433 (2.3316) | Functional words | -0.9427 (1.8725)
Lender Interest Rate | -5.2556 (0.3148) | Exclusion words | 3.1073 (2.7497) | She/he words | -3.5689 (7.3598) | Bios words | -1.3376 (2.6575)
Bank Draft Fee Annual Rate | -33.9126 (19.509) | Sad words | 2.9955 (6.4981) | You words | -3.714 (8.8219) | Assent words | -1.3651 (14.463)
Prior Listings | -0.0236 (0.0058) | Quantitative words | 2.7495 (1.9363) | Cause words | -3.7248 (2.407) | Family words | -1.4804 (2.8298)
Number of words in Description (x 10^4) | -3.494 (1.96) | Articles | 2.457 (2.0896) | Social words | -4.2882 (1.5697) | I pronoun words | -1.72 (10.026)
Number of spelling mistakes | -0.0124 (0.0068) | Numbers words | 2.2907 (2.7328) | Health words | -4.7602 (4.2679) | Death words | -16.3445 (10.721)
SMOG | -0.0252 (0.0209) | Preposition words | 2.1719 (1.8415) | Certain words | -5.2433 (2.7262) | Body words | -19.1156 (5.8326)
Words with 6 letters or more | 0.4455 (0.5716) | Conjoint words | 1.8673 (1.8392) | Present words | -6.223 (1.7067) | Religion words | -20.2865 (6.7741)
Number of words in the title | -0.0062 (0.6035) | Auxiliary verbs words | 1.7732 (2.3818) | Human words | -7.5781 (3.5803) | Feel words | -24.617 (11.734)
 | | Affect words | 1.2929 (1.5234) | Space words | -8.2317 (2.5648) | See words | -10.5021 (11.597)
 | | Discrepancy words | 1.2769 (2.686) | Future words | -8.4576 (3.5391) | Leisure words | 0.5548 (2.6577)
 | | Cognitive mechanism words | 0.7625 (1.8828) | Motion words | -9.1849 (2.8071) | Intercept | 3.6557 (0.6035)
 | | Negative emotion words | 0.7453 (4.7407) | Time words | -9.4218 (2.3077) | |

* Bold face for P-value ≤ 0.05. For brevity we do not report in this table the estimates of the demographics variables such as location, age, gender and race.

Figure 1. Receiver operating characteristics (ROC) curves for models with text only, financial and demographics information only, and a combination of both

Figure 2. Words indicative of loan repayment

Note: The most common words appear in the middle cloud (cutoff = 1:1.5) and are then organized by themes. On the top-right, in green, and clockwise: relative words, time related words, “I” words, words related to borrowing and debt, and words related to a brighter financial future.

Figure 3. Words indicative of loan default

Note: The most common words appear in the middle cloud (cutoff = 1:1.5) and are then organized by themes. On the top, in black, and clockwise: words related to explanations, external influence and others, future tense, time, work, extremity, appealing to lenders, financial hardship, hardship, medical and family issues, and desperation and plea.

Figure 4: Naïve Bayes analysis for loan funding and loan default

WEB APPENDIX

When Words Sweat: Identifying Signals for Loan Default in the Text of Loan Applications

Table of Contents

1. Additional Information about our Analyses
A. Procedure for coding of the profile pictures for age, gender, and race
B. Random Forest and Extra Trees
C. L1 regularization regression - predictive results
D. Latent Dirichlet allocation (LDA) - predictive results
E. Linguistic Inquiry and Word Count (LIWC) - predictive results

2. Additional Tables and Figures
Table A1: Correspondence between Prosper’s credit grades and FICO scores
Table A2: Area under the curve (AUC) of the underlying models of the ensemble
Table A3: L1 regularization binary logistic regression (1 = repayment)
Table A4a: Bi-grams that appeared frequently in repaid loans
Table A4b: Bi-grams that appeared frequently in defaulted loans
Table A5: Top 120 variables with the highest importance in the Random Forest analysis
Table A6: Summary statistics of dataset of all loan requests (n = 122,479)
Table A7a: Bi-grams that appeared frequently in funded loans
Table A7b: Bi-grams that appeared frequently in unfunded loans
Table A8: Lists of the top 30 words with the highest relevance measure for each LDA topic
Figure A1: LDA analysis – selecting the number of topics based on perplexity

1. Additional Information about our Analyses

A. Procedure for coding of the profile pictures for age, gender, and race

About a third of the borrowers’ profiles in our data (6,078 profiles) included at least one picture that is not a stock photo; however, many pictures were not of the borrower or included more than one person. To identify the borrower in the picture, we manually coded the borrowers’ profile pictures using the following process. If the picture included a caption, we relied on it to identify the borrower (for example, “My lovely wife and I”). If the picture did not include a caption and there was one adult in the picture, we assumed the adult in the picture was the borrower (following the procedure in Pope and Sydnor 2011). Once borrowers were identified, we recorded their gender (Female, Male, “Cannot Tell”), age (in three brackets: Young, Middle-aged, Old), and race (Caucasian, African American, Asian, Hispanic, or “Cannot Tell”). If the picture included more than one adult and there was no caption, or if the picture did not include any adult (e.g., the picture included kids, pets, or a kitchen project), we could not identify the borrower and therefore defined the gender and race of that picture as “Cannot Tell.” We imputed the age in unidentified pictures with the average age of the identified pictures, with the three age categories coded as 1, 2, and 3, respectively. Each picture was evaluated by at least two different undergraduate student coders, who were unaware of the research objective. Cohen’s kappas suggest fairly high levels of agreement across coders: gender = 0.89, race = 0.67, and age = 0.44.¹⁰ Disagreements were resolved by an additional coder who served as the final judge, observing the ratings of the previous coders.

¹⁰ Because agreement across coders for age was lower, we also tested a model without this variable. Excluding the age variable did not qualitatively affect our results.

B. Random Forest and Extra Trees

Random Forest and Extremely Randomized Trees (Extra Trees) are ensembles of trees. The idea behind both models is to combine a large number of decision trees. In these models, trees are chosen to resolve misclassification of previously included trees. The Random Forest randomly draws, with replacement, subsets of the calibration data to fit each tree, and a random subset of features (variables) is used in each tree.

In the Variance Selection Random Forest, features are chosen based on a variance threshold determined by cross-validation. The idea behind the variance selection threshold is to remove features whose variance does not meet a certain threshold. By definition, features that have zero variance (same value in all samples) are removed (for further details see http://scikit-learn.org/stable/modules/feature_selection.html#variance-threshold). We tested variance thresholds in the range of 0.001-0.0650 in increments of 0.00025. We find the selected threshold to be in the range of 0.00175-0.00375 across folds.

In the Best Feature Selection Random Forest, features are selected based on a χ² test. That is, we select the K features with the highest χ² score (other approaches include F-values or mutual information criteria; for further details see http://scikit-learn.org/stable/modules/generated/sklearn.feature_selection.SelectKBest.html#sklearn.feature_selection.SelectKBest). We use cross-validation to determine the “optimal” value of K. We allowed K to vary between a minimum of 10 features and a maximum of half of the training features (over 500 features in our case), in increments of 50 features. We find the number of features to be in the range of 60-260 across folds.

The Extra Trees model is an extension of the Random Forest in which the cut-off points (the splits) for each feature in the tree are also chosen at random (from a uniform distribution), and the best split among them is chosen (for further details see http://scikit-learn.org/stable/modules/generated/sklearn.tree.ExtraTreeClassifier.html). We use the maximal and minimal values of each feature observed in the data to select the boundaries of the uniform distribution for each feature. Due to the size of the feature space, we first apply a K-Best Feature Selection, as described above, to select the features to be included in the Extra Trees. We find the number of features to be in the range of 60-460 across folds.

For all tree-based methods, to limit over-fitting of the trees, we used randomized parameter optimization (Bergstra and Bengio 2012) with a 3-fold cross-validation on the calibration data to determine the structure of the tree (e.g., number of leaves, number of splits, depth of the tree, and criteria). We use a randomized parameter optimization rather than an exhaustive search (or a grid search) due to the large number of variables in our model. The parameters are sampled from a uniform distribution over all possible parameter values. We set the ranges for the parameters that dictate the structure of the trees as follows:

• Number of leaves [1-11]
• Depth of the tree [3 - max number of features]
• Minimum sample split [1-11]
• Min sample leaf [1-11]
• Criteria for splits [Gini or Entropy]
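The tuning procedure described in this section (variance-based feature selection, a Random Forest, and a randomized search with 3-fold cross-validation) can be sketched with scikit-learn as follows. This is an illustrative sketch under stated assumptions, not our exact configuration: the data are synthetic, the number of trees, the number of sampled configurations (`n_iter`), and the AUC scoring criterion are assumptions, and lower bounds of 2 are used for the number of leaves and the minimum sample split because current scikit-learn versions require values of at least 2 for these parameters.

```python
import numpy as np
from scipy.stats import randint
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import VarianceThreshold
from sklearn.model_selection import RandomizedSearchCV
from sklearn.pipeline import Pipeline

rng = np.random.default_rng(0)

# Synthetic stand-in for the calibration data (hypothetical):
# 300 loans, 40 features, binary outcome (1 = repaid, 0 = defaulted).
X = rng.random((300, 40))
y = (X[:, 0] + X[:, 1] + 0.3 * rng.standard_normal(300) > 1.0).astype(int)

pipeline = Pipeline([
    # Drop features whose variance falls below the threshold.
    ("variance", VarianceThreshold(threshold=0.00175)),
    ("forest", RandomForestClassifier(n_estimators=50, random_state=0)),
])

# Discrete uniform distributions over the tree-structure parameters.
# randint(low, high) samples integers from low to high - 1.
param_distributions = {
    "forest__max_leaf_nodes": randint(2, 12),
    "forest__max_depth": randint(3, X.shape[1] + 1),
    "forest__min_samples_split": randint(2, 12),
    "forest__min_samples_leaf": randint(1, 12),
    "forest__criterion": ["gini", "entropy"],
}

search = RandomizedSearchCV(
    pipeline,
    param_distributions,
    n_iter=10,          # number of sampled configurations (assumption)
    cv=3,               # 3-fold cross-validation on the calibration data
    scoring="roc_auc",  # selection criterion (assumption)
    random_state=0,
)
search.fit(X, y)
best_params = search.best_params_
best_auc = search.best_score_
```

Sampling a fixed number of configurations keeps the cost of the search independent of the size of the parameter grid, which is the motivation for preferring it over an exhaustive search.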

Reference

Bergstra, James, and Yoshua Bengio (2012), “Random Search for Hyper-Parameter Optimization,” Journal of Machine Learning Research, 13 (Feb), 281-305.

C. L1 regularization regression - predictive results

To test whether the naïve Bayes findings are sensitive to the inclusion of demographic and financial information and to the interdependence among words, we employ a logistic regression with an L1 penalty, using the same 1,032 bi-grams used in the ensemble learning and naïve Bayes analyses as well as the demographic and financial information. This analysis, while less easily interpretable than the naïve Bayes, provided very similar qualitative results (see Tables A3 and A4). The correlation between the results of the naïve Bayes and the L1 regression is 0.582 (P < 0.01). The L1 regression results confirm that the writing styles and intentions we identified through the naïve Bayes analysis are not merely a proxy for the demographic and financial information.

D. Latent Dirichlet allocation (LDA) - predictive results

Although the purpose of the LDA analysis was to learn about the topics discussed in loan requests rather than to predict default, we nevertheless tested the predictive ability of the uncovered topics. We find that the model that includes the LDA topics fits the data better than a model that does not include the textual information in terms of the Akaike information criterion (AIC_LDA = 20,872 and AIC_notext = 21,078). Furthermore, the likelihood ratio test significantly supports the model with the textual information relative to the model without the textual information (LR(DF = 17) = 227.34, p < 0.001). We ran a 10-fold cross-validation similar to the one conducted for the ensemble learning model. We find that the model with the LDA topics and the other textual variables (e.g., number of characters in the loan request) predicts defaults better than a baseline model that includes all the financial and demographic information but no textual information (AUC_LDA = 71.4% vs. AUC_noLDA = 70.1%). The model with the LDA variables provided a higher AUC relative to the model without the textual information in all 10 folds.¹¹
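A fold-by-fold comparison of this kind can be sketched as follows. The sketch is a hedged illustration on synthetic data: the variable names, the simulated covariates and topic proportions, and the use of a plain logistic regression are all assumptions made for the example.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

rng = np.random.default_rng(1)
n = 1000

# Synthetic stand-ins (hypothetical): three financial covariates and
# per-loan proportions over seven topics.
financial = rng.standard_normal((n, 3))
topics = rng.dirichlet(np.ones(7), size=n)
logit = financial[:, 0] + 2.0 * (topics[:, 0] - topics[:, 1])
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-logit))).astype(int)

# The same ten folds are used for both models.
cv = StratifiedKFold(n_splits=10, shuffle=True, random_state=0)
model = LogisticRegression(max_iter=1000)

auc_no_text = cross_val_score(model, financial, y, cv=cv, scoring="roc_auc")

# Topic proportions sum to one, so one topic is dropped as the baseline.
X_text = np.hstack([financial, topics[:, :-1]])
auc_text = cross_val_score(model, X_text, y, cv=cv, scoring="roc_auc")

# Number of folds in which the model with topics has the higher AUC.
folds_won = int((auc_text > auc_no_text).sum())
```

Using identical folds for both models makes the per-fold AUCs directly comparable, which is what a statement such as "higher AUC in all 10 folds" relies on.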

E. Linguistic Inquiry and Word Count (LIWC) - predictive results

We find that the model that includes the LIWC dictionaries fits the data better than a model that does not include the textual information in terms of the Akaike information criterion (AIC_text = 20,900 and AIC_notext = 21,078) and the likelihood ratio test (LR(DF = 69) = 319.94, p < 0.001). To test the predictive ability of this model, we ran a 10-fold cross-validation similar to the one conducted for the ensemble learning model. We find that the model with LIWC predicts defaults better than a baseline model that includes all the financial and demographic information but no textual information in all 10 folds (average AUC_LIWC = 70.9% vs. AUC_noLIWC = 70.1%).
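The p-values of the two likelihood ratio tests reported in sections D and E can be checked from the reported statistics, since under the null hypothesis the likelihood ratio statistic is asymptotically chi-square distributed with degrees of freedom equal to the number of added parameters. A minimal sketch, assuming SciPy is available:

```python
from scipy.stats import chi2

# Reported likelihood ratio statistics and degrees of freedom.
tests = {"LDA": (227.34, 17), "LIWC": (319.94, 69)}

# Upper-tail probability of the chi-square distribution.
p_values = {name: chi2.sf(stat, df) for name, (stat, df) in tests.items()}
```

Both values in `p_values` fall below the 0.001 level reported in the text.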

¹¹ We also estimate an ensemble predictive model with the LDA topics (in addition to the 1,032 bi-grams). The AUC of the ensemble with the LDA (AUC = 72.54%) is very similar to that of the ensemble without the LDA (AUC = 72.56%).

2. Additional Tables and Figures

Table A1: Correspondence between Prosper’s credit grades and FICO scores

Grade | AA | A | B | C | D | E | HR
Score | 760+ | 720-759 | 680-719 | 640-679 | 600-639 | 560-599 | 520-559

Table A2: Area under the curve (AUC) of the underlying models of the ensemble

Model | Text Only (Model 1) | Financial/demog. (Model 2) | Text & Financial/demog. (Model 3)
Logistic L1 | 65.41% | 70.55% | 71.86%***
Logistic L2 | 66.82% | 68.98% | 71.62%***
Random Forest (Variance Selection) | 64.56% | 69.68% | 71.67%***
Random Forest (Best Features Selection) | 62.60% | 70.48% | 71.33%***
Extremely Randomized Trees (Extra Trees) | 65.01% | 70.35% | 71.23%***
The Stacking Ensemble Model | 66.68% | 70.52% | 72.56%***

*** The difference between Model 2 and Model 3 is statistically significant at the 0.01 level based on the binomial test.

Table A3: L1 regularization binary logistic regression (1 = repayment).

Results for variables with β ≠ 0

Variable | Beta | Variable | Beta | Variable | Beta
Amount Requested (x 1000) | -0.0684 | health | 2.0569 | step | 0.9874
Credit Grade HR | -0.7297 | year ago | 1.8547 | ani question | 0.9852
Credit Grade E | -0.3821 | side | 1.7264 | payment and | 0.9688
Credit Grade D | -0.2980 | lower interest | 1.6991 | minimum | 0.9558
Credit Grade C | -0.1592 | august | 1.6706 | earli | 0.9527
Credit Grade A | 0.8027 | borrow | 1.6181 | unfortun | 0.9439
Credit Grade AA | 0.3022 | prosper lender | 1.5487 | last year | 0.9366
Debt To Income Ratio | -0.0840 | com | 1.4588 | your | 0.9358
Images | 0.0155 | than | 1.4532 | your consider | 0.9038
Is Borrower Homeowner | -0.2856 | averag | 1.4005 | while | 0.9003
Lender Rate | -5.1851 | lend | 1.3887 | fall | 0.8904
Bank Draft Fee Annual Rate | 0.0000 | card debt | 1.3826 | loan payment | 0.8880
New England | 0.0754 | bonu | 1.3187 | student | 0.8749
Middle East | 0.2965 | car payment | 1.3083 | the cost | 0.8748
Great Lakes | 0.0708 | card with | 1.3026 | run | 0.8726
Plains Regions | 0.0443 | and current | 1.2919 | larg | 0.8598
South West | 0.0179 | consult | 1.2630 | tax | 0.8495
Rocky Mountain | 0.2650 | dure | 1.2123 | anoth | 0.8416
Far West | 0.0726 | few month | 1.2043 | low | 0.8403
Military | 1.3085 | pay for | 1.1987 | contribut | 0.8180
# number of words in description | -0.0006 | graduat | 1.1670 | and had | 0.8006
Spelling Mistakes | -0.0033 | and plan | 1.1565 | and our | 0.7959
SMOG | -0.0237 | wed | 1.1529 | thi debt | 0.7940
% of words greater 6 | -0.1602 | but have | 1.1525 | cover the | 0.7819
Gender Male | 0.1377 | goe | 1.1193 | colleg | 0.7779
Gender Female | 0.0000 | pay thi | 1.1160 | electr | 0.7696
Age | -0.1769 | car insur | 1.1069 | coupl | 0.7643
Race White | 0.1454 | the credit | 1.1045 | the first | 0.7629
Race African American | -0.2214 | purchas | 1.0846 | inform | 0.7600
Race Asian | 0.2814 | payment for | 1.0775 | order | 0.7490
Race Hispanics | 0.0000 | the debt | 1.0718 | the other | 0.7487
# Prior Loan Listings | -0.0209 | student loan | 1.0715 | appli | 0.7459
Race Unknown | 0.0000 | reflect | 1.0680 | off the | 0.7414
Gender Unknown | 0.0565 | last | 1.0514 | the minimum | 0.7373
Attractive | -0.0016 | and get | 1.0505 | sinc | 0.7351
Trustworthy | 0.0729 | off thi | 1.0500 | bank | 0.7313
reinvest | 2.8241 | save | 1.0237 | longer | 0.7272
the balanc | 2.1761 | incom ratio | 1.0193 | activ | 0.7182
invest | 2.0809 | even | 1.0133 | rental | 0.7074


Variable Beta Variable Beta Variable Beta

almost 0.7036 past year 0.5703 both 0.3572 job with 0.6949 entir 0.5637 off with 0.3567 the high 0.6933 manag 0.5606 balanc 0.3567 the process 0.6901 file 0.5563 along 0.3470 ever 0.6884 futur 0.5561 water 0.3456 never 0.6850 into one 0.5422 comput 0.3425 earn 0.6843 posit 0.5286 year have 0.3397 good job 0.6731 quickli 0.5212 travel 0.3376 you will 0.6654 use credit 0.5162 been pay 0.3375 those 0.6636 turn 0.5133 have steadi 0.3373 june 0.6634 learn 0.5127 chang 0.3373 car loan 0.6623 and are 0.5058 paid off 0.3364 free 0.6619 understand 0.5048 payment thi 0.3298 get out 0.6595 each 0.4864 ga 0.3286 over the 0.6592 big 0.4853 singl 0.3233 see 0.6582 togeth 0.4812 five 0.3202 decid 0.6573 rebuild 0.4773 toward 0.3172 the payment 0.6539 too 0.4693 clear 0.3144 that can 0.6518 detail 0.4652 happen 0.3057 but the 0.6460 share 0.4626 rather 0.3040 although 0.6349 salari 0.4526 becaus have 0.2977 prosper and 0.6347 instead 0.4518 interest credit 0.2947 creat 0.6346 everi month 0.4427 payment the 0.2932 teacher 0.6333 though 0.4399 been with 0.2887 could 0.6283 improv credit 0.4349 stabl 0.2859 less 0.6274 way 0.4283 addit 0.2842 least 0.6246 life 0.4254 for your 0.2774 they are 0.6234 avail 0.4249 small 0.2750 return 0.6229 year monthli 0.4133 engin 0.2715 have two 0.6161 have not 0.4033 interest rate 0.2639 with prosper 0.6154 reliabl 0.4013 the past 0.2581 debt free 0.6124 under 0.3927 part 0.2568 mistak 0.6069 max 0.3927 fee 0.2567 realiz 0.5979 for over 0.3883 mean 0.2546 grow 0.5967 number 0.3883 owner 0.2525 problem 0.5948 should 0.3846 thi will 0.2508 account 0.5821 close 0.3829 through 0.2472 have great 0.5812 again for 0.3711 myself 0.2466 abov 0.5743 risk 0.3683 there 0.2465 elimin 0.5716 experi 0.3679 half 0.2408 everi 0.5711 it 0.3669 go 0.2362


Variable Beta Variable Beta Variable Beta

financ 0.5705 loan from 0.3623 thank for 0.2361 teach 0.2320 truck 0.1161 never miss 0.0108 build 0.2260 recent 0.1140 ask 0.0083 prior 0.2205 offer 0.1118 hous and 0.0082 sure 0.2114 car and 0.1099 live 0.0007 our 0.2083 degre 0.1077 and want 0.0005 extra 0.2052 expect 0.1050 can pay -0.0015 appreci 0.2037 remain 0.1020 respons -0.0019 employe 0.2019 into 0.1006 title -0.0020 our credit 0.1998 have veri 0.1000 oblig -0.0043 promot 0.1981 veri good 0.0940 rent -0.0074 well 0.1887 right now 0.0939 loan need -0.0109 improv 0.1859 system 0.0917 equiti -0.0118 husband and 0.1858 than the 0.0873 room -0.0123 point 0.1830 retir 0.0871 finish -0.0134 the bank 0.1789 loan thank 0.0827 book -0.0134 continu 0.1787 excel credit 0.0787 public -0.0207 made 0.1786 provid 0.0765 profit -0.0209 make payment 0.1780 apart 0.0736 look for -0.0216 extrem 0.1686 solid 0.0715 high interest -0.0217 more than 0.1682 until 0.0706 prioriti -0.0280 own home 0.1663 after 0.0692 the interest -0.0394 major 0.1657 soon 0.0654 includ -0.0449 budget 0.1645 down 0.0650 perfect -0.0501 miss payment 0.1592 show 0.0641 month for -0.0505 marri 0.1557 incur 0.0633 what -0.0510 debt that 0.1554 fix 0.0622 compani -0.0512 cover 0.1520 final 0.0581 and that -0.0517 bankruptci 0.1488 off credit 0.0566 stay -0.0530 help with 0.1478 onli 0.0564 school -0.0549 default 0.1474 expens are 0.0552 and can -0.0568 payoff 0.1466 possibl 0.0500 plan -0.0576 time have 0.1441 anyth 0.0467 state -0.0590 loan that 0.1437 collect 0.0442 and make -0.0601 schedul 0.1414 still 0.0429 home and -0.0603 credit score 0.1386 guarante 0.0322 veri respons -0.0668 cost 0.1365 current have 0.0290 own -0.0674 client 0.1287 here 0.0289 score and -0.0678 except 0.1283 next year 0.0262 have good -0.0686 given 0.1250 littl 0.0228 will allow -0.0725


Variable Beta Variable Beta Variable Beta

oper 0.1200 and not 0.0197 are not -0.0747 establish 0.1168 month that 0.0167 histori -0.0777 loan which -0.0815 will also -0.2218 deduct -0.3361 strong -0.0848 stabl job -0.2237 loan have -0.3397 mother -0.0890 dream -0.2249 becaus the -0.3398 and would -0.0926 onc -0.2263 support -0.3432 year with -0.0978 father -0.2268 thi prosper -0.3438 save for -0.0981 between -0.2290 time job -0.3450 thi year -0.0990 the end -0.2320 doe -0.3550 veri hard -0.1002 expand -0.2336 and will -0.3597 may -0.1049 payment other -0.2350 husband -0.3667 bought -0.1081 mani -0.2414 age -0.3679 and credit -0.1087 the same -0.2479 seem -0.3682 leav -0.1095 equip -0.2535 three -0.3761 pleas help -0.1106 care -0.2591 rate and -0.3794 clean -0.1137 new -0.2669 develop -0.3945 relist -0.1153 day -0.2692 give -0.3978 credit rate -0.1189 year old -0.2722 again -0.4005 hope -0.1201 the loan -0.2723 tri -0.4014 repair -0.1223 find -0.2733 be -0.4063 work and -0.1231 do -0.2833 assist -0.4070 also have -0.1257 univers -0.2856 loan off -0.4094 top -0.1259 credit and -0.2858 surgeri -0.4115 for loan -0.1266 higher -0.2860 attend -0.4144 budget mortgag -0.1372 debt and -0.2875 difficult -0.4155 yr -0.1407 will make -0.2890 mine -0.4170 know -0.1408 list and -0.2894 charg -0.4228 dont -0.1483 score -0.3015 store -0.4231 plu -0.1489 and the -0.3016 expens car -0.4311 much -0.1546 wife -0.3033 educ -0.4327 need thi -0.1562 not have -0.3054 from the -0.4361 use thi -0.1639 abl -0.3069 mortgag -0.4368 what you -0.1782 love -0.3119 child -0.4401 help pay -0.1890 receiv -0.3119 incom and -0.4422 off all -0.1944 answer -0.3181 citi -0.4423 forward -0.2088 record -0.3208 repay thi -0.4484 thing -0.2097 time monthli -0.3215 locat -0.4554 juli -0.2101 call -0.3221 capit -0.4591 and just -0.2101 will pay -0.3234 afford -0.4605 product -0.2134 for our -0.3252 licens -0.4606


Variable Beta Variable Beta Variable Beta

consolid -0.2138 item -0.3268 their -0.4619 request -0.2193 obtain -0.3278 prosper loan -0.4637 the money -0.2194 alway -0.3298 medic -0.4727 loan becausei -0.2216 card that -0.3342 properti -0.4818 come -0.4900 kid -0.6742 will have -1.1321 for and -0.4994 advanc -0.6784 person -1.1359 where -0.4999 ani -0.6847 took -1.1447 back thi -0.5013 just need -0.6878 sale -1.1680 around -0.5046 have alway -0.6892 industri -1.1713 loan pay -0.5072 dti -0.7008 maintain -1.1923 ago and -0.5108 with the -0.7013 gener -1.1978 pay back -0.5298 stress -0.7231 daughter -1.2414 and need -0.5309 per -0.7372 local -1.2763 within the -0.5328 divorc -0.7406 promis -1.2867 these -0.5471 monthli payment -0.7425 son -1.3027 abil -0.5481 the opportun -0.7440 get back -1.4633 and help -0.5500 off some -0.7455 god -1.5715 date -0.5503 refin -0.7578 estat -1.9481 field -0.5609 they -0.7642 lost -1.9870 verifi -0.5642 late -0.7703 thank you -2.6051 she -0.5651 need help -0.7771 report -0.5670 long -0.7893 Intercept 1.9431 the time -0.5673 dollar -0.7904 explain -0.5723 behind -0.7926 famili -0.5743 becausei -0.7933 valu -0.5941 bid -0.7949 have work -0.5974 been the -0.7964 everyth -0.6013 need the -0.8029 left over -0.6025 project -0.8136 loan with -0.6121 loan and -0.8236 the year -0.6137 monthli incom -0.8270 someon -0.6157 bit -0.8356 interest loan -0.6170 taken -0.8561 price -0.6324 and veri -0.8645 sever -0.6410 sourc -0.8747 then -0.6437 websit -0.9171 off high -0.6450 busi -0.9218 the compani -0.6452 them -0.9231 were -0.6502 total monthli -0.9243


The table above reports the variables whose regression coefficients were not set to zero. Below we list the variables whose coefficients were set to zero. Note that although one can use a bootstrap approach to obtain standard errors for the parameter estimates of the L1-regularized binary logistic regression, these estimates are biased, so standard errors in a regularized regression are not meaningful (Park and Casella 2008). Accordingly, we do not report standard errors in Table A3.

Variables with β = 0:

Financial and demographic variables:

Number of words in the title, Credit Grade = C, Images, Lender Rate, Bank Draft Fee Annual Rate, New England, Mid East, Great Lakes, Plains Regions, South West, Rocky Mountain, Far West, Military, Gender = Male, Gender = Female, Gender Unknown, Age, Race Unknown, Race = White, Race = African American, Race = Asian, Race = Hispanics, Words with 6 or more letters.

Bi-grams (listed here alphabetically):

abl pay, about month, about year, account and, actual, ad, add, after tax, ago, ahead, all bill, all credit, all debt, all our, all the, allow, almost year, alreadi, alway paid, alway pay, america, amount, and also, and for, and ha, and hope, and i'm, and now, and pay, and start, and take, and thank, and then, and they, and thi, and wa, and worl, annual, approx, approxim, are good, are paid, area, ask for, auto, automat, away, back the, back track, bad, base, becom, been employ, been late, been work, befor, begin, believ, below, benefit, best, better, bill time, bless, bring, busi and, buy, came, can get, can see, can't, card balanc, card financi, card have, career, case, cash, cash flow, catch, caus, cell, cell phone, chanc, check, child support, children, class, combin, commit, commun, compani and, compani for, complet, consid, consider, consolid credit, contact, contract, cours, credit histori, credit report, current employ, current work, custom, cut, deal, debt financi, debt have, debt incom, decis, delinqu, depend, deposit, did, did not, didn't, differ, direct, doe not, done, don't, don't have, down the, due, due the, dure the, each month, easili, emerg, employ, employ for, end, enjoy, enough, etc, everyon, excel, exist, expens and, expens for, expens ga, expens total, explain what, explain whi, far, feel, feel free, few, few year, figur, firm, first, flow, for almost, for consid, for financi, for month, for pay, for prosper, for take, for view, for year, found, four, friend, from prosper, full, full time, fulli, fund, further, ga util, get rid, get the, get thi, goal, god bless, gone, good credit, got, grade, great, greatli, groceri, gross, group, ha been, happi, hard work, have alreadi, have ani, have credit, have excel, have had, have learn, have made, have never, have one, have over, have paid, have problem, have some, have stabl, have the, hello, help get, help out, her, hi, higher interest, him, hold, honest, hospit, hour, how, howev, i'd, i'll, i'm, i'm not, immedi, import, incom after, incom from, increas, individu, intend, into the, issu, it', i'v, i'v been, job and, job for, keep, know that, late payment, law, leas, left, lender, less than, lesson, let, level, like pay, limit, list, live with, loan back, loan consolid, loan credit, loan explain, loan financi, loan for, loan help, loan monthli, loan request, loan the, loan would, look, lot, lower, make the, market, medic bill, meet, member, minimum payment, misc, miss, mom, money and, money for, money pay, month ago, month and, month have, month monthli, monthli budget, mortgag rent, most, move, name, need pay, never been, next, normal, not includ, not onli, note,


now and, now have, off and, off debt, offic, old, one payment, one the, open, opportun, origin, our home, out the, outstand, over year, overtim, owe, paid for, paid full, parent, part time, pass, past, pay all, pay bill, pay down, pay the, pay them, paycheck, payday, payment have, payment prosper, payment time, payment will, peopl, per month, period, person loan, pictur, place, plan pay, pleas, post, present, pretti, previou, process, profession, profil, program, prosper payment, prosper will, prove, purpos thi, put, question, quit, rais, ratio, read, readi, real, real estat, realli, reason, rebuild credit, reduc, remov, rent, insur, repay, replac, requir, rest, result, review, revolv, rid, right, same, save and, say, school and, second, secur, see have, seek, self, sell, servic, set, short, site, situat explain, situat have, six, sold, some credit, someth, spend, stand, start, steadi, success, such, summer, take care, take the, term, that are, that ha, that have, that the, that thi, that time, that wa, that will, that would, that you, the amount, the best, the bill, the busi, the follow, the futur, the hous, the last, the monthli, the mortgag, the new, the next, the onli, the prosper, the purpos, the reason, the remain, the rest, there are, thi money, thi time, think, thought, three year, time and, time everi, time for, time the, top prioriti, total, total expens, track, tri get, tuition, two, two year, unexpect, use consolid, use for, use help, use pay, use the, usual, vehicl, view, view list, wa not, wait, want, want pay, week, went, when, when wa, whi, whi you, who, wife', wife and, will abl, will help, will not, will paid, with credit, with thi, within, without, wonder, won't, work for, work full, work hard, work the, work with, worth, would have, would like, would use, www, year and, year now, year the, yet, you can, you for, you have, young, your help, your time

Reference

Park, Trevor and George Casella (2008), “The Bayesian Lasso,” Journal of the American Statistical Association, 103 (482), 681-686.
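The split in Table A3 between variables with β ≠ 0 and variables with β = 0 follows from how an L1 penalty behaves: it can shrink a coefficient to exactly zero rather than merely close to zero. The sketch below illustrates this on a toy data set. It is not the estimation procedure used for Table A3: the data, the penalty weight `lam`, the step size, and the proximal-gradient solver are all assumptions chosen for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def soft_threshold(value, threshold):
    """Shrink value toward zero; return exactly 0.0 inside the threshold band."""
    if value > threshold:
        return value - threshold
    if value < -threshold:
        return value + threshold
    return 0.0

def fit_l1_logistic(X, y, lam, lr=0.5, n_iter=2000):
    """Proximal-gradient fit of a logistic regression with an L1 penalty on the slopes.

    The intercept b is left unpenalized. lam, lr, and n_iter are illustrative choices.
    """
    n, k = len(X), len(X[0])
    w, b = [0.0] * k, 0.0
    for _ in range(n_iter):
        p = [sigmoid(b + sum(wj * xj for wj, xj in zip(w, row))) for row in X]
        grad_b = sum(pi - yi for pi, yi in zip(p, y)) / n
        grad_w = [
            sum((pi - yi) * row[j] for pi, yi, row in zip(p, y, X)) / n
            for j in range(k)
        ]
        b -= lr * grad_b
        # Gradient step followed by soft-thresholding (the proximal step for L1).
        w = [soft_threshold(wj - lr * gj, lr * lam) for wj, gj in zip(w, grad_w)]
    return w, b

# Toy data: the first feature is related to the outcome, the second is not.
X = [[1, 1], [1, 0], [1, 1], [1, 0], [0, 1], [0, 0], [0, 1], [0, 0]]
y = [1, 1, 1, 0, 0, 0, 0, 1]

w, b = fit_l1_logistic(X, y, lam=0.1)
print("informative feature:", w[0])
print("uninformative feature:", w[1])
```

In this toy setting the coefficient of the uninformative feature stays at exactly 0.0, while the informative feature keeps a positive (but shrunken) coefficient. That shrinkage is the bias referred to above when explaining why standard errors are not reported.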


Table A4a: Bi-grams that appeared frequently in repaid loans

p(word|repaid)/p(word|defaulted) ≥ 1.1

Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio reinvest 4.78 low 1.44 the interest 1.30 lend 2.25 com 1.43 guarantee 1.30 i'd 2.02 cover the 1.43 usual 1.29 lower interest 2.01 miss payment 1.41 pay for 1.29 side 1.97 share 1.41 like pay 1.29 prosper lender 1.87 earli 1.41 good credit 1.29 borrow 1.80 than 1.40 miss 1.28 invest 1.80 the cost 1.40 incur 1.28 wedding 1.79 own home 1.40 current have 1.28 student loan 1.75 car and 1.39 your consider 1.28 excel credit 1.74 off this 1.39 and current 1.28 than the 1.70 income ratio 1.38 earn 1.28 card with 1.67 use credit 1.38 schedule 1.28 graduate 1.66 consult 1.38 less 1.28 rather 1.66 apart 1.36 payment for 1.28 the balance 1.66 worth 1.36 ever 1.28 student 1.65 figure 1.36 didn't 1.28 the minimum 1.65 don't 1.36 car insurance 1.27 contribute 1.62 interest rate 1.36 mean 1.27 it' 1.57 job with 1.35 health 1.27 entire 1.56 activ 1.35 debt that 1.27 never miss 1.54 return 1.35 save 1.27 thi debt 1.54 i'm 1.35 paid for 1.27 the bank 1.54 expect 1.35 all bill 1.27 i'v been 1.53 easily 1.35 down the 1.27 card debt 1.52 quickly 1.33 debt free 1.27 risk 1.52 travel 1.33 every month 1.27 engin 1.51 more than 1.33 retire 1.27 i'v 1.49 expens are 1.32 have already 1.26 i'll 1.49 balance 1.32 excel 1.26 prosper and 1.49 pretty 1.32 have great 1.26 minimum 1.49 college 1.32 default 1.26 and plan 1.49 firm 1.32 promote 1.26 goe 1.48 big 1.32 higher 1.26 after tax 1.48 rate and 1.31 averag 1.26 thank for 1.47 august 1.31 possible 1.25 and i'm 1.47 spend 1.31 have very 1.25 minimum payment 1.47 less than 1.30 good job 1.25 have excel 1.47 instead 1.30 one the 1.25 summer 1.46 debt income 1.30 have never 1.25 bonus 1.44 the debt 1.30 off credit 1.25


Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio

debt have 1.25 year monthly 1.20 term 1.16 and want 1.25 toward 1.20 past year 1.16 because have 1.25 card financi 1.20 fully 1.16 above 1.25 the remain 1.19 the other 1.16 except 1.25 time for 1.19 off debt 1.16 lower 1.25 way 1.19 credit history 1.16 lender 1.24 free 1.19 profile 1.16 about month 1.24 pay this 1.19 use pay 1.15 solid 1.24 very good 1.19 addit 1.15 been pay 1.24 payment this 1.19 and would 1.15 even 1.24 revolve 1.19 gas 1.15 have good 1.24 tax 1.19 into one 1.15 debt financi 1.24 cost 1.19 picture 1.15 while 1.23 with credit 1.19 degree 1.15 the credit 1.23 under 1.18 law 1.15 higher interest 1.23 think 1.18 comput 1.15 situation have 1.23 major 1.18 last year 1.15 stable job 1.23 book 1.18 extra 1.15 too 1.23 order 1.18 credit rate 1.15 salary 1.23 plan 1.18 out the 1.15 you have 1.23 payment the 1.18 car loan 1.15 though 1.23 feel free 1.17 replace 1.15 interest credit 1.23 case 1.17 how 1.15 cover 1.23 max 1.17 look 1.14 dure the 1.23 card have 1.17 outstand 1.14 save for 1.23 ratio 1.17 through 1.14 the high 1.23 five 1.17 profession 1.14 card balance 1.23 thought 1.17 offer 1.14 step 1.22 plan pay 1.17 and are 1.14 understand 1.22 have two 1.17 although 1.14 payment and 1.22 don't have 1.17 paid full 1.14 least 1.22 half 1.17 large 1.14 always pay 1.22 year now 1.17 consolid credit 1.14 have steady 1.22 teach 1.17 experience 1.14 expense gas 1.22 dure 1.16 financ 1.14 teacher 1.21 payment have 1.16 annual 1.14 but the 1.21 have credit 1.16 require 1.14 stable 1.21 future 1.16 loan the 1.14 any question 1.21 ad 1.16 cash flow 1.14 pay down 1.21 question 1.16 paid off 1.14 the first 1.21 reflect 1.16 reason 1.13 gas util 1.21 decide 1.16 current employ 1.13 late payment 1.21 note 1.16 young 1.13 next year 1.21 fix 1.16 univers 1.13 purchase 1.20 use consolid 1.16 bit 1.13 three year 1.20 apply 1.16 add 1.13 below 1.20 buy 1.16 year ago 1.13 site 1.20 couple 1.16 loan from 1.13 happy 1.20 far 1.16 fall 1.13 income after 1.20 anything 1.16 both 
1.13 already 1.20 longer 1.16 thi money 1.13 few month 1.20 together 1.16 would like 1.13


Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio Bi-gram (repaid) Ratio have stable 1.13 few year 1.11 never 1.10 best 1.13 group 1.11 extreme 1.10 should 1.13 the last 1.11 approximat 1.10 card that 1.13 next 1.11 strong 1.10 are paid 1.12 course 1.11 post 1.10 reduce 1.12 over the 1.11 rental 1.10 limit 1.12 reliable 1.11 point 1.10 flow 1.12 consider 1.11 intend 1.10 live 1.12 clear 1.11 year have 1.10 off high 1.12 feel 1.11 when was 1.10 number 1.12 remain 1.11 and pay 1.10 house and 1.12 have any 1.11 avail 1.10 june 1.12 www 1.11 never been 1.10 system 1.12 stand 1.10 small 1.10 base 1.12 something 1.10 full time 1.10 history 1.12 create 1.10 for your 1.10 have not 1.12 charge 1.10 actual 1.10 month have 1.11


Table A4b: Bi-grams that appeared more frequently in defaulted loans

p(word|defaulted)/p(word|repaid) ≥ 1.1

Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio
payday loan | 2.13 | are good | 1.46 | father | 1.31
payday | 2.06 | daughter | 1.45 | real | 1.31
view list | 2.03 | see have | 1.44 | fact | 1.31
god | 2.02 | cause | 1.43 | child | 1.31
god bless | 1.99 | time every | 1.43 | the business | 1.31
need help | 1.88 | worker | 1.43 | mother | 1.31
for view | 1.87 | you are | 1.42 | surgery | 1.31
top priority | 1.86 | and credit | 1.42 | hard work | 1.30
bless | 1.75 | real estate | 1.42 | source | 1.30
lost | 1.74 | follow | 1.42 | rebuild | 1.30
the follow | 1.74 | estate | 1.41 | normal | 1.30
view | 1.72 | left over | 1.41 | hard | 1.30
get back | 1.69 | back this | 1.41 | expand | 1.29
for prosper | 1.69 | location | 1.40 | automat | 1.29
priority | 1.68 | pleas help | 1.40 | mom | 1.29
promise | 1.68 | not online | 1.40 | month that | 1.29
list and | 1.67 | prove | 1.40 | lesson | 1.29
payment prosper | 1.66 | just need | 1.40 | tri get | 1.28
prosper will | 1.65 | divorce | 1.39 | local | 1.28
back track | 1.64 | medic bill | 1.38 | she | 1.28
stress | 1.64 | for pay | 1.37 | know that | 1.28
behind | 1.64 | again | 1.36 | this prosper | 1.28
would use | 1.60 | very hard | 1.35 | bill and | 1.28
loan explain | 1.58 | call | 1.35 | pay back | 1.28
year | 1.57 | top | 1.35 | track | 1.28
situation explain | 1.56 | been the | 1.35 | payment will | 1.27
help get | 1.55 | hello | 1.34 | day | 1.27
explain what | 1.54 | the opportunity | 1.34 | off some | 1.27
catch | 1.54 | total monthly | 1.34 | sale | 1.27
prosper payment | 1.54 | refin | 1.34 | capital | 1.27
again for | 1.53 | have learn | 1.33 | advance | 1.27
son | 1.53 | everyone | 1.33 | industry | 1.26
what you | 1.51 | have over | 1.32 | support | 1.26
explain why | 1.51 | rebuild credit | 1.32 | business and | 1.26
child support | 1.51 | thank you | 1.32 | direct | 1.26
someone | 1.51 | store | 1.32 | city | 1.26
explain | 1.50 | project | 1.32 | husband | 1.26
chance | 1.50 | will able | 1.32 | attend | 1.25
and thank | 1.50 | equipment | 1.32 | for take | 1.25
why you | 1.49 | children | 1.31 | relist | 1.25


Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio
assist | 1.25 | pass | 1.21 | maintain | 1.17
bad | 1.25 | monthly budget | 1.21 | budget mortgage | 1.17
year old | 1.25 | product | 1.20 | loan request | 1.16
business | 1.25 | you will | 1.20 | place | 1.16
because the | 1.24 | verify | 1.20 | person | 1.16
difficult | 1.24 | will have | 1.20 | loan and | 1.16
left | 1.24 | time monthly | 1.20 | the company | 1.16
name | 1.24 | the fund | 1.20 | got | 1.16
take care | 1.24 | all our | 1.20 | repair | 1.16
took | 1.24 | the prosper | 1.19 | deduct | 1.16
medic | 1.24 | and help | 1.19 | save and | 1.16
were | 1.24 | came | 1.19 | improve credit | 1.16
family | 1.24 | mistake | 1.19 | with the | 1.16
work hard | 1.24 | opportunity | 1.19 | clean | 1.15
have always | 1.24 | general | 1.19 | need pay | 1.15
develop | 1.23 | need the | 1.19 | them | 1.15
that you | 1.23 | the bill | 1.19 | loan for | 1.15
license | 1.23 | taken | 1.19 | help pay | 1.15
credit report | 1.23 | person loan | 1.19 | severe | 1.15
hospital | 1.23 | and just | 1.18 | age | 1.15
work with | 1.23 | old | 1.18 | and start | 1.15
honest | 1.23 | care | 1.18 | list | 1.15
ask for | 1.23 | oper | 1.18 | file | 1.15
our home | 1.23 | present | 1.18 | the mortgage | 1.15
item | 1.23 | and can | 1.18 | issue | 1.15
let | 1.22 | pleas | 1.18 | america | 1.15
which will | 1.22 | the new | 1.18 | loan monthly | 1.15
you for | 1.22 | payment other | 1.18 | the time | 1.15
custom | 1.22 | overtime | 1.17 | begin | 1.15
loan pay | 1.22 | prosper loan | 1.17 | can't | 1.15
that need | 1.22 | interest loan | 1.17 | can get | 1.14
everything | 1.22 | i'm not | 1.17 | check | 1.14
kid | 1.22 | off all | 1.17 | obtain | 1.14
and was | 1.21 | ha been | 1.17 | july | 1.14
report | 1.21 | her | 1.17 | review | 1.14
hi | 1.21 | why | 1.17 | dollar | 1.14
and need | 1.21 | this time | 1.17 | total expense | 1.14
for financial | 1.21 | him | 1.17 | went | 1.14
greatly | 1.21 | total | 1.17 | website | 1.14
need this | 1.21 | mortgage rent | 1.17 | did not | 1.14


Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio | Bi-gram (defaulted) | Ratio
who | 1.13 | expense total | 1.12 | money pay | 1.10
know | 1.13 | loan need | 1.12 | score | 1.10
the reason | 1.13 | where | 1.12 | bring | 1.10
remove | 1.13 | for year | 1.12 | and ha | 1.10
date | 1.13 | will allow | 1.12 | service | 1.10
been with | 1.13 | once | 1.12 | have work | 1.10
these | 1.13 | can see | 1.12 | found | 1.10
contract | 1.13 | and get | 1.12 | stay | 1.10
and that | 1.13 | loan which | 1.12 | oblig | 1.10
self | 1.13 | leave | 1.12 | get out | 1.10
what | 1.13 | owner | 1.12 | open | 1.10
our credit | 1.13 | loan help | 1.12 | all the | 1.10
back the | 1.13 | their | 1.11 | profit | 1.10
they | 1.13 | turn | 1.11 | from the | 1.10
for month | 1.12 | not have | 1.11 | leas | 1.11
payment time | 1.12 | month for | 1.11 | year the | 1.10
request | 1.12 | wonder | 1.11 | time the | 1.10
due the | 1.12 | that wa | 1.11 | tri | 1.10
deposit | 1.12 | gone | 1.11 | mine | 1.10
due | 1.12 | did | 1.11 | take the | 1.10
each month | 1.12 | give | 1.11 | abl pay | 1.10
perfect | 1.12 | our | 1.10 | love | 1.10
afford | 1.12 | mortgage | 1.10 | and now | 1.10
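The ratios reported in Tables A4a and A4b, and the 1.1 cutoff used to select terms, can be sketched as follows. This is a minimal illustration only. It assumes that p(word|class) is the share of loan descriptions in that class containing the term, which is one plausible reading of the table notation. The four-document toy corpora and the helper names `class_share` and `frequency_ratio` are hypothetical.

```python
def class_share(term, docs):
    """Share of documents (each a set of stemmed terms) that contain the term."""
    return sum(term in doc for doc in docs) / len(docs)

def frequency_ratio(term, docs_a, docs_b):
    """p(term | class A) / p(term | class B); None if the term never appears in B."""
    p_b = class_share(term, docs_b)
    if p_b == 0:
        return None
    return class_share(term, docs_a) / p_b

def select_terms(vocabulary, docs_a, docs_b, threshold=1.1):
    """Keep terms whose ratio in favor of class A meets the threshold."""
    selected = {}
    for term in vocabulary:
        ratio = frequency_ratio(term, docs_a, docs_b)
        if ratio is not None and ratio >= threshold:
            selected[term] = ratio
    return selected

# Toy corpora (hypothetical): each loan description is reduced to a set of terms.
repaid = [{"reinvest", "lend"}, {"reinvest"}, {"god"}, {"lend"}]
defaulted = [{"god", "reinvest"}, {"god"}, {"payday loan"}, {"god"}]

vocabulary = sorted(set().union(*repaid, *defaulted))
repaid_terms = select_terms(vocabulary, repaid, defaulted)      # Table A4a direction
defaulted_terms = select_terms(vocabulary, defaulted, repaid)   # Table A4b direction
print(repaid_terms)
print(defaulted_terms)
```

Terms that never occur in the comparison class have an undefined ratio and are skipped here. How such terms should be treated in practice is a separate modeling choice that this sketch does not settle.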


Table A5: Top 120 variables with the highest importance in the Random Forest analysis

Variables | Importance | Variables | Importance | Variables | Importance | Variables | Importance
LenderRate | 0.056502 | SMOG | 0.003126 | whi you | 0.002529 | look | 0.002311
Credit Grade A | 0.041125 | get back | 0.003051 | abl | 0.002518 | Gender Female | 0.002301
CreditGrade HR | 0.027996 | daughter | 0.003031 | graduat | 0.002515 | total | 0.002292
Amount Requested | 0.018964 | our | 0.003023 | never | 0.002509 | low | 0.002280
Credit Grade E | 0.012061 | student loan | 0.003009 | famili | 0.002508 | bill and | 0.002247
Credit Grade AA | 0.010833 | explain what | 0.002992 | will abl | 0.002507 | prosper loan | 0.002226
Borrower Homeownership | 0.009590 | start | 0.002965 | colleg | 0.002505 | you for | 0.002223
Credit Grade D | 0.007955 | behind | 0.002953 | for year | 0.002503 | son | 0.002221
Prior Listings | 0.007312 | due | 0.002948 | gas | 0.002469 | pay thi | 0.002218
payday loan | 0.005989 | interest rate | 0.002919 | last | 0.002446 | report | 0.002217
Credit Grade C | 0.005112 | view list | 0.002909 | medic | 0.002440 | paid off | 0.002210
Far West | 0.005093 | lend | 0.002902 | fund | 0.002436 | old | 0.002208
busi | 0.005073 | Spell Checker | 0.002831 | what | 0.002431 | list | 0.002200
Middle East | 0.005069 | card debt | 0.002821 | tri | 0.002412 | Rocky Mountain | 0.002183
borrow | 0.004620 | again | 0.002792 | loan and | 0.002404 | even | 0.002171
invest | 0.004574 | Age | 0.002774 | live | 0.002397 | use thi | 0.002169
than | 0.004494 | whi | 0.002758 | they | 0.002394 | god | 0.002164
Debt To Income | 0.004327 | them | 0.002754 | mortgag | 0.002374 | compani | 0.002163
# of words in title | 0.004297 | balanc | 0.002734 | pay for | 0.002370 | ha been | 0.002161
% words with 6 or more letters | 0.004277 | with the | 0.002706 | reinvest | 0.002370 | have never | 0.002150
hard | 0.004270 | estat | 0.002705 | who | 0.002368 | the balanc | 0.002148
Race - White | 0.004196 | what you | 0.002687 | and will | 0.002366 | these | 0.002148
thank you | 0.004131 | you are | 0.002673 | real estat | 0.002349 | see | 0.002116
person | 0.004131 | pay back | 0.002654 | into | 0.002346 | the time | 0.002115
payday | 0.004053 | pleas | 0.002644 | more than | 0.002341 | know | 0.002113
# of words in Description | 0.003848 | back thi | 0.002618 | just need | 0.002339 | rather | 0.002109
explain | 0.003794 | Race - African American | 0.002569 | purchas | 0.002335 | promis | 0.002103
Gender - Male | 0.003756 | score | 0.002556 | total monthli | 0.002334 | hi | 0.002093
save | 0.003476 | Plains Regions | 0.002548 | plan | 0.002328 | support | 0.002090
student | 0.003193 | and the | 0.002537 | husband | 0.002315 | give | 0.002085


Table A6: Summary statistics of dataset of all loan requests (n = 122,479)

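Variables | Min | Max | Mean | SD | Freq.
Amount requested | 1,000 | 25,000 | 7,411.1 | 6,189.4 |
Debt-to-income ratio | 0 | 10.01 | .54 | 1.33 |
Lender interest rate | 0 | .350 | .196 | .092 |
Number of words in description | 1 | 782 | 171.6 | 122.96 |
# Prior Listings | 0 | 67 | 0.90 | 2.06 |
Credit grade: AA | | | | | 0.026
Credit grade: A | | | | | 0.034
Credit grade: B | | | | | 0.055
Credit grade: C | | | | | 0.105
Credit grade: D | | | | | 0.160
Credit grade: E | | | | | 0.181
Credit grade: HR | | | | | 0.436
Loan status (1 = Funded, 0 = Expired) | | | | | 0.150
Loan image dummy | | | | | 0.498
Home owner dummy | | | | | 0.357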


Table A7a: Bi-grams that appeared frequently in funded loans

Bi-gram (funded) Ratio Bi-gram (funded) Ratio Bi-gram (funded) Ratio reinvest 4.70 com 1.73 averag 1.53 relist 3.36 below 1.71 off with 1.53 prosper lender 3.36 for prosper 1.71 add 1.53 excel credit 2.54 lender 1.71 review 1.53 prosper and 2.51 consult 1.70 remain 1.53 total expens 2.47 post 1.70 properti 1.53 bid 2.46 loan thank 1.69 list and 1.52 thi prosper 2.36 loan request 1.69 entir 1.52 group 2.27 miss payment 1.68 expect 1.52 feel free 2.24 fund 1.67 cover 1.52 invest 2.22 fulli 1.67 than the 1.51 card balanc 2.21 earli 1.66 i'll 1.51 revolv 2.19 previou 1.66 solid 1.51 question 2.17 www 1.65 addit 1.50 dti 2.12 have over 1.64 prosper will 1.50 have excel 2.11 public 1.64 spend 1.50 after tax 2.10 grade 1.64 share 1.49 verifi 2.09 intend 1.64 the minimum 1.49 ani question 2.08 and plan 1.64 annual 1.49 lend 2.07 can see 1.63 base 1.48 america 2.04 prosper payment 1.62 your consider 1.48 from prosper 2.03 have never 1.62 the cost 1.48 for consid 2.03 develop 1.62 never been 1.48 cash flow 2.02 late payment 1.62 tax 1.48 prosper loan 2.01 line 1.62 price 1.48 i'd 1.96 excel 1.61 comput 1.47 flow 1.95 the balanc 1.61 request 1.47 rental 1.94 plan pay 1.61 capit 1.47 equiti 1.92 monthli incom 1.61 paid full 1.47 cover the 1.89 expens are 1.60 site 1.47 bonu 1.89 you have 1.60 you can 1.47 origin 1.87 see have 1.60 avail 1.47 list 1.87 not includ 1.59 summer 1.46 incom after 1.86 lower interest 1.58 wed 1.46 misc 1.82 delinqu 1.58 system 1.46 answer 1.81 the remain 1.58 experi 1.46 record 1.81 firm 1.58 travel 1.45 contribut 1.81 easili 1.58 and thank 1.45 thank for 1.80 figur 1.57 cost 1.45 balanc 1.80 approxim 1.57 payment thi 1.45 borrow 1.79 side 1.56 plan 1.45 the prosper 1.79 usual 1.56 for your 1.45 card with 1.77 cash 1.56 worth 1.45 project 1.77 commun 1.56 minimum payment 1.44 with prosper 1.76 profession 1.56 schedul 1.44 engin 1.75 left over 1.55 earn 1.44 default 1.75 univers 1.55 for view 1.44 gross 1.74 case 1.54 student loan 1.44 never miss 1.74 
minimum 1.54 cell 1.43 rather 1.74 valu 1.53 term 1.43


Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio
again for | 1.43 | follow | 1.34 | the amount | 1.28
replac | 1.43 | account and | 1.34 | incom from | 1.28
ratio | 1.42 | salari | 1.34 | toward | 1.28
activ | 1.42 | decid | 1.34 | sale | 1.27
profil | 1.41 | consider | 1.34 | abov | 1.27
save | 1.41 | down the | 1.34 | paid off | 1.27
requir | 1.41 | detail | 1.34 | payment the | 1.27
loan becausei | 1.41 | amount | 1.33 | thank you | 1.27
loan from | 1.41 | june | 1.33 | such | 1.27
gener | 1.41 | the bank | 1.33 | deduct | 1.27
enjoy | 1.41 | graduat | 1.33 | client | 1.27
hello | 1.40 | for take | 1.33 | from the | 1.27
rate and | 1.40 | low | 1.33 | free | 1.27
the first | 1.40 | websit | 1.33 | between | 1.27
lower | 1.40 | loan credit | 1.33 | take the | 1.27
view list | 1.40 | save and | 1.33 | the next | 1.26
number | 1.40 | alreadi | 1.32 | higher interest | 1.26
debt incom | 1.40 | loan with | 1.32 | ga | 1.26
have ani | 1.40 | top prioriti | 1.32 | each | 1.26
been late | 1.39 | off thi | 1.32 | teacher | 1.26
guarante | 1.39 | etc | 1.32 | short | 1.26
first | 1.39 | quickli | 1.32 | less | 1.26
see | 1.39 | your | 1.31 | higher | 1.26
view | 1.39 | than | 1.31 | member | 1.26
you for | 1.39 | paid for | 1.31 | have veri | 1.26
immedi | 1.38 | each month | 1.31 | more than | 1.26
bank | 1.38 | book | 1.31 | extrem | 1.25
profit | 1.38 | combin | 1.31 | account | 1.25
credit histori | 1.38 | includ | 1.31 | use credit | 1.25
student | 1.38 | juli | 1.31 | limit | 1.25
will paid | 1.38 | total | 1.30 | histori | 1.25
have credit | 1.37 | contact | 1.30 | emerg | 1.25
inform | 1.37 | interest rate | 1.30 | offic | 1.25
groceri | 1.37 | never | 1.30 | level | 1.25
car insur | 1.37 | custom | 1.29 | last | 1.25
have alreadi | 1.37 | over year | 1.29 | ani | 1.25
year the | 1.36 | goe | 1.29 | thought | 1.25
one the | 1.36 | pretti | 1.29 | card have | 1.25
wife' | 1.36 | the other | 1.29 | expand | 1.25
payment prosper | 1.36 | electr | 1.29 | strong | 1.24
less than | 1.36 | three year | 1.29 | the end | 1.24
here | 1.35 | larg | 1.29 | charg | 1.24
alway pay | 1.35 | the interest | 1.29 | i'm not | 1.24
citi | 1.35 | automat | 1.29 | teach | 1.24
incom ratio | 1.35 | should | 1.29 | becausei | 1.24
pay down | 1.35 | servic | 1.29 | real estat | 1.24
market | 1.35 | pictur | 1.28 | the follow | 1.24
note | 1.35 | miss | 1.28 | complet | 1.24
cell phone | 1.34 | increas | 1.28 | oper | 1.24
bought | 1.34 | risk | 1.28 | payment will | 1.24
reduc | 1.34 | sourc | 1.28 | late | 1.24
current have | 1.34 | the busi | 1.28 | month that | 1.24


Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio
dure the | 1.24 | top | 1.18 | four | 1.14
close | 1.24 | contract | 1.18 | reflect | 1.14
with thi | 1.24 | around | 1.18 | fact | 1.14
locat | 1.24 | everyon | 1.18 | long | 1.14
estat | 1.24 | area | 1.18 | rest | 1.14
local | 1.23 | manag | 1.17 | remov | 1.14
offer | 1.23 | cours | 1.17 | while | 1.14
save for | 1.23 | far | 1.17 | wife and | 1.14
room | 1.23 | over the | 1.17 | the rest | 1.14
actual | 1.23 | the fund | 1.17 | the credit | 1.14
sold | 1.23 | industri | 1.17 | process | 1.14
bit | 1.22 | promot | 1.17 | may | 1.14
half | 1.22 | field | 1.16 | class | 1.14
success | 1.22 | deal | 1.16 | month ago | 1.14
auto | 1.22 | after | 1.16 | well | 1.13
consid | 1.22 | purchas | 1.16 | oblig | 1.13
return | 1.22 | month have | 1.16 | show | 1.13
use for | 1.22 | those | 1.16 | loan which | 1.13
about month | 1.22 | almost year | 1.16 | sinc | 1.13
payment have | 1.22 | the compani | 1.16 | doe not | 1.13
within | 1.21 | use the | 1.16 | approx | 1.13
next | 1.21 | the year | 1.16 | apart | 1.13
i'v | 1.21 | current employ | 1.15 | that the | 1.13
i'v been | 1.21 | your time | 1.15 | month for | 1.13
the monthli | 1.21 | year now | 1.15 | most | 1.12
mean | 1.21 | quit | 1.15 | read | 1.12
prioriti | 1.21 | tuition | 1.15 | stabl job | 1.12
friend | 1.21 | wonder | 1.15 | lesson | 1.12
card debt | 1.21 | period | 1.15 | same | 1.12
next year | 1.21 | continu | 1.15 | appli | 1.12
time everi | 1.21 | licens | 1.15 | time the | 1.12
owner | 1.21 | sell | 1.15 | year have | 1.12
are paid | 1.21 | compani for | 1.15 | further | 1.12
about year | 1.20 | except | 1.15 | debt free | 1.12
the new | 1.20 | compani | 1.15 | the last | 1.12
employe | 1.20 | drive | 1.15 | the time | 1.12
extra | 1.20 | major | 1.15 | expens and | 1.12
leas | 1.20 | benefit | 1.15 | into the | 1.12
per month | 1.20 | been employ | 1.14 | interest credit | 1.12
year with | 1.20 | reason | 1.14 | and i'm | 1.12
per | 1.19 | ad | 1.14 | normal | 1.12
happi | 1.19 | build | 1.14 | finish | 1.12
for over | 1.19 | card that | 1.14 | three | 1.12
thi year | 1.19 | car payment | 1.14 | that thi | 1.11
water | 1.19 | big | 1.14 | seek | 1.11
real | 1.19 | think | 1.14 | dure | 1.11
work the | 1.19 | perfect | 1.14 | posit | 1.11
commit | 1.19 | thi debt | 1.14 | retir | 1.11
veri respons | 1.19 | both | 1.14 | young | 1.11
loan off | 1.19 | off the | 1.14 | everi month | 1.11
grow | 1.19 | two year | 1.14 | august | 1.11
refin | 1.18 | payment for | 1.14 | time have | 1.11


Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio | Bi-gram (funded) | Ratio
the high | 1.11 | doe | 1.10 | their | 1.10
own | 1.11 | elimin | 1.10 | store | 1.10
unexpect | 1.11 | bill time | 1.10 | two | 1.10
few month | 1.11 | product | 1.10 | for our | 1.10
feel | 1.11 | good credit | 1.10 | prior | 1.10
down | 1.11 | incom and | 1.10 | mortgag | 1.10
within the | 1.11 | last year | 1.10 | coupl | 1.10
result | 1.11 | almost | 1.10 | attend | 1.10
the same | 1.11 | own home | 1.10 | the mortgag | 1.10
possibl | 1.11 | | | |


Table A7b: Bi-grams that appeared frequently in unfunded loans

Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio
situat explain | 4.18 | debt financi | 1.70 | and help | 1.41
loan explain | 4.12 | budget mortgag | 1.68 | god | 1.41
explain what | 3.90 | off all | 1.68 | stress | 1.40
explain whi | 3.88 | help get | 1.67 | like pay | 1.40
whi you | 3.84 | pleas help | 1.67 | that can | 1.40
what you | 3.74 | payment other | 1.63 | get thi | 1.40
for financi | 3.69 | veri hard | 1.63 | son | 1.38
for pay | 3.55 | take care | 1.62 | into one | 1.38
are good | 3.50 | off debt | 1.62 | loan pay | 1.38
explain | 3.49 | hard | 1.62 | afford | 1.38
you are | 3.26 | divorc | 1.62 | back the | 1.38
back thi | 3.09 | mother | 1.60 | money pay | 1.38
catch | 3.06 | need thi | 1.60 | get out | 1.38
you will | 2.58 | medic bill | 1.60 | off credit | 1.37
get back | 2.54 | honest | 1.60 | medic | 1.37
back track | 2.53 | can't | 1.59 | payday | 1.37
pay back | 2.44 | tri | 1.58 | job and | 1.37
loan monthli | 2.35 | good job | 1.57 | loan would | 1.36
someon | 2.29 | and just | 1.56 | the bill | 1.36
loan for | 2.28 | rent insur | 1.55 | loan consolid | 1.36
use thi | 2.27 | need pay | 1.51 | expens total | 1.36
behind | 2.26 | and need | 1.51 | child support | 1.36
chanc | 2.24 | clean | 1.50 | help pay | 1.35
just need | 2.21 | can get | 1.50 | got | 1.35
whi | 2.21 | ahead | 1.49 | have had | 1.35
off some | 2.01 | children | 1.48 | famili | 1.35
tri get | 1.97 | thing | 1.48 | better | 1.34
worker | 1.97 | prove | 1.48 | have learn | 1.34
one payment | 1.97 | surgeri | 1.48 | rebuild credit | 1.34
loan need | 1.90 | everyth | 1.47 | mistak | 1.34
track | 1.89 | mom | 1.47 | given | 1.33
bill and | 1.86 | kid | 1.46 | singl | 1.33
need help | 1.84 | mortgag rent | 1.45 | our credit | 1.33
what | 1.79 | daughter | 1.45 | and want | 1.32
lost | 1.78 | and had | 1.44 | debt that | 1.32
bad | 1.77 | rebuild | 1.44 | clear | 1.32
dont | 1.73 | want pay | 1.43 | went | 1.32
and start | 1.73 | work hard | 1.43 | child | 1.32
payday loan | 1.73 | can pay | 1.43 | debt and | 1.32
and get | 1.71 | hard work | 1.41 | abl pay | 1.32


Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio
when wa | 1.31 | and wa | 1.23 | car and | 1.17
life | 1.30 | loan back | 1.23 | not have | 1.16
give | 1.30 | realli | 1.23 | issu | 1.16
help out | 1.29 | ask for | 1.23 | them | 1.16
will abl | 1.29 | work and | 1.23 | could | 1.16
care | 1.29 | all credit | 1.22 | the past | 1.16
all debt | 1.29 | stand | 1.22 | have steadi | 1.16
and now | 1.29 | use help | 1.22 | will make | 1.16
and can | 1.28 | situat have | 1.22 | consolid credit | 1.16
difficult | 1.28 | and make | 1.22 | year | 1.15
paycheck | 1.28 | monthli budget | 1.21 | off high | 1.15
will help | 1.28 | have made | 1.21 | month monthli | 1.15
pay them | 1.28 | truck | 1.21 | would like | 1.15
consolid | 1.28 | year monthli | 1.21 | she | 1.15
turn | 1.27 | husband and | 1.20 | loan financi | 1.14
hospit | 1.27 | improv credit | 1.20 | the best | 1.14
start | 1.27 | him | 1.20 | sever | 1.14
help with | 1.27 | pass | 1.20 | assist | 1.14
seem | 1.27 | happen | 1.20 | job for | 1.14
past | 1.27 | due | 1.20 | file | 1.14
know that | 1.26 | father | 1.20 | and would | 1.14
husband | 1.26 | myself | 1.20 | will pay | 1.13
outstand | 1.26 | time job | 1.20 | and credit | 1.13
caus | 1.26 | problem | 1.19 | her | 1.13
want | 1.26 | have some | 1.19 | ga util | 1.13
dream | 1.25 | all the | 1.19 | that have | 1.13
right now | 1.25 | now and | 1.19 | say | 1.13
but have | 1.25 | the opportun | 1.19 | loan payment | 1.13
off and | 1.24 | and that | 1.18 | school and | 1.13
know | 1.24 | becaus the | 1.18 | look for | 1.13
job with | 1.24 | and work | 1.18 | find | 1.13
decis | 1.24 | now have | 1.18 | all our | 1.13
have one | 1.24 | old | 1.18 | and also | 1.13
the purpos | 1.24 | repair | 1.17 | meet | 1.13
some credit | 1.24 | live with | 1.17 | someth | 1.12
work full | 1.24 | and pay | 1.17 | god bless | 1.12
that are | 1.24 | bring | 1.17 | pay bill | 1.12
that need | 1.23 | need the | 1.17 | card financi | 1.12
all bill | 1.23 | gone | 1.17 | hi | 1.12
bless | 1.23 | greatli | 1.17 | payoff | 1.12


Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio | Bi-gram (unfunded) | Ratio
credit score | 1.12 | would have | 1.11 | let | 1.11
order | 1.12 | onc | 1.11 | money and | 1.10
work with | 1.12 | becom | 1.11 | and not | 1.10
money for | 1.12 | obtain | 1.11 | loan have | 1.10
direct | 1.12 | until | 1.11 | steadi | 1.10
unfortun | 1.12 | taken | 1.11 | and take | 1.10
pleas | 1.11 | and then | 1.11 | vehicl | 1.10
that would | 1.11 | won't | 1.11 | loan the | 1.10
incur | 1.11 | credit score | 1.12 | week | 1.10
pay all | 1.11 | score | 1.11 | |
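
As a rough illustration of how a ratio of this kind could be computed, the sketch below assumes that the ratio for a term is the share of loan requests in one group that contain it, divided by the corresponding share in the other group. That definition, the helper names `ngrams` and `occurrence_ratio`, and the toy texts are assumptions made for illustration; this section does not spell out the exact procedure behind the tables.

```python
from collections import Counter


def ngrams(text, n_max=2):
    """Return the set of uni-grams and bi-grams in a whitespace-tokenized text."""
    tokens = text.lower().split()
    grams = set(tokens)
    if n_max >= 2:
        grams.update(" ".join(pair) for pair in zip(tokens, tokens[1:]))
    return grams


def occurrence_ratio(group_a, group_b):
    """Share of group_a texts containing a term divided by the share in group_b.

    Assumed definition, for illustration only. Terms that never occur in
    group_b are skipped to avoid dividing by zero.
    """
    count_a = Counter(g for text in group_a for g in ngrams(text))
    count_b = Counter(g for text in group_b for g in ngrams(text))
    ratios = {}
    for term, a in count_a.items():
        b = count_b.get(term, 0)
        if b > 0:
            ratios[term] = (a / len(group_a)) / (b / len(group_b))
    return ratios


# Toy, already-stemmed texts (hypothetical, not taken from the data).
funded = ["thank you for view list", "lower interest rate", "thank you"]
unfunded = ["pleas help", "need help thank you", "god bless", "pleas help pay bill"]

ratios = occurrence_ratio(funded, unfunded)
for term, value in sorted(ratios.items(), key=lambda kv: -kv[1]):
    print(f"{term}: {value:.2f}")
```

Counting each term at most once per text keeps long requests from dominating the ratio; counting raw occurrences instead would be an equally plausible choice.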


Table A8: lists of the top 30 words with the highest relevance measure for each LDA topic

Topic: Employment and School | Relevance | Topic: Interest Rate Reduction | Relevance | Topic: Expense Explanation | Relevance
work | -0.42678 | debt | 1.12548 | expens | 0.97497
job | -0.42930 | interest | 0.98818 | explain | 0.79611
full | -0.86215 | rate | 0.95097 | cloth | 0.77201
school | -0.89078 | high | 0.93643 | entertain | 0.75424
year | -1.07493 | consolid | 0.87551 | cabl | 0.75197
colleg | -1.09435 | score | 0.86722 | whi | 0.70915
incom | -1.14446 | improv | 0.74883 | util | 0.70817
employ | -1.14622 | lower | 0.69804 | insur | 0.60867
student | -1.21860 | balanc | 0.66615 | monthli | 0.39539
part | -1.30739 | card | 0.64182 | cardsmi | 0.18691
financi | -1.33920 | histori | 0.62272 | purpos | 0.17408
steadi | -1.44914 | higher | 0.61019 | billsmi | 0.15935
stabl | -1.46063 | payoff | 0.59608 | hous | 0.15739
graduat | -1.48860 | reduc | 0.53922 | debtmi | 0.10787
loan | -1.50922 | elimin | 0.52508 | monthmonthli | 0.06338
secur | -1.53972 | minimum | 0.51484 | timemonthli | -0.00063
degre | -1.56564 | outstand | 0.51241 | situat | -0.00922
monthli | -1.58314 | low | 0.50942 | incomemonthli | -0.02425
educ | -1.59005 | rid | 0.50788 | iam | -0.02954
hour | -1.60727 | ratio | 0.50312 | loansmi | -0.04908
retir | -1.62923 | goal | 0.46530 | card | -0.06722
finish | -1.63329 | revolv | 0.44719 | loanmi | -0.06818
veri | -1.63949 | refin | 0.43334 | paymentmi | -0.08156
start | -1.65941 | incur | 0.39694 | businessmi | -0.09741
repair | -1.68777 | oblig | 0.37474 | consolidationmi | -0.10197
thi | -1.70717 | default | 0.36558 | loanmonthli | -0.11064
wed | -1.73917 | faster | 0.34038 | debtsmi | -0.12716
summer | -1.76588 | sooner | 0.32769 | incom | -0.15812
career | -1.78935 | miss | 0.32551 | yearsmonthli | -0.16602
buy | -1.79148 | apr | 0.31495 | cardmi | -0.16639


Topic: Business and Real Estate | Relevance | Topic: Family | Relevance | Topic: Loan Details and Explanations | Relevance | Topic: Monthly Payment | Relevance
busi | -0.72468 | bill | -0.80362 | loan | 0.02223 | month | -0.46588
purchas | -1.22119 | tri | -0.97478 | thi | -0.23886 | payment | -0.62683
compani | -1.24243 | famili | -1.06365 | becaus | -0.39797 | paid | -0.82288
invest | -1.45524 | life | -1.13109 | candid | -0.47236 | total | -1.03183
fund | -1.49916 | husband | -1.19515 | situat | -0.91190 | account | -1.03609
addit | -1.51557 | medic | -1.21212 | financi | -0.92870 | rent | -1.04262
properti | -1.55064 | thing | -1.29017 | purpos | -0.93624 | mortgag | -1.06368
market | -1.56694 | littl | -1.34994 | hous | -0.98446 | save | -1.14949
build | -1.57312 | realli | -1.35012 | expens | -1.11843 | list | -1.23411
cost | -1.61591 | care | -1.37252 | monthli | -1.12610 | everi | -1.27680
sell | -1.62173 | give | -1.38193 | incom | -1.55843 | payday | -1.29964
servic | -1.62783 | children | -1.41594 | entertain | -1.84658 | budget | -1.32913
sale | -1.64196 | hard | -1.43166 | cloth | -1.90259 | report | -1.33987
real | -1.71844 | daughter | -1.44836 | cabl | -1.92371 | bank | -1.36469
alreadi | -1.74378 | chanc | -1.45170 | util | -1.95598 | tax | -1.39207
success | -1.77930 | son | -1.46681 | insur | -2.02035 | includ | -1.42280
open | -1.77941 | money | -1.50048 | bill | -2.17176 | current | -1.43029
provid | -1.78311 | divorc | -1.52915 | card | -2.20032 | amount | -1.48126
base | -1.79599 | move | -1.54901 | canid | -2.73682 | check | -1.51827
home | -1.82519 | track | -1.55675 | ontim | -3.59224 | delinqu | -1.55218
estat | -1.83556 | everyth | -1.55791 | honest | -3.98226 | amp | -1.55378
profit | -1.85000 | abl | -1.57058 | thanksmonthli | -4.09312 | onli | -1.55612
grow | -1.86056 | bad | -1.57478 | alway | -4.30764 | day | -1.60811
offic | -1.87716 | mother | -1.58898 | vacat | -4.30948 | left | -1.60841
run | -1.88103 | home | -1.63274 | buy | -4.83280 | sinc | -1.69568
product | -1.88769 | child | -1.64554 | payback | -4.97132 | bankruptci | -1.69810
store | -1.89849 | kid | -1.67359 | trustworthi | -5.35212 | fee | -1.72871
rental | -1.90456 | put | -1.67912 | fix | -5.43237 | auto | -1.74107
industri | -1.91004 | pleas | -1.68211 | catch | -6.89838 | owe | -1.75320
area | -1.92412 | live | -1.68560 | track | -7.09870 | rebuild | -1.76605
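
Table A8 ranks words by a relevance measure without restating its formula here. One widely used definition, due to Sievert and Shirley (2014), scores word w in topic k as λ·log φ_kw + (1 − λ)·log(φ_kw / p_w), where φ_kw is the probability of the word within the topic and p_w is its marginal probability in the corpus. The sketch below assumes that definition; the weight `lam=0.6`, the function names, and the toy probabilities are illustrative assumptions, not values taken from the paper.

```python
import math


def relevance(phi_k, p, lam=0.6):
    """Relevance of each word for one topic (Sievert and Shirley 2014 form).

    phi_k: dict mapping word -> probability of the word within the topic.
    p:     dict mapping word -> marginal probability of the word in the corpus.
    lam:   weight between topic probability and lift (assumed value).
    """
    return {
        w: lam * math.log(phi_k[w]) + (1.0 - lam) * math.log(phi_k[w] / p[w])
        for w in phi_k
        if phi_k[w] > 0 and p[w] > 0
    }


def top_words(phi_k, p, n=30, lam=0.6):
    """Return the n (word, relevance) pairs with the highest relevance."""
    scores = relevance(phi_k, p, lam)
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)[:n]


# Toy probabilities, made up for illustration.
phi_topic = {"work": 0.30, "job": 0.25, "loan": 0.20, "thi": 0.25}
p_corpus = {"work": 0.05, "job": 0.04, "loan": 0.30, "thi": 0.20}

ranked = top_words(phi_topic, p_corpus, n=3)
for word, score in ranked:
    print(f"{word}: {score:.5f}")
```

Under this definition a word that is frequent everywhere (such as "loan" in the toy data) is pushed down the list even when it is common within the topic, which is why relevance is often preferred to raw topic probability for labeling topics.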


Figure A1: LDA analysis – selecting the number of topics based on perplexity

Note: We measure perplexity as $\text{perplexity} = -\dfrac{L(w)}{\text{count of words}}$, where $L(w)$ is the log-likelihood of the test data documents. Thus, perplexity is decreasing in likelihood (lower perplexity means better fit).

[Figure A1 plot: perplexity (y-axis, ticks from 285 to 320) against the number of topics (x-axis, 2 to 30).]
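
The selection rule behind Figure A1 can be sketched as follows: fit the topic model for each candidate number of topics, score each fit on held-out documents, and prefer the number of topics with the lowest perplexity. The sketch assumes perplexity is the negative held-out log-likelihood divided by the number of words in the test documents; the function names and the log-likelihood values are hypothetical and are not results from the paper.

```python
def perplexity(log_likelihood, word_count):
    """Negative held-out log-likelihood per word (assumed definition)."""
    if word_count <= 0:
        raise ValueError("word_count must be positive")
    return -log_likelihood / word_count


def select_num_topics(results):
    """Pick the number of topics with the lowest perplexity.

    results: dict mapping number of topics -> (log_likelihood, word_count),
             both measured on held-out test documents.
    """
    scores = {k: perplexity(ll, n) for k, (ll, n) in results.items()}
    best = min(scores, key=scores.get)
    return best, scores


# Hypothetical held-out log-likelihoods for three candidate models.
held_out = {
    5: (-58000.0, 10000),
    7: (-57000.0, 10000),
    10: (-57400.0, 10000),
}

best_k, scores = select_num_topics(held_out)
print(best_k, scores)
```

A common alternative definition exponentiates this per-word quantity; because the exponential is monotone, both versions rank candidate models in the same order.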