Cover Page – MSIN0128 Finance Research Project 2018-19
Title of Project: A Data-driven, Quantitative Approach: Determinants of Financial Vulnerability of Bank Customers
Date: 21.8.2019
Word Count: 14411


Dissertation
Title of Project: A Data-driven, Quantitative Approach: Determinants of Financial Vulnerability of Bank Customers
Date: 21.8.2019
Master thesis
London 2019
Supervisor: Angel Serrano Montilla
Abstract
This paper is concerned with finding the determining factors which impact the probability of a bank customer being financially vulnerable. In recent years, the Financial Conduct Authority (FCA) has released numerous regulatory reports regarding vulnerability. As a result, financial services providers need the tools necessary to identify vulnerable customers in order to comply with the new FCA regulations, improve their customers' experience, and improve their own reputation. A Linear Probability Model (LPM) was developed and estimated by Ordinary Least Squares (OLS). Nonlinear Probit and Logit models were then introduced to correct for some of the limitations of the LPM; they were estimated by the Maximum Likelihood Estimator (MLE). According to the R-squared and Akaike's Information Criterion (AIC), the Logit model fit the data best. The study found that younger and older customers are more likely to be financially vulnerable than middle-aged customers, ceteris paribus, and that 38 is the age at which an individual is least likely to experience financial vulnerability.
Keywords: vulnerability determinants, LPM, Logit model, Probit model, OLS, MLE
Referencing style: APA 6th edition
Declaration of Authorship
By submitting this dissertation, you are confirming that you have read and understood
UCL and the School of Management statements and guidelines concerning plagiarism
and you declare that:
• This submission is entirely my own unaided work.
• Wherever published, unpublished, printed, electronic or other information sources have been used as a contribution or component of this work, these are explicitly, clearly and individually acknowledged by appropriate use of quotation marks, citations, references and statements in the text.
Acknowledgements
I would like to express my gratitude to my supervisor Angel Serrano Montilla for his
guidance and experience. I would also like to thank Toru Kitagawa for his indispensable
advice regarding Logit and Probit models.
Contents
Glossary
2.1 Vulnerable customer defined
2.2 Financial Conduct Authority regulatory reports
2.3 Representative industry reports
2.4 Theoretical framework & Identifying vulnerability
2.4.1 Hedonic pricing model
2.4.2 Regression model for identifying vulnerability
Chapter 3
3.3 Description of variables
Chapter 4
4.2 OLS and the classical linear model assumptions
4.3 Probit and Logit Model
4.4 MLE and its assumptions
4.5 Stating hypotheses
5.2 Regression results
Glossary
Please note that the following glossary list ought to be used in conjunction with the
main body of the paper where the references are properly cited.
Akaike's Information Criterion (AIC): A statistic frequently used to compare how well alternative models fit the data, penalizing model complexity.
Binary Response Models (BRM): A family of econometric models whose dependent variable is binary.
Classical Linear Model (CLM): The CLM assumptions are a list of assumptions under which linear estimation methods such as OLS attain desirable properties.
Financial Conduct Authority (FCA): A government-approved regulatory body for financial markets and financial services firms in the UK.
Hedonic Pricing Model (HPM): A model based on the idea that the total price of a product can be decomposed into the contributions of the product's characteristics.
Linear Probability Model (LPM): The most basic econometric model with a binary dependent variable.
Maximum Likelihood Estimation (MLE): A method used to estimate the population parameters of a given distribution by maximizing a (log-)likelihood function.
Ordinary Least Squares (OLS): A method to estimate the population parameters of an econometric model by minimizing the sum of squared residuals.
The Office for Communications (Ofcom): A UK parliament-empowered regulator for
telecommunications, postal industries and broadcasting.
The Office of Gas and Electricity Markets (Ofgem): A regulator of electricity and
gas markets in the UK.
Chapter 1
1. Introduction
In the UK, the issue of how to identify and treat vulnerable customers in the financial services industry has come under considerable scrutiny in recent years. A customer can be vulnerable in many different ways; however, due to the limited data available for commercial or academic purposes, this paper focuses on the financial vulnerability¹ of bank customers. More specifically, we analyze the important risk factors that determine the probability of a given customer being financially vulnerable, make econometric predictions, and empirically test the hypothesis that younger and older customers are more likely to be financially vulnerable than middle-aged customers.
Research on vulnerability is crucial nowadays, as the number of vulnerable customers is growing each year. The FCA estimates that around 50% of UK adults exhibit one or more signs of potential vulnerability (Financial Conduct Authority, 2018b), yet banks and other financial services providers are merely at the beginning stages of understanding and addressing this problem (Lending Standards Board, 2016).
As of this date, there are no published academic papers on how to quantitatively identify financial vulnerability using econometric models such as the LPM, Probit or Logit. We attempt to fill this gap in the literature. The second contribution stems from the rather unique dataset that this paper uses to estimate these models. We discuss the data cleaning, processing and transformation in Chapter 3: Data.
This paper is structured as follows. Chapter 2 reviews the literature on vulnerable customers in the UK. Chapter 3 is concerned with data collection, cleaning, processing and transformation of the key variables; the dependent and explanatory variables are also defined and explained. Chapter 4 presents three binary response models (LPM, Probit and Logit) alongside the estimation methods (OLS, MLE), and develops and discusses the hypotheses. Chapter 5 reports the results of the estimated models, provides interpretations, and presents the findings from the hypothesis testing. Chapter 6 concludes and suggests possible further research.
¹ Note that from Chapter 3 onwards, we occasionally omit the adjective financial from the term financial vulnerability in order not to repeat ourselves. Therefore, when we write vulnerable, we mean financially vulnerable.
Chapter 2
2. Vulnerable customers in literature
In this chapter, we carry out a literature review on the concept of vulnerability
amongst customers in the UK financial sector.
Firstly, customer vulnerability is defined and some basic statistics underlining the importance of focusing more on vulnerable customers are presented. Secondly, we shed some light on the most significant regulatory reports published by the Financial Conduct Authority (FCA) in the past few years. Thirdly, a discussion of representative industry reports and academic literature about vulnerability follows. Finally, we scrutinize the Hedonic Pricing Model (HPM) and justify why our regression models for predicting vulnerability, in the econometric and statistical sense, were inspired by the HPM.
2.1 Vulnerable customer defined
The concept of vulnerable customers, consumers, or clients has been around ever since the first societies originated. It is only in recent years, though, that the first attempts to define the concept appeared in academic literature and regulatory reports. In the next few paragraphs, we will see that there is no single definition of vulnerability, nor are the existing definitions precise enough to allow for only one interpretation.
Being vulnerable simply means being susceptible to getting hurt emotionally, mentally or physically; a vulnerable individual is more susceptible to being attacked or influenced (Cambridge English Dictionary, 2019). This modern dictionary definition of vulnerability is already rather broad and spans many different instances of human interaction.
Since this paper is about vulnerable customers, we need to have a working
definition for this notion as well. In the Occasional Paper 8, FCA defines a vulnerable
customer as: “someone who, due to their personal circumstances, is especially
susceptible to detriment, particularly when a firm is not acting with appropriate levels of
care.” (Coppack et al., 2015). Another UK regulator, the Office for Communications (Ofcom), explains customer vulnerability less broadly, concentrating more on the implications of the term for the market which it regulates – it sees vulnerable customers as individuals who cannot participate in the communications markets due to characteristics such as low income, age or location (Ofcom, n.d.). Ofgem, the energy regulator, thinks about vulnerability similarly to Ofcom; the only difference is that Ofgem defines vulnerability in the context of energy markets – a vulnerable customer is a person who has difficulty representing his or her interests in the electricity and gas markets (Ofgem, 2013).
From the last two examples, it is apparent that FCA’s definition of vulnerable
customers is meant to be altered based on a particular industry or firm as customers might
be vulnerable in different ways.
Vulnerability can be permanent, temporary, hidden; it can be static or dynamic;
and it has many different forms which are overlapping (Financial Conduct Authority,
2019; Lending Standards Board, 2016; Coppack et al., 2015). Due to these reasons, since
2015, many firms in the UK have adopted the FCA definition with slight adjustments in
terms of narrowing the definition down depending on their business model (Just Group,
2019a).
In our opinion, no matter how a company defines vulnerability within its own business, the message from the FCA is clear: the definition has to reflect the fact that companies should not inflict harm on customers and that some customers might be more vulnerable than others. Our study will also follow this guideline.
As part of this subsection, we want to discuss the concept of financial difficulty. Likewise, there is no single definition of this concept; however, the Money Advice Liaison Group attempts to at least assess it according to subjective and objective measures. Subjective measures include, but are not limited to: the consumer's financial situation has become oppressive; the consumer is not able to assess his or her financial situation; the consumer regards his or her financial situation as uncontrollable. Objective measures may include volatile or low income; arrears on utility bills or credit; or a high proportion of income spent on repaying debt (Money Advice Liaison Group, 2015). The term financial difficulty, and how it can create or deepen vulnerability, is of the utmost importance for this thesis due to the nature of the data we have available for empirical testing. Our dependent variables of choice are binary variables which represent whether or not a given customer is in default, unemployed or vulnerable. Since most of the causes of various types of vulnerability are also their effects (Coppack et al., 2015; Money Advice Liaison Group, 2015), we argue that financial vulnerability, which stems from financial difficulty, manifests itself in other types of vulnerability as well, making it a decent proxy for our research. There will be a more detailed discussion of this topic in Section 2.4 of this chapter, as well as in Chapter 3: Data.
To reach a broader understanding and appreciation for studying vulnerable
customers, some recent statistics in relation to vulnerability are presented in Table 2.1.1.
Table 2.1.1: UK statistics related to vulnerability

Reference: Hollings et al. (2018)
Organization: Barclays and GfK
Statistics: The percentages of potentially vulnerable customers due to a life-changing event, low financial resilience, health, and low financial capacity are 55%, 48%, 20% and 8%, respectively.

Reference: Financial Conduct Authority (2018b)
Organization: FCA
Statistics: According to the Financial Lives Survey 2017, 50% of the UK adult population exhibit one or more signs of potential vulnerability.

Organization: UK
Statistics: In the UK, the number of new cases of cancer was approximately 360,000 in 2015 and was expected to increase.

Reference: ONS
Organization: Office for National Statistics
Statistics: By 2066, the UK population aged 65 or more is predicted to increase to 20.4 million, comprising 26% of the total population, which is a 6% increase from 2016 (ONS, 2018). One in eight older male workers has unpaid caring responsibilities (ONS, 2019).

Reference: Mental Health Foundation, n.d.; Dyslexia
Statistics: In the UK, approximately 1.5 million people live with some form of learning disability. For instance, dyslexia affects up to one in ten people and occurs on a continuum – some estimates claim one in five.

Reference: Small Business Prices (2019)
Organization: Small Business Prices
Statistics: The uncertainty around Brexit has so far caused roughly 224,000 people to become vulnerable due to job loss in the UK.

Note: Due to a different methodology, statistics from Barclays' commissioned study are not necessarily comparable to the ones found in the FCA's Financial Lives Survey 2017 (Hollings et al., 2018; Financial Conduct Authority, 2018b).
From these statistics, it is apparent that the scale of vulnerability is extensive and appears to be increasing. Although the FCA's claim that roughly half of the population is potentially vulnerable might be an exaggeration, firms should still invest considerable resources into identifying and supporting vulnerable customers. Based on the available literature, this indeed appears to be happening, as the Lending Standards Board (2016) found some evidence that the way firms treat vulnerable customers has been improving.
2.2 Financial Conduct Authority regulatory reports
In this section, we summarize the most recent regulatory reports regarding vulnerability; see Table 2.2.1. Note that this is not an exhaustive list but rather a representative one. We consider a report representative if the phrase vulnerable customer appears frequently in it or if a whole chapter of the report is dedicated to the topic.
Table 2.2.1: FCA regulatory reports about vulnerable customers

Reference: Financial Services Authority (2011)
Nature of the message: regulation
Content: The FSA said that it is their priority to get a fair deal for all customers. They pointed out that the financial market lacks the will to serve customers in vulnerable circumstances. Furthermore, there is a visible absence of consistency when dealing with vulnerable customers.

Nature of the message: FCA speech – the consumer experience
Content: Woolard from the FCA delivered a speech on the theme of the consumer experience. About one-fifth of the whole speech was dedicated to vulnerability. He first described what vulnerability is, then gave reasons why the FCA cares, then told the stories of two vulnerable customers, and finished by calling on all financial services firms to create strategies that would address the needs of vulnerable customers. This speech was the turning point in the UK from which vulnerability became a more common topic of scrutiny.

Reference: Rowe et al. (2014)
Nature of the message: FCA's commissioned study – Vulnerability exposed
Content: ESRO conducted qualitative research – interviews, group discussions, case studies – on behalf of the FCA. They observed that firms find it difficult to balance empowering and protecting customers. Furthermore, inadequate behaviour by a firm can create or worsen financial difficulty, causing vulnerability. ESRO recommends that all firms focus on identification, better clarity of their products, and building trust with their customers.

Reference: Coppack et al. (2015)
Nature of the message: FCA Occasional Paper 8 – Consumer vulnerability
Content: As of this date, this has been the most influential regulatory report on vulnerability written so far. The FCA argues that many financial products have been developed for a so-called average or typical consumer without taking into account anomalies which are more common than originally believed. The first objective of the report was to broaden understanding and begin discussions about vulnerability. The second objective was to help firms develop and implement vulnerability strategies. Also, the well-known definition of vulnerability that has been adopted by many financial services providers can be found in this report.

Reference: Financial Conduct Authority (2017b)
Content: A report focused on a specific demographic – older consumers. Besides suggesting ways to identify and comprehend the needs of the older population, it also analyses financial exclusion, which is most commonly seen amongst the elderly. The research points out that whilst older customers do not necessarily have to be vulnerable, they are more likely than middle-aged people to experience vulnerability due to numerous factors such as deteriorating health, lower resilience and financial knowledge, and life events. Age is an important predictor of vulnerability in our model, see Chapters 3 and 4.

Nature of the message: FCA Mission – Approach to Consumers
Content: Similar to any company that writes a mission and vision statement which determines the culture and principles of the company, the FCA set out its mission and vision regarding numerous affairs in this report. The FCA's vision covers how to best protect customers, the legal and regulatory framework, and markets working for customers.

Reference: Financial Conduct Authority (2019)
Nature of the message: Business Plan 2019/2020
Content: FCA Business Plans set out the current objectives and areas of focus for the following year. Unlike last year's Business Plan 2018/2019 (Financial Conduct Authority, 2018a), this year's business plan was directed more towards Brexit and regulating cryptoassets. Despite this, the FCA mentioned that their cross-sector priorities include further encouraging the identification and fair treatment of vulnerable customers and employing technology, in the sense of creating statistical models, to do so. Most firms, charities and government organizations have so far written guidelines on how to identify a vulnerable customer by means of human interaction. In a data-driven world, the next obvious step to enhance this process is the use of technology, and this is what our paper attempts to do.

Note: This table is divided into the 'reference' part for bibliographic purposes, the 'nature of the message' part containing the name of the regulator and the name of the report, and the 'content' part, which is a short summary of the report, our discussion, and how it relates to this paper. Moreover, the information about the first reference came from the FCA Occasional Paper 8, where the FSA is described (Coppack et al., 2015).
Even though the content of Table 2.2.1 is self-explanatory, a few remarks are in order. First, an attentive reader might have noticed that we used the name Financial Services Authority (FSA) for the first, 2011, reference. This is not an accident; the FSA performed many of the financial regulatory functions in the UK until 2013, when it was dismantled and replaced by the FCA. Second, FCA Mission – Approach to Consumers was originally published in November 2017 under the name Future Approach to Consumers. This version was distributed to various stakeholders, and after useful feedback was collected, the final version was published in July 2018. Lastly, the FCA publishes its business plan every year; we only included the most recent one.
Figure 2.2.2: Timeline of FCA regulatory reports – February 2011, December 2014, February 2015, September 2017, July 2018, April 2019.
2.3 Representative industry reports
In the first part of this section, we summarize the most important industry reports and academic papers written on the topic of vulnerable customers in recent years, arranged in chronological order.
Seiders et al. (1998) conducted a study on whether fair behaviour by a company has an impact on customers' long-term loyalty. They found that it does, especially amongst vulnerable or disadvantaged customers, who, on average, react much more strongly to acts of unfairness, occasionally even wanting to retaliate against the company. Moreover, it was found that fairness is especially essential for service-based companies built on intangible assets – a description that fits banks well. The methodology was based on a qualitative part as well as a quantitative experimental phase. The limitation of the qualitative part was, in our opinion, the high proportion of the sample allocated to students (72%); it would be preferable to draw a random sample of the population so that other consumer segments are represented as well. As a result, we cannot be completely certain whether the results generalize to the whole industry. Nevertheless, the findings are still interesting. The implication for financial services providers is that treating vulnerable customers fairly ought to improve long-term profitability.
Waddington (2013) discussed how well EU law protects vulnerable customers. The findings suggest that a 'one size fits all' category is inadequate for identifying and helping all the different types of vulnerable customers. This notion is related to the 'average/typical customer' phrase so often used by the FCA. A better segmentation of the different causes of vulnerability would provide a better level of protection. This working paper was published in August 2013, only three months before Woolard from the FCA gave his speech addressing this issue in the UK, see Section 2.2.
In the same year as the FCA Occasional Paper 8 was released, the Money Advice Liaison Group (2015) published their guidelines on treating vulnerable customers fairly, covering staff training, data protection, considerations when 'writing off' debt, and more.
The Lending Standards Board (2016) published their industry report in April 2016, outlining good practices and recommendations. The research found that most companies are in the early stages of working out how to deal with vulnerable customers fairly. Nonetheless, they recognize the importance of identifying vulnerable customers and developing strategies for them. This industry report shows that there has been at least some small progress in tackling vulnerability since the FCA Occasional Paper 8 was published.
Good Practice Guide: Meeting the Needs of Vulnerable Clients is a well-researched 9-page summary on vulnerability. The topics discussed are the regulatory framework, defining and identifying vulnerability, good practices, and steps for creating a vulnerable customer strategy for a given company (Personal Finance Society, 2017). Notwithstanding the fact that this industry report was written two years ago, we believe it is still the single best summary of what any company needs to know about vulnerability.
Hollings et al. (2018), commissioned by Barclays, conducted a study of consumer views on identifying vulnerability using consumer data. The results show that the majority of customers (54%) are 'very or fairly comfortable' with banks and building societies using their data to make recommendations tailored to their needs, compared to other organizations where the trust level is much lower (35%). Furthermore, the research suggests that banks should take decisive action through direct contact when the data show actual financial detriment. However, for potential financial detriment, the findings are mixed – some customers would be glad for the bank to contact them, others would see it as a violation of their data privacy. The same holds for potential vulnerability. In our opinion, it is of paramount importance for banks to invest more resources in building better relationships, and hence trust, with their customers. Until this occurs, it is not practical to implement predictive models within banks that would assist specialized, trained staff with identifying not only current but also potential vulnerability. There is some evidence that Barclays took the results of this study seriously, since only four months after the publication a newspaper article was released detailing a new customer phone application that allows customers to block spending with certain retailers. Even though this application is welcome to any customer who wants more control over his or her spending, it is particularly beneficial for vulnerable customers who might have, for instance, a bipolar condition that could cause them to go on a 'shopping spree' (“Barclays Becomes …”, 2018; Money Advice Liaison Group, 2015). On the 11th of December 2018, Barclays became the first bank to provide such a service, which certainly paints a good picture regarding trust and data privacy.
Dealing with Vulnerable Customers: The Industry Response is the most recent industry report on vulnerability in the UK. From the survey and interviews conducted, it is evident that companies have made a lot of progress since 2015; nevertheless, there is still room for considerable improvement. In particular, firms claimed that the FCA's broad definition of vulnerability had been useful for identifying and treating all the different segments of vulnerability; however, due to the large range of segments, vulnerability policy often differs across firms and sometimes even between departments within one firm. Moreover, only a few respondents said that their firm has a dedicated budget for tackling vulnerability. The most surprising result was that companies identified less than 5% of their customer base as vulnerable. This is a very small share compared to the FCA's finding that around 50% of customers are potentially vulnerable. This might be because, even in large firms, only a small fraction of customers contacts their provider, making identification more difficult (Just Group plc., 2019a; Financial Conduct Authority, 2018b). This is where, we believe, data-driven predictive models could assist front-line staff. A limitation of the interviews was that the responding firms were not selected at random but were self-selected; hence, the findings might be biased upwards or downwards and not representative of the whole population. To confirm the validity of the interviews' findings, a more representative survey was conducted as well. In most cases, the interview and survey results match to a reasonable extent (Just Group plc., 2019a).
In the second part of this section, we look at news articles which have mentioned vulnerable customers in the past three years; see Table 2.3.2 for more details. Overall, there are signs of both progress and deterioration.
Table 2.3.2: News reports

Reference: (Wallace, 2016)
News media: The Telegraph
Title: February 2016: 'Banks ordered to treat vulnerable customers with empathy.'

Reference: (2017)
News media: Banking Newslink
Title: February 2017: 'Santander is first UK bank to enable customers to make payments using their voice.'

Reference: (“The elderly …”, 2017)
News media: The Economist
Title: February 2017: 'The elderly, cognitive decline and banking.'

Reference: (Fernyhough, 2017)
News media: Financial Times
Title: April 2017: 'Banks accused of failing millions of vulnerable customers.'

Reference: (“Barclays Becomes …”, 2018)
Title: December 2018: 'Barclays Becomes First UK High Street Bank to Enable Customers to Stop Transactions at Chosen Retailers to Give Vulnerable Customers Greater Control Over Their Money.'

Note: To access the links to these news articles, see the Bibliography.
2.4 Theoretical framework & Identifying vulnerability
2.4.1 Hedonic pricing model
There are various applications of the Hedonic Pricing Model (HPM), ranging from determining average restaurant meal prices to valuing properties in the real estate market by underlying factors such as the proximity of an urban forest (Yim et al., 2014; Tyrväinen, 1997).
In general, the HPM can be expressed in the following way:

$$P = f(x_1, \dots, x_n) + \varepsilon \qquad (1)$$

where $P$ is the dependent variable; $f(x_1, \dots, x_n)$ is an implicit function of the characteristics $x_1, \dots, x_n$ that decompose the total value; and $\varepsilon$ is the error term capturing unobserved or unmeasurable characteristics (Monson, 2009; Li et al., 2018).
As an illustration, consider the total price of an apple in a grocery store. Commonly, there are many different types of apples with various price tags, ranging from the cheapest to the most expensive. If we wanted to determine the underlying reasons behind the different prices, we would have to identify the fundamental factors that explain the overall price; for instance, the weight, colour or shape of an apple are just some of the characteristics that could be considered.
Once we have identified all the important factors, there are numerous econometric and statistical methods to estimate the general regression equation (1). The most common one is the Ordinary Least Squares (OLS) regression method (Monson, 2009). In the 'apple' example, we would obtain estimates of how much an extra unit of weight contributes to the overall price, what the marginal prices of different colours are, and so on.
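To make the 'apple' illustration concrete, the following is a minimal Python sketch of a hedonic OLS regression on made-up apple data; the weights, colours and prices below are purely hypothetical and are not part of this study's dataset.

```python
# Illustrative sketch only: a toy hedonic regression for the 'apple' example.
import numpy as np
import pandas as pd
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
apples = pd.DataFrame({
    "weight_g": rng.normal(180, 25, n),   # hypothetical characteristic: weight in grams
    "red": rng.integers(0, 2, n),         # hypothetical characteristic: red colour dummy
})
# Hypothetical 'true' hedonic price: base price + weight effect + colour premium + noise.
apples["price"] = 0.10 + 0.004 * apples["weight_g"] + 0.15 * apples["red"] \
    + rng.normal(0, 0.05, n)

X = sm.add_constant(apples[["weight_g", "red"]])  # add the intercept
hedonic = sm.OLS(apples["price"], X).fit()
print(hedonic.params)  # estimated implicit (marginal) prices of the characteristics
```

The estimated coefficients play the role of the implicit marginal prices of the characteristics in equation (1).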
2.4.2 Regression model for identifying vulnerability
Our regression models, which attempt to identify and quantify the determining factors of vulnerability, are inspired by the HPM; see Chapter 4: Methodology.
Just as the HPM decomposes the total price of an apple into the marginal prices of the characteristics that make it up, we attempt to find the risk factors that influence the probability of a given customer being vulnerable. In the literature on vulnerable customers, these factors are also referred to as indicators, determinants, characteristics, customer segments, features or components, and we use these terms interchangeably in the remainder of this paper. Below, in Table 2.4.1, we present a list of risk factors identified by the FCA (Coppack et al., 2015). Although not exhaustive, the list is a good starting point.
Table 2.4.1: Risk factors
Physical disability
Being older or younger
Lack of English language skills
Non-standard requirements or credit history
Source: FCA Occasional Paper 8 (Coppack et al., 2015), page 23. Note: Non-standard requirements or credit history includes, for example, recent immigrants, ex-offenders, care-home leavers, and soldiers returning home.
It is important to mention that customers are often exposed to multiple risk factors. Moreover, these factors are often correlated with each other to some extent, with one factor leading to another (Coppack et al., 2015). For example, older customers are more likely to suffer from a long-term illness or bereavement; likewise, a consumer with low literacy skills is more likely to be unemployed (Rowe et al., 2014). Due to this interconnectedness, it is important to apply a method that isolates each factor separately, providing us with marginal effects. Fortunately, this is exactly what the HPM and hedonic regression were designed to do, as we demonstrate in more detail in Chapter 4: Methodology and Chapter 5: Results.
This was the last section of the literature review chapter; we dive into the data next.
Chapter 3
3. Data
Data are the bread and butter of any empirical study. In this chapter, we first explore the origin of our data and discuss its limitations. Second, data cleaning, processing and transformation are performed and discussed. Third, once the dataset is prepared, both the dependent and explanatory variables are properly defined, and descriptive statistics, frequency, and correlation matrix tables are presented.
3.1 Data: source, description and limitations
Rarely does an academic find a usable, real-world dataset containing characteristics of bank customers, for reasons ranging from data privacy issues to the understandable secrecy of banks. Although not ideal, we decided to use a dataset² from a Portuguese banking institution, collected between May 2008 and November 2010. Moro et al. (2011) used this dataset to help banks analyse the effectiveness of their direct marketing campaigns. The aim of their research was to find out which customer segments are the most likely to make a deposit and what the contributing factors are; this would allow banks to save money, resources and human effort by concentrating on customers who actually want to place a deposit.
Despite the fact that the two problems are rather similar in nature, the dataset was not collected for the purpose of identifying financially vulnerable customers. Nevertheless, we believe that the available variables are sufficient to build a model with decent predictive power; see Table 3.1.1. Note that for our purposes we select only a subset of these variables. The justification, reasoning and descriptive statistics are provided in Section 3.3.
Optimally, we would like to have as many customer attributes as possible, especially the ones suggested by the FCA (see Chapter 2, Section 2.4.2). However, these types of data are not available for academic or commercial purposes at this time. Suggested further research is discussed in the last chapter of this paper.
² Note that the data should be reliable since the source is the UCI Machine Learning Repository.
Table 3.1.1: Bank client variables

Age – Numeric variable: age of a given customer in years
Job – Categorical variable: type of job (entrepreneur, student, blue-collar, self-employed, retired, technician, services, admin, unknown)
Marital – Categorical variable: marital status (single, married, divorced/widowed)
Education – Categorical variable: level of education (primary, secondary, tertiary)
Default – Binary variable: credit in default (yes, no)
Balance – Numeric variable: average yearly balance in euros
Housing – Binary variable: housing loan (yes, no)
Loan – Binary variable: personal loan (yes, no)

Note that the original dataset contains 17 variables in total, of which 8 are bank client data (the ones listed above) and 9 are variables related to the bank marketing campaign (these have no value to us). Further, note that the variable description list on the UCI Machine Learning Repository website is for a different dataset, bank-additional-full.csv; this paper uses the dataset named bank-full.csv. To obtain the variable descriptions, go to the UCI Machine Learning Repository website » Data Folder » bank.zip » bank-names.txt.
The second limitation stems from the fact that we do not have a random sample, but rather the population of customers of one particular Portuguese bank; that is, we have the characteristics of all the customers who were clients of the bank over the two-year period. For this reason, generalizing the results to, for example, the whole banking industry in Portugal is not possible, nor is it feasible to generalize the findings to the EU or UK banking industry. We can only argue that the underlying risk factors behind vulnerability are universal across countries and banks, and hence should hold, to some extent, for UK banks as well.
The original number of observations was 79,354, but due to missing values it was reduced to 45,211 (Moro et al., 2011). The dataset is in cross-sectional format.
3.2 Data: cleaning, processing and transformation
We mentioned in the previous section that not all variables are going to be used; in fact, only a few are useful for our research. In this section, the data cleaning process is explained, which includes the transformation of the dataset and the redefinition of some of the variables so that they can be used in the regression models. We use Python to clean, process and transform the dataset; Jupyter is employed to present how it was done, and all the steps with detailed comments can be found in Appendix A. In this section, only the most important aspects are highlighted.
The original dataset that Moro et al. (2011) used has 45,211 observations and 17 variables. After eliminating the semicolons separating the entries, the first 5 observations are presented in Figure 3.2.1.
Figure 3.2.1: Original dataset
The next step was to rename the variable loan to pers, drop the unnecessary variables, remove observations that contain 'unknown', and replace 'yes' and 'no' with 1 and 0 in each entry for regression purposes.
After that, we transformed two categorical variables: job and marital. Job was redefined to hold only two values: 1 if the customer is employed, and 0 otherwise; for better clarity, we renamed this variable employed. Likewise, marital was redefined to take only two values: 1 if a customer is married, and 0 otherwise; we called this new variable married.
Finally, a completely new variable was defined. Vulnerable takes on two values: 1 if a given customer is unemployed or in default, or both, and 0 otherwise. Figure 3.2.2 depicts the final dataset that is used in this research.
Figure 3.2.2: Final dataset: ‘dataset_1.csv’
The final number of observations is 43,193, due to the reduction caused by
omitting ‘unknown’ values. See Appendix A for a detailed explanation and code.
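The following is a minimal pandas sketch of the cleaning steps just described; the authoritative, fully commented version is in Appendix A. It assumes the raw UCI file bank-full.csv (semicolon-separated) and that the non-employed group consists of the 'unemployed', 'student' and 'retired' job categories.

```python
import pandas as pd

df = pd.read_csv("bank-full.csv", sep=";")  # raw UCI file, semicolon-separated

# Keep the bank-client variables used in this paper and rename loan to pers.
df = df[["age", "job", "marital", "education", "default", "balance",
         "housing", "loan"]].rename(columns={"loan": "pers"})

# Remove observations containing 'unknown' values.
df = df[~df.isin(["unknown"]).any(axis=1)]

# Replace 'yes'/'no' with 1/0 for regression purposes.
for col in ["default", "housing", "pers"]:
    df[col] = (df[col] == "yes").astype(int)

# Derived dummies: employed, married, and the combined vulnerable indicator.
df["employed"] = (~df["job"].isin(["unemployed", "student", "retired"])).astype(int)
df["married"] = (df["marital"] == "married").astype(int)
df["vulnerable"] = ((df["employed"] == 0) | (df["default"] == 1)).astype(int)

df = df.drop(columns=["job", "marital"])
df.to_csv("dataset_1.csv", index=False)
```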
3.3 Description of variables
The final set of all the variables used in this paper is presented below. The descriptive statistics and frequency tables, which are presented and discussed at the end of this section, were created with Stata 13.
Dependent variables. Default is one of the binary dependent variables in our binary response models. It takes a value of 1 if the customer has credit in default, such as a credit card or mortgage payment, and 0 otherwise. It is arguable whether Default is a good proxy for vulnerability, as there are some circumstances, for example long-term illness, which cause a customer to be vulnerable without necessarily defaulting on their debts. On the other side of the coin, we reasoned in the literature review that all vulnerabilities are interconnected, causing each other; it is likely that someone with a long-term illness eventually defaults, due to perhaps losing their job or additional spending on medication. It is reasonable to assume that most vulnerabilities eventually manifest themselves in some kind of financial difficulty, and this is why Default was chosen as one of the proxies. The second binary dependent variable that we would like to examine is Unemployed. This variable takes a value of 1 if the customer is unemployed, and 0 otherwise; in this paper, unemployed includes students and retirees. For the same reasons as above, we believe that Unemployed can be a good proxy for measuring the probability of a given customer being vulnerable. Finally, due to the interconnected nature of vulnerabilities, we decided to combine these two variables and create a variable called Vulnerable – see the explanation of this variable in the previous section. We are aware that any single regression equation cannot have more than one dependent variable; the motivation for defining not one but three dependent variables in this paragraph is to run multiple regression models and see which one has the most explanatory power with statistically significant estimates.
Age. A numerical variable that indicates the age of a customer. It is expected that different age groups are vulnerable to different degrees; specifically, the young and the elderly could be more susceptible than the middle-aged group, see Chapter 2. We will also square this variable to see whether there is a positive quadratic (U-shaped) relationship between Age and the probability of being vulnerable, see Chapter 4: Methodology & Empirical Framework.
Balance. Another numerical variable. It measures the average yearly balance in a customer's bank account; the unit of measurement is euros.
Education. An ordinal variable representing the level of education: primary, secondary, tertiary. Each level of education is a dummy variable equal to 1 or 0; for example, primary takes a value of 1 if the customer has at least primary education, and 0 otherwise. Since our model will have an intercept, we would be making a mistake not to select a base dummy variable for Education; hence primary was set as the benchmark group. This problem is well known in the econometrics literature and is referred to as the dummy variable trap in models which have an intercept. The dummy variable trap is a scenario in which perfect multicollinearity occurs amongst two or more regressors; in simple terms, one variable can be perfectly determined from the others (Wooldridge, 2016).
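As a minimal illustration of how such education dummies can be constructed while avoiding the dummy variable trap, the sketch below (using a toy frame standing in for the cleaned dataset of Section 3.2) drops the primary category so that it serves as the benchmark group.

```python
import pandas as pd

# Toy frame standing in for the cleaned dataset of Section 3.2.
df = pd.DataFrame({"education": ["primary", "secondary", "tertiary", "secondary"]})

# 'primary' is first alphabetically, so drop_first removes it and makes it the
# benchmark group, avoiding the dummy variable trap in a model with an intercept.
dummies = pd.get_dummies(df["education"], prefix="edu", drop_first=True, dtype=int)
df = df.join(dummies)  # adds the 0/1 columns edu_secondary and edu_tertiary
print(df)
```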
The following are the five remaining explanatory dummy variables: Housing is a
dummy variable equal to 1 if the customer has a mortgage, and 0 otherwise. Pers is a
dummy variable that takes a value of 1 if the customer has a personal loan, and 0
otherwise. Married is a dummy variable equal to 1 if the customer is married, and 0
otherwise. Default is a dummy variable that takes a value of 1 if the customer is in
default, and 0 otherwise. Employed is a dummy variable equal to 1 if the customer has a
job, and 0 otherwise. Note that the last two variables are only used as explanatory in
models that do not have them as the main dependent variable.
Table 3.3.1: Descriptive statistics
Mean Median St. Dev. Min Max
Age 41 39 11 18 95
Balance 1,354 442 3,042 -8,019 102,127
Default 0.02 0 0.13 0 1
Housing 0.56 1 0.50 0 1
Married 0.60 1 0.49 0 1
Employed 0.90 1 0.30 0 1
Vulnerable 0.11 0 0.32 0 1
Pers 0.16 0 0.37 0 1
Note: Age and Balance were rounded to the nearest whole number. Default, Housing, Pers, Married, Employed and
Vulnerable were rounded to two decimal places.
For the variables Age and Balance, the mean is larger than the median, which indicates that their distributions are positively skewed. The oldest customer is 95 years old and the youngest is 18, which is the minimum age required to hold a bank account. The highest average annual balance is around €102,000 and the lowest is negative €8,000.
For the dummy variables Default, Housing, Pers, Married, Employed and Vulnerable, the mean shows the proportion of the attribute in the entire dataset. For example, merely 2% of customers are in default compared to 98% who are not. This imbalance could be highly problematic if we had a low number of observations, which fortunately is not our case. Further, in some cases this can cause the R-squared of the regression to be very low. If this problem occurs, we discuss it further in Chapter 5: Results & Discussion.
Table 3.3.2: Frequency table – Education
Frequency Percentage Cumulative
Note: Rounding was done to the nearest whole number.
The table above shows the number of observations in each education category. The most representative group is secondary (53%); this means that the majority of customers have completed at least secondary education. On the other hand, the least educated customers are in the minority. Finally, roughly one-third of all customers graduated from at least one university.
Table 3.3.5: Correlation matrix

|            | Age      | Balance  | Default  | Housing  | Pers     | Married  | Employed | Vulnerable |
| Age        | 1        |          |          |          |          |          |          |            |
| Pers       | -0.01*** | -0.08*** | 0.08***  | 0.04***  | 1        |          |          |            |
| Married    | 0.28***  | 0.03***  | -0.01*** | 0.02***  | 0.04***  | 1        |          |            |
| Employed   | -0.25*** | -0.04*** | 0.01***  | 0.18***  | 0.06***  | 0.02***  | 1        |            |
| Vulnerable | 0.22***  | 0.01***  | 0.38***  | -0.17*** | -0.02*** | -0.06*** | -0.90*** | 1          |

Note: Numbers were rounded to two decimal places. Significance is denoted by asterisks: * p < 0.05, ** p < 0.01, *** p < 0.001.
Table 3.3.5 depicts the correlations between all the variables of interest. Looking at the correlations, it can be seen that most of them are highly statistically significant, the exception being the correlation between Housing and Default. We are most interested in the correlations between our dependent variables and the explanatory variables in the model. Glancing at the table, it seems that a model with Vulnerable as the dependent variable could have the highest explanatory power. This, however, needs to be confirmed in Chapter 5: Results & Discussion, where the R-squared is computed.
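The correlation table itself was produced in Stata 13; a rough Python equivalent, assuming the cleaned file dataset_1.csv with lower-case column names, would compute pairwise Pearson correlations and p-values as follows.

```python
import pandas as pd
from scipy.stats import pearsonr

df = pd.read_csv("dataset_1.csv")  # cleaned dataset from Section 3.2 (assumed column names)
cols = ["age", "balance", "default", "housing", "pers",
        "married", "employed", "vulnerable"]

for i, a in enumerate(cols):
    for b in cols[i + 1:]:
        r, p = pearsonr(df[a], df[b])
        print(f"corr({a}, {b}) = {r:+.2f}  (p-value = {p:.3f})")
```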
Figure 3.3.6: Densities and scatter plots of Age and Balance grouped by Default
Looking at the scatter plot of Balance against Age, the first observation is that most customers who experienced default have an average annual balance of around zero. A more interesting observation is that defaults do not seem to occur, apart from one exception, above 60 years of age. The general trend is that Balance tends to shrink after customers retire, which is to be expected. Regarding the Age density plot, customers who are in default appear to be more normally distributed in terms of their age; however, even without conducting any formal statistical test, the distributions are far from normal. This statement is even more true for the Balance density plot, as the kurtosis is very large and the tails are extremely heavy.
Chapter 4
4. Methodology & Empirical framework
In this chapter, we introduce the econometric models and two estimation methods, and state our research questions. First, the intuition and derivation of the Linear Probability Model (LPM) is presented in the context of our dependent and explanatory variables. Second, Ordinary Least Squares (OLS), the most commonly used estimator of the LPM, is introduced together with the classical linear model assumptions. Third, we present the Probit and Logit models, introduce the Maximum Likelihood Estimator (MLE), and discuss its assumptions. Finally, we elaborate on our hypotheses and the expected results.
4.1 Linear Probability Model
The LPM, Probit model and Logit model belong to a family of Binary Response Models (BRM). They specify a relationship between the explanatory variables and a dependent binary variable, which can take only two outcomes. In this section, we provide some intuition behind the LPM.
$$y = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k + u \qquad (2)$$

Model (2) is a typical multiple linear regression model. However, since $y$ is a binary qualitative variable – instead of a continuous quantitative variable – the interpretation changes; $\beta_j$ can no longer be interpreted as the change in $y$ given a one-unit change in $x_j$, keeping the other variables fixed.³ In the case of a BRM, $y$ either does not change, changes from one to zero, or changes from zero to one (Wooldridge, 2016).

³ Notation: $\Delta y$ is read as 'change in y'.

Nonetheless, $\beta_j$ still has a useful interpretation. To understand this intuitively, we first take the conditional expectation of (2) and, assuming that the zero conditional mean assumption holds, we obtain equation (3):

$$E[y \mid x] = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k, \qquad (3)$$

where $x$ is a shorthand for all the explanatory variables. Second, it is important to observe that $y$ follows the Bernoulli distribution, which means that the expected value of $y$ is merely the probability of 'success'; see Appendix B for further details. By 'success' and 'failure', we explicitly mean a customer who is, and who is not, financially vulnerable, respectively. It can now be seen that the expected value of $y$ is the probability of 'success' due to the properties of the Bernoulli distribution. Therefore, equation (3) can also be written as

$$P(y = 1 \mid x) = \beta_0 + \beta_1 x_1 + \dots + \beta_k x_k. \qquad (4)$$

Equation (4), called the LPM, conveys that the probability of 'success' is a linear function of the explanatory variables. In this form, the interpretation is straightforward: $\beta_j$ measures the change in the probability of 'success' when we change $x_j$ by one unit (Bartoszyski, 2008; Wooldridge, 2016).
Out-of-sample prediction in the context of the LPM is not a hard task; after we estimate the population intercept and slope coefficients by Ordinary Least Squares (OLS) – see the next section for more detail – we obtain the following estimated equation:⁴

$$\widehat{P}(y = 1 \mid x) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \dots + \hat{\beta}_k x_k. \qquad (5)$$

⁴ Notation: the 'hat' symbol is used for OLS estimates and the 'tilde' symbol is used for MLE estimates in this paper.

From the estimated equation (5), it is apparent that for some combinations of the explanatory variables the predicted probability can be greater than one or less than zero; it would not make any sense if the predicted probability that a customer is financially vulnerable were 109% or -12%, yet the specification of the LPM allows for this possibility. Another limitation is that each explanatory variable implies a constant marginal effect on the dependent variable. For instance, we would expect that obtaining an additional master's degree decreases the probability of being financially vulnerable, but not by a constant margin; rather, there should be a diminishing effect of obtaining more degrees, and the LPM does not account for that. The final, third drawback of the LPM stems from the fact that heteroskedasticity is present in the error term by construction, see Appendix B. Unlike in most cases, the good thing is that we know the exact functional form of how the variance of the error term changes with the values of the explanatory variables. The bad thing is that applying Weighted Least Squares (WLS) – which would normally provide a more efficient estimator when heteroskedasticity is present and its functional form is known – is not advisable in the case of a binary dependent variable. The reason is that it requires imposing an arbitrary threshold on the predicted probabilities, and in most applications this causes more harm than good. It is acceptable to use OLS with robust standard errors to correct for the violation of the homoskedasticity assumption (Wooldridge, 2016). For these reasons, OLS with robust standard errors is used in this paper.

In summary, only the first two shortcomings could be problematic. Nevertheless, the LPM works well if we take values of the explanatory variables close to the sample averages (Wooldridge, 2016). We can partially mitigate the problem of constant marginal effects by specifying the model to include quadratic relationships where it makes sense. Nonetheless, the best remedy is to also estimate the Probit and Logit models, whose predicted probabilities are bounded between zero and one. We shed some light on them in Section 4.3.
Now that we have explained the intuition behind the LPM, we present the actual LPM that we use in our study, whose results are discussed in Chapter 5: Results & Discussion:

$$vulnerable = \beta_0 + \beta_1\,age + \beta_2\,balance + \beta_3\,secondary + \beta_4\,tertiary + \beta_5\,housing + \beta_6\,pers + \beta_7\,married + u \qquad (6)$$

In model (6), the common terminology holds: vulnerable is the dependent binary variable, $\beta_0, \dots, \beta_7$ are the population intercept and slopes that we want to estimate, and the error term is denoted by $u$. Since we want to try more than one dependent variable – vulnerable, default and unemployed – model (6) will be estimated three times for the LPM specification.
Note that in Section 4.5, model (6) is augmented to include the quadratic term of
age to test certain hypotheses.
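As a sketch of how model (6) can be estimated by OLS with heteroskedasticity-robust standard errors: the estimates reported in Chapter 5 were produced in Stata, and the column names and education dummies below are assumptions based on Section 3.2.

```python
import pandas as pd
import statsmodels.formula.api as smf

df = pd.read_csv("dataset_1.csv")  # assumed to contain edu_secondary / edu_tertiary dummies

lpm = smf.ols(
    "vulnerable ~ age + balance + edu_secondary + edu_tertiary"
    " + housing + pers + married",
    data=df,
).fit(cov_type="HC1")  # heteroskedasticity-robust standard errors

print(lpm.summary())
# Some fitted 'probabilities' may fall outside [0, 1] -- the first drawback of the LPM.
print(lpm.fittedvalues.describe())
```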
4.2 OLS and the classical linear model assumptions
In the previous section, it was briefly mentioned that the OLS estimation method is used to estimate an LPM. We introduce this method here and also discuss the classical linear model (CLM) assumptions under which OLS is a suitable estimation method (Wooldridge, 2016). Wherever possible, the assumptions are tested on our dataset to see whether they hold or are violated. We are aware that a discussion of the extent to which a certain assumption holds is not usually included in the methodology part of a paper; however, we believe it makes sense to have everything about assumptions in one chapter, especially since we have assumptions not only about OLS but also about MLE in Section 4.4.
Firstly, we can express model (6) in the following matrix form:

$$y = X\beta + u, \qquad (7)$$

where $y$ is an $n \times 1$ vector containing all 43,193 observations of the dependent variable; $X$ is an $n \times (k+1)$ matrix containing all the observations of the 7 explanatory variables together with a unity vector; $\beta$ is a $(k+1) \times 1$ vector containing the 7 population slopes and the intercept; and $u$ is an $n \times 1$ vector of error terms.

The OLS estimator can be written as follows (Wooldridge, 2016):

$$\hat{\beta} = (X'X)^{-1}X'y. \qquad (8)$$
Substituting the estimates back into the original form, we obtain the following estimated equation:

$$\widehat{P}(vulnerable = 1 \mid x) = \hat{\beta}_0 + \hat{\beta}_1\,age + \hat{\beta}_2\,balance + \hat{\beta}_3\,secondary + \hat{\beta}_4\,tertiary + \hat{\beta}_5\,housing + \hat{\beta}_6\,pers + \hat{\beta}_7\,married. \qquad (9)$$
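The following is a small numerical illustration of the matrix estimator (8) on made-up data; it is not the paper's estimation, merely a check that $(X'X)^{-1}X'y$ recovers the coefficients.

```python
import numpy as np

rng = np.random.default_rng(1)
n, k = 1_000, 3
X = np.column_stack([np.ones(n), rng.normal(size=(n, k))])  # unity vector plus k regressors
beta_true = np.array([0.2, 0.5, -0.3, 0.1])
y = X @ beta_true + rng.normal(scale=0.1, size=n)

# beta_hat = (X'X)^{-1} X'y, computed without forming the inverse explicitly.
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
print(beta_hat)  # close to beta_true
```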
In the next paragraphs, we will state and test the CLM assumptions. The properties
of OLS estimator (8) change depending on how many, and to what extent, these
assumptions hold in our data (Wooldridge, 2016).
Assumption CLM.1 Linear in Parameters. This simply means that our population model needs to be specified as linear in the parameters. Note that the explanatory variables can still enter non-linearly under this assumption.
Our Linear Probability Models are all linear in parameters.
Assumption CLM.2 Random Sample. If we want results that are representative of the whole population, we need a random sample that follows the population model.
This is not the case, as we discussed in Chapter 3: Data. Without a random sample from the population of interest, for example all UK banks, we cannot be certain whether the results we find in the next chapter hold only for our sample or also more broadly. We can only argue that vulnerability is a universal phenomenon that does not change too much across banks or even countries; therefore, the results could still be informative for the overall UK banking sector.
Assumption CLM.3 No Perfect Collinearity. It has to hold that none of the explanatory variables is constant and that there is no perfect linear relationship amongst the explanatory variables; in other words, $X$ has to be a full-rank matrix.
We made sure that this assumption holds by not including constant explanatory variables and by avoiding the dummy variable trap. To check whether there is strong or even perfect collinearity, the Variance Inflation Factor (VIF) can be used. The rule of thumb is that if VIF < 10, there should not be any serious issue with multicollinearity or perfect collinearity (Kutner et al., 2005). All our LPMs are within the stated boundaries.
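A sketch of the VIF check, assuming the LPM regressors from Section 4.1 and the cleaned dataset_1.csv, might look as follows.

```python
import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

df = pd.read_csv("dataset_1.csv")
X = sm.add_constant(df[["age", "balance", "edu_secondary", "edu_tertiary",
                        "housing", "pers", "married"]])

# Rule of thumb: VIF >= 10 would signal a multicollinearity problem.
for i, name in enumerate(X.columns):
    print(f"{name}: VIF = {variance_inflation_factor(X.values, i):.2f}")
```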
Assumption CLM.4 Zero Conditional Mean. This assumption is the most important one with regard to the unbiasedness and consistency of the OLS estimator (8). It states that the conditional expected value of the error term is zero given any values of the explanatory variables. The most common causes of violation of this assumption are measurement error, omitted variable bias and functional form misspecification. If any of these is present, endogeneity arises and the OLS estimator is biased and inconsistent, even for a large number of observations.
Without having an instrumental variable, there is no option to test for omitted
variable bias. The best approach is to include as many relevant explanatory variables in
the model as possible and hope that we did not exclude an important one that would be
32
correlated with the rest of regressors. Moreover, we have to hope that the dataset does not
contain measurement errors in large magnitudes. As we discussed in Chapter 3, the source
seems to be reliable and hence this should not be a problem. Also, descriptive statistics
and scatter plots did not reveal any extreme outliers.
The only type of violation of the zero conditional mean assumption that we can test is functional form misspecification. This was done with the Ramsey RESET test. Unfortunately, the null hypothesis of no functional form misspecification was rejected at the 5% significance level for all LPMs. We tried to include various interactions of the dummy variables with balance and age, as well as taking the log of balance, but the test still reached the same conclusion. It is also possible that we are missing a relevant explanatory variable, because functional form misspecification is a form of omitted variable bias. At this point, the only remedy would be to collect more relevant data; however, such data are not available. Because of this, the OLS estimator (8) is biased and inconsistent to some extent.
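For reference, the RESET test could be reproduced along the following lines (our own test was run in Stata). The linear_reset helper in statsmodels adds powers of the fitted values and tests their joint significance; df and the column names are assumptions carried over from the earlier sketches.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import linear_reset

X_cols = ["age", "balance", "secondary", "tertiary", "married", "housing", "pers"]
X = sm.add_constant(df[X_cols].astype(float))
ols_res = sm.OLS(df["vulnerable"].astype(float), X).fit()

# Ramsey RESET: a p-value below 0.05 indicates functional form misspecification.
print(linear_reset(ols_res, power=3, use_f=True))
```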
Assumption CLM.5 Homoskedasticity For this assumption, the variance of
the error term has to be constant given any value of the explanatory variables. The
common ways to check whether the data exhibit heteroskedasticity are to conduct the White test or the Breusch-Pagan test.
As we said before, LPMs are always heteroskedastic and so we have to rely on the
law of large numbers (LLN) and central limit theorem (CLT) to obtain asymptotically
valid robust standard errors. The rule of thumb is that if our dataset contains at least 100
observations, the robust standard errors should be valid (Bartoszyński, 2008). The dataset
that we are working with has 43,193 observations.
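A sketch of how the heteroskedasticity check and the robust standard errors could be obtained with statsmodels is given below (the reported results were produced in Stata); HC1 denotes the degrees-of-freedom-adjusted White covariance estimator, and df is the assumed DataFrame from the earlier sketches.

```python
import statsmodels.api as sm
from statsmodels.stats.diagnostic import het_breuschpagan

X_cols = ["age", "balance", "secondary", "tertiary", "married", "housing", "pers"]
X = sm.add_constant(df[X_cols].astype(float))
y = df["vulnerable"].astype(float)

# Breusch-Pagan test on the plain OLS residuals.
ols_res = sm.OLS(y, X).fit()
lm_stat, lm_pval, _, _ = het_breuschpagan(ols_res.resid, X)
print(f"Breusch-Pagan LM = {lm_stat:.1f}, p-value = {lm_pval:.4g}")

# Re-estimate the LPM reporting heteroskedasticity-robust (HC1) standard errors.
robust_res = sm.OLS(y, X).fit(cov_type="HC1")
print(robust_res.summary())
```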
Had this assumption held, together with the previous one, the OLS estimator (8)
would have had the smallest variance amongst all linear, unbiased estimators
(Wooldridge, 2016).
Assumption CLM.6 Normality of Error term In order for this assumption to
hold, the error term has to be normally distributed with zero mean and variance σ², i.e. u ~ N(0, σ²), and it also has to be independent of the other regressors. Informally, this
assumption can be tested by plotting the residuals and comparing the empirical histogram
with the normal distribution, see Appendix B. Formally, we can conduct the Shapiro-
Wilk test.
The histogram does not resemble a normal distribution, and the null hypothesis of the Shapiro-Wilk test was rejected as well. Fortunately, with a large number of observations, this assumption can be disregarded. However, had we had no functional form misspecification, constant variance and normality of the error term, our estimator would have had the smallest variance amongst all unbiased estimators, which is the strongest property that the OLS estimator can have (Wooldridge, 2016).
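The residual checks could be sketched as follows (Appendix B contains the actual histogram; this Python version is purely illustrative and reuses robust_res from the previous sketch). Note that SciPy's Shapiro-Wilk implementation warns that the p-value is only approximate for samples larger than 5,000 observations.

```python
import matplotlib.pyplot as plt
from scipy import stats

resid = robust_res.resid  # residuals from the LPM fitted above

# Informal check: histogram of the residuals (cf. Appendix B).
plt.hist(resid, bins=50, density=True)
plt.title("LPM residuals")
plt.savefig("residual_histogram.png")

# Formal check: Shapiro-Wilk test of normality.
w_stat, p_value = stats.shapiro(resid)
print(f"Shapiro-Wilk W = {w_stat:.4f}, p-value = {p_value:.4g}")
```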
4.3 Probit and Logit Model
In Section 4.1, we saw that the LPM has certain drawbacks: the predicted probabilities can be above one or below zero, and the partial effects of the explanatory variables are constant.
These limitations can be overcome by using Probit or Logit model.
To better understand the intuition behind these two models, we can rewrite model
(6) as follows:
P(y = 1|x) = G(β₀ + β₁age + β₂balance + β₃secondary + β₄tertiary + β₅married + β₆housing + β₇pers),    (10)

where x is shorthand for all the explanatory variables; G(·) is, for now, an unspecified function; and P(y = 1|x) is the response probability.
We need to make sure that G(·) is a function that takes values between zero and one to eliminate the first drawback of LPM. Second, G(·) should be nonlinear to eliminate the second drawback (Wooldridge, 2016).
Naturally, the first functional form that comes to mind is the standard normal
cumulative distribution function (standard normal CDF) whose range is from zero to one
and domain is the entire real line. Taking the limits at minus and plus infinity, we would see that the function approaches, but never reaches, zero on the left and one on the right. This specification makes the predicted
probabilities bounded, eliminating the first limitation of the LPM. With G(·) as the standard normal CDF, model (10) can be explicitly rewritten as the Probit Model:

P(y = 1|x) = Φ(xβ) = ∫_{−∞}^{xβ} φ(z) dz,    (11)

where xβ = β₀ + β₁age + β₂balance + β₃secondary + β₄tertiary + β₅married + β₆housing + β₇pers. In Probit Model (11), Φ(·) is the standard normal CDF and φ(·) is the density of the standard normal.
Another functional form specification for G(·), which is used frequently and has properties similar to the standard normal CDF, is the logistic CDF. The main difference is that the tails of this distribution are heavier, allowing for a higher probability of extreme events (Wooldridge, 2016). Model (12) shows the Logit Model:

P(y = 1|x) = Λ(xβ) = exp(xβ) / [1 + exp(xβ)].    (12)

In Logit Model (12), Λ(·) denotes the logistic CDF and xβ has the same specification as before.
The main difference between Probit and Logit is that the error term has the
standard normal distribution in the former, and logistic distribution in the latter
(Wooldridge, 2016).
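In terms of implementation, both models can be fitted in one line each; the sketch below uses statsmodels (the results in Chapter 5 were produced in Stata), with df and the column names assumed as in the earlier sketches. The fitted objects also expose the AIC and pseudo R-squared used later for model comparison.

```python
import statsmodels.api as sm

X_cols = ["age", "balance", "secondary", "tertiary", "married", "housing", "pers"]
X = sm.add_constant(df[X_cols].astype(float))
y = df["vulnerable"].astype(float)

# Probit: G(.) is the standard normal CDF; Logit: G(.) is the logistic CDF.
probit_res = sm.Probit(y, X).fit()
logit_res = sm.Logit(y, X).fit()

print(f"AIC        Probit: {probit_res.aic:.0f}   Logit: {logit_res.aic:.0f}")
print(f"Pseudo R2  Probit: {probit_res.prsquared:.3f}   Logit: {logit_res.prsquared:.3f}")
```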
Because Logit and Probit models are nonlinear, the interpretation is more difficult,
and an additional assumption needs to be made in contrast to the LPM. To find the
marginal effects after estimating the models, we have to rely on calculus and an arbitrary
choice of values for the explanatory variables. The general estimated equation follows:
P̂(y = 1|x) = G(β̂₀ + β̂₁age + β̂₂balance + β̂₃secondary + β̂₄tertiary + β̂₅married + β̂₆housing + β̂₇pers).    (13)
For instance, if we are interested in the marginal effect of age on the probability
of being vulnerable, we take the first partial derivative with respect to age, see expression
(14). Note that unlike in LPM, the resulting expression still depends on the explanatory
variables and the estimated coefficients. In order to obtain a numerical answer, we need to arbitrarily set the values of the explanatory variables; we set them to their sample means (Wooldridge, 2016). The interpretation is then for the so-called average customer.
∂P̂(y = 1|x)/∂age = g(β̂₀ + β̂₁age + β̂₂balance + β̂₃secondary + β̂₄tertiary + β̂₅married + β̂₆housing + β̂₇pers) × β̂₁,    (14)

where g(·) denotes the derivative of G(·).
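With the models fitted as in the earlier sketch, the average-customer marginal effects of expression (14) could be obtained as follows; get_margeff(at="mean") evaluates the derivatives at the sample means of the explanatory variables.

```python
# Marginal effects evaluated at the sample means of the explanatory variables,
# i.e. for the so-called average customer of expression (14).
probit_margeff = probit_res.get_margeff(at="mean")
logit_margeff = logit_res.get_margeff(at="mean")

print(probit_margeff.summary())
print(logit_margeff.summary())
```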
In empirical papers, it is common to introduce more than one model and discuss
which one is the most optimal based on the data and application. We included the basic
LPM – introduced in the first section of this chapter – as a benchmark to which we can
compare the results from the more sophisticated binary response models and see whether
the findings differ substantially. In this section, we have discussed Logit and Probit
models and discovered that the main difference stems from a different assumption on the
distribution of the error term. Economists are more in favour of assuming that the error
term is normally distributed due to the existing economic theories, and properties of
standard normal random variables which are easier to work with. On the other hand, the
logistic distribution assumption for the error term has become more popular in finance
applications to ensure a more realistic probability weight for less likely events, and, in
general, staying more conservative (Wooldridge, 2016). Since, as of writing of this paper,
there is no theory on what distribution the error term for ‘vulnerability modelling’ should
have, we decided to estimate both models and compare the results. It is also good to point
out that the estimated marginal effects of Probit and Logit Models should basically be the
same in most binomial (one and zero) applications (Wooldridge, 2016). Nevertheless, some major differences can occasionally occur when we examine the tails of the distributions (Cox, 1966). For this reason, we decided to employ Akaike's information criterion (AIC) to choose the most appropriate model. There is still ongoing research into the effectiveness of various information criteria for differentiating between Probit and Logit models; however, Chen et al. (2010) found some evidence that the AIC can be used for an unbalanced dataset with more than 1,000 observations. Therefore, besides comparing the findings of the models, we are also going to report the AIC and see which of the models fits the data better; see Chapter 5: Results & Discussion.
4.4 MLE and its assumptions
Probit and Logit models are nonlinear functions in parameters, and due to this
reason, they cannot be estimated by OLS. In this section, we introduce the Maximum
Likelihood Estimation (MLE) method which is used to obtain the estimated Probit and
Logit equations. Later, a discussion will be held on the basic assumptions under which
MLE is appropriate to employ.
MLE has a good intuitive appeal. In essence, it amounts to finding the parameter values that maximize the probability of observing the sample we actually have (Wooldridge, 2016). The likelihood function conditional on the observations of our
explanatory variables can be expressed as:
L(β | {xᵢ}) = ∏_{i=1}^{43193} [G(xᵢβ)]^{yᵢ} [1 − G(xᵢβ)]^{1−yᵢ},    (15)

where xᵢβ = β₀ + β₁ageᵢ + β₂balanceᵢ + β₃secondaryᵢ + β₄tertiaryᵢ + β₅marriedᵢ + β₆housingᵢ + β₇persᵢ.
The MLE of the vector β, denoted by β̂, maximises likelihood function (15). Unfortunately, due to the nonlinear nature of the maximization problem, we cannot derive an analytical solution for β̂ by hand (Wooldridge, 2016). Stata will aid us in finding the solution numerically through an iterative process.
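To illustrate what this iterative search does, the sketch below maximises the Logit version of likelihood function (15) directly with SciPy, by minimising the negative log-likelihood from flat starting values. Packaged routines (Stata in our case) do the same with analytic gradients and better safeguards, so this is purely illustrative; X and y are assumed to be constructed as in the earlier sketches.

```python
import numpy as np
from scipy.optimize import minimize
from scipy.special import expit  # logistic CDF: Lambda(z) = exp(z) / (1 + exp(z))

def neg_log_likelihood(beta, X, y):
    """Negative log of likelihood function (15) with G taken to be the logistic CDF."""
    p = expit(X @ beta)
    p = np.clip(p, 1e-12, 1 - 1e-12)  # guard against log(0)
    return -np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

# X (with a constant column) and y as constructed in the earlier sketches.
beta_start = np.zeros(X.shape[1])  # starting values for the iterative search
result = minimize(neg_log_likelihood, beta_start, args=(X.values, y.values), method="BFGS")
print(np.round(result.x, 4))       # the numerically obtained MLE of beta
```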
The assumptions underpinning the MLE are very similar to the OLS ones.
Namely, CLM.2 Random Sample assumption has to hold if we want to generalize the
results to the whole population. Unfortunately, the dataset is identical, and so the conclusion is the same as before. Furthermore, for the MLE to be consistent we need a large number of observations and no omitted variables, measurement error or functional form misspecification. As before, we found functional form misspecification, and therefore the MLE is, to some extent, inconsistent. An advantage of the maximum likelihood method is that it is based on the distribution of the dependent variable given the regressors, so heteroskedasticity is automatically accounted for (Wooldridge, 2016). If our MLE were consistent and asymptotically normal, we would also have asymptotic efficiency (DeGroot et al., 2002). As with the classical linear model assumptions, omitted variable bias cannot be tested directly.
4.5 Stating hypotheses
With appropriate data and an econometric model, we can empirically attempt to answer many questions. Regarding financial vulnerability, one might be interested in knowing whether younger and older bank customers are more likely to be financially vulnerable than middle-aged customers. If so, there must be a positive quadratic relationship between age and vulnerable, which, in turn, raises another question: at what age are bank customers the least likely to be vulnerable?
In order to answer these two questions, model (6) was augmented to include a
quadratic term for age:
vulnerable = β₀ + β₁age + β₂age² + δ′x + u,    (16)

where x denotes the rest of the explanatory variables in model (6) and δ is the corresponding vector of slopes. Currently, the explanatory variables of interest are age and its square.
First, we need to know whether the coefficients are statistically different from
zero. This can be tested empirically, see expressions (17) and (18) below:
H₀: β₂ = 0   against   H₁: β₂ ≠ 0,    (17)

H₀: β₁ = 0, β₂ = 0   against   H₁: H₀ is not true.    (18)
Expression (17) states the null hypothesis that age² has no influence on vulnerable against the alternative that it does. The t-test will be employed to test this hypothesis in the next chapter. Should the null hypothesis be rejected at the 5% significance level, we infer that there is a parabolic relationship between vulnerable and age, provided that the second null hypothesis is also rejected at the 5% significance level. Expression (18) will be tested by an F-test. Moreover, should β̂₂ > 0 and β̂₁ < 0 occur, we can infer that the relationship is a convex function and a point of minima can be found, see expression (19):
age* = |β̂₁ / (2β̂₂)|,    (19)
where age* denotes the turning point. If the second derivative is positive, the turning point
of the function is the point of minima, and vice versa.
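As a sketch of how hypotheses (17) and (18) and turning point (19) could be evaluated (the reported tests were run in Stata), the snippet below adds the squared term to the LPM, reads off the t-test p-value, runs the joint F-test, and computes age*; df and the column names remain assumptions carried over from the earlier sketches.

```python
import statsmodels.api as sm

# Augmented LPM of equation (16): add the squared age term.
df["age2"] = df["age"].astype(float) ** 2
X_cols = ["age", "age2", "balance", "secondary", "tertiary", "married", "housing", "pers"]
Xq = sm.add_constant(df[X_cols].astype(float))
res_q = sm.OLS(df["vulnerable"].astype(float), Xq).fit(cov_type="HC1")

# (17): t-test that the coefficient on age^2 is zero.
print(f"t-test on age2: p-value = {res_q.pvalues['age2']:.4g}")

# (18): F-test that the coefficients on age and age^2 are jointly zero.
print(res_q.f_test("age = 0, age2 = 0"))

# (19): turning point age* = |beta_1 / (2 * beta_2)|.
b1, b2 = res_q.params["age"], res_q.params["age2"]
print(f"turning point: age* = {abs(b1 / (2 * b2)):.1f}")
```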
Should the first hypothesis be proven right, it would confirm the FCA's claim that being younger or older is a vulnerability risk factor, see Chapter 2. The second hypothesis is of interest because every bank needs to manage its limited resources in the most efficient way. If there is only a limited budget allocated to identifying and treating vulnerable customers, the best course of action would be to focus first on the younger and older customers and leave the middle-aged customers until more resources are
available. However, this section only stated the questions and hypotheses. Whether our
expectation of this relationship is true or false remains to be seen in the next chapter where
we present the results.
5. Results & Discussion
We will only interpret and discuss estimation results for the dependent variable vulnerable. We encourage the interested reader to see Appendix C, where estimation results for unemployed and default are presented.
Despite the fact that it is more common to first interpret and comment on the
estimated models and then do hypothesis testing, we decided to first do the hypothesis
testing since we want to use the results in the final estimated models. Therefore, Section
5.1 is about the results of hypotheses testing and Section 5.2 discusses the estimated
models.
5.1 Hypotheses testing results
In the previous section, we asked whether there is a positive quadratic relationship
between age and vulnerable, and if so, what is the point of minima. Estimating model (16), we obtained the following estimated equation:

vulnerablê = 1.178 − 0.053 age + 0.0007 age² + δ̂′x,    (20)
             (0.026)  (0.001)     (0.00001)
where robust standard errors are in parentheses. Both age and age² are highly statistically significant; their p-values are below 0.001. The same holds for the F-test of joint significance, whose p-value is below 0.0001. Moreover, the signs of the estimated coefficients convey that the relationship is indeed a positive quadratic one. Therefore, we have strong empirical evidence that younger and older customers are more likely to be financially vulnerable than middle-aged customers.
To see this graphically, we plotted a quadratic fitted line of age and vulnerable,
while keeping the other variables fixed. Figure 5.1.1 depicts this parabolic relationship.
Figure 5.1.1: Scatter plot of age and vulnerable and quadratic fitted line
Now that we know that a positive quadratic relationship exists, it is possible to calculate the point of minima:

age* = |β̂₁ / (2β̂₂)| = |−0.053 / (2 × 0.0007)| ≈ 38.    (21)
This means that a bank customer who is around 38 years old is the least likely to
be financially vulnerable; customers who are around 20 years of age are more likely to be financially vulnerable; and with each additional year after 38, the customer becomes more and more susceptible to financial vulnerability. An observant reader will notice that the fitted line places no bound on how vulnerable a person can be predicted to be, nor does the predicted probability stop at 100%. This is one of the limitations of the LPM that we
discussed in the previous chapter. However, for the purposes of the two questions we
asked, this model is more than sufficient, as we are not trying to test hypotheses about the
boundaries of age or vulnerable.
5.2 Regression results
In this final section, we are going to look at the regression results. Table 5.2.1
contains the estimated coefficients, (robust) standard errors, number of observations, R-
squared, Akaike’s information criteria, explanatory variables in the first left column, and
the names of the estimated models in the first row. Furthermore, Appendix C contains
figures which graphically show marginal effects of the estimated coefficients.
Table 5.2.1: Estimation Results of LPM, Probit, Logit for dependent variable vulnerable

          LPM       Probit     Logit
AIC       –         25094      24989

Note: Dependent variable is vulnerable. Standard errors are in parentheses. For LPM, robust standard errors are reported. For Probit and Logit, standard errors are reported. Estimates and standard errors are rounded to three decimal places, apart from balance and age². For LPM, the basic R-squared is reported. For Probit and Logit, Pseudo R-squared is reported. Statistical significance is denoted by * p < 0.05, ** p < 0.01, *** p < 0.001.
LPM In the case of the LPM, the interpretation is not difficult. Apart from age, all the
estimated coefficients are the marginal effects of the explanatory variables.
Overall, the signs of the estimates turned out to be as expected. The only one that
might be slightly counter-intuitive is the negative sign for housing; it means that a bank
customer who has a mortgage is 6.2 percentage points less likely to be financially
vulnerable, ceteris paribus. At first sight, it does not make much sense because if someone
has a mortgage, they are perhaps more vulnerable to fluctuations of the interest rate, the
possibility of not being able to pay instalments, or in general to the economic cycle. On
the other hand, should a bank customer obtain a mortgage, it is likely that the person has
a high credit score which means he is better prepared for difficult financial circumstances.
This effect most likely dominates and hence we have a negative sign.
All the estimates are highly statistically significant except for pers, the dummy indicating that a bank customer has a personal loan.
The following are the interpretations of the marginal effects: A customer who
increases her bank balance by €10,000 is approximately 2 percentage points less likely to
be financially vulnerable, ceteris paribus. A customer who graduated from high school is
2.2 percentage points less likely to be financially vulnerable compared to someone who
has only elementary education, keeping other factors fixed. A customer who has a
university degree is 6.2 percentage points less likely to be financially vulnerable in
comparison to a person who only attended elementary school, ceteris paribus. A customer
who is married is 2.7 percentage points less likely to be financially vulnerable compared
to a person who is divorced or single, ceteris paribus. With each additional year of age, the probability of being financially vulnerable increases by about 0.41 percentage points, evaluated at the sample mean age.⁵
R-squared can be interpreted as the proportion of the variation in the dependent
variable that is explained by all the explanatory variables. In social sciences such as
economics and finance, 17.4% is not the highest number but it is still reasonable. It is
important to note that low R-squared does not influence the quality of the estimated
coefficients. It merely conveys how well the estimated model fits the observations
(Wooldridge, 2016).
We talked about the limitations of LPM in Chapter 4. One of them is that the
predicted probabilities can be above one or below zero. Of all 43,193 bank customers, 187 (0.43% of the whole sample) were predicted to be vulnerable with a probability above 100% and 2,016 (4.7% of the whole sample) with a probability below 0%. The customers predicted below 0% have in common that they are around 38 years old and have a very high average yearly balance on their bank account. The opposite holds true for the customers predicted above 100%: they are either around 18 years old or over 80 and have around €0 in their bank account.

⁵ For an 'average-aged' customer in our sample: −0.053 + 2 × 0.0007 × 40.8 ≈ 0.0041, i.e. about 0.41 percentage points. Note that, even for the LPM, the first derivative with respect to age still depends on age when a quadratic term is included; hence, we plug in the average age (40.8) to obtain the marginal effect.
Predicting the probability that a new bank customer will be financially vulnerable can be an indispensable tool for banks. For instance, consider a customer who is 72 years old, has €30,000 in their bank account, has a university degree, is married, and has no mortgage or personal loan. According to the LPM, the predicted probability of this customer being financially vulnerable is about 84%.⁶ A bank can then set an arbitrary threshold and ask trained staff to call those customers whose predicted probability exceeds, for example, 85%.
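A minimal sketch of such a screening rule is given below, using the rounded LPM coefficients from footnote 6 for the hypothetical 72-year-old customer described above; the predicted_prob column is an assumed name for stored model predictions, not part of the original dataset.

```python
# Predicted probability from the augmented LPM, using the rounded coefficients of footnote 6.
# secondary, housing and pers are zero for this customer and are therefore omitted.
def lpm_probability(age, balance, tertiary, married):
    return (1.18 - 0.053 * age + 0.0007 * age**2
            - 0.000002 * balance - 0.06 * tertiary - 0.03 * married)

# Hypothetical customer: 72 years old, EUR 30,000 balance, university degree, married.
p = lpm_probability(age=72, balance=30_000, tertiary=1, married=1)
print(f"Predicted probability of being vulnerable: {p:.2f}")  # roughly 0.84

# Screening rule: flag customers whose predicted probability exceeds a chosen threshold.
THRESHOLD = 0.85
flagged = df[df["predicted_prob"] > THRESHOLD]  # assumes predictions were stored in df beforehand
print(f"{len(flagged)} customers to be contacted first")
```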
Probit and Logit As we mentioned in the previous chapter, Probit and Logit
models are more difficult to interpret because the marginal effects are conditional on all
the explanatory variables and the estimates. Therefore, we cannot directly compare the
estimated coefficients of LPM, Logit and Probit in Table 5.2.1.
For the purposes of comparisons, Table 5.2.2 was created, see Appendix C for
code and more comments. We used the average values of the explanatory variables in the
sample to compute the marginal effects of the Probit and Logit models. Though the so-called average-customer approach is not entirely realistic (one cannot, for example, be married with a proportion of 0.6), the marginal effects for an average customer still provide a lot of insight.
Table 5.2.2: Probit and Logit: marginal effects for an average customer

          Balance      Secondary    Tertiary     Married      Housing      Pers       AIC      R²
Probit    −0.022***    −0.017***    −0.045***    −0.027***    −0.045***    −0.002     25094    0.181
Logit     −0.020***    −0.016***    −0.042***    −0.025***    −0.041***    −0.003     24989    0.185

Note: See Figure D.1.1 in Appendix C for comments on how this was computed. Balance was rescaled; it is in €10,000 in this table. Marginal estimates are rounded to three decimal places. Statistical significance is denoted by * p < 0.05, ** p < 0.01, *** p < 0.001.
Comparing the marginal effects of LPM, Probit and Logit estimated models, we
can observe that the values are almost identical. Consistently, for Probit and Logit all marginal effects are slightly smaller in magnitude than for the LPM. For instance, in the Probit model, an
average customer is 4.5 percentage points less likely to be financially vulnerable if he has
a university degree compared to a customer who only has an elementary education, ceteris
paribus. We recall that it was 6 percentage points in LPM.
⁶ vulnerablê = 1.18 − 0.053 × 72 + 0.0007 × 72² − 0.000002 × 30,000 − 0.06 − 0.03 ≈ 0.84, i.e. about 84%.
The R-squared for LPM, Probit and Logit are 0.174, 0.182 and 0.185 respectively.
Generally, the higher the R-squared, the better. Moreover, the AIC for Probit and Logit
are 25094 and 24989 respectively. In this case, the smaller the AIC, the better the model fits the data. In conclusion, based on both R-squared and AIC, we can infer that the Logit model fits the data best.
Chapter 6
6. Conclusion
In order to improve our understanding of the important risk factors that influence
the probability of a given customer being financially vulnerable, we obtained a dataset of
Portuguese bank customers and their characteristics. After running regressions, we
obtained the marginal effects of each characteristic. One interesting result is that,
according to the LPM estimates, a customer who has at least one university degree is 6
percentage points less likely to be vulnerable than a customer who has only elementary
education, ceteris paribus. Further, we empirically showed that younger and older
customers are more likely to be vulnerable and that a person around 38 years old should
be the least vulnerable.
The contribution lay in applying LPM, Probit and Logit models in the context of
vulnerability predictions, which, to the best of our knowledge, has not been done before.
The second contribution came from the cleaning and processing of a bank dataset which
was originally collected for marketing purposes.
The results of this paper can be used to broaden the understanding of what the
more important factors behind vulnerability are. Banks which are operating on limited
resources can perform probability predictions on all their customers and first call
customers whose likelihood of being vulnerable is, for instance, more than 85%. As the budget allocated to vulnerability grows, this arbitrary threshold can be lowered accordingly. Moreover, since customers around 38 years old are the least likely to be vulnerable, a bank with limited resources can focus first on identifying and helping younger and older customers.
The limitations can be categorized into three types in this paper. The first type is
concerned with the dataset that was used. As we mentioned before, the dataset was
originally used for different purposes and as such some of the variables are proxies rather
than the actual values that we would like to have. The second type is about the three
econometrics models. For example, the LPM allows some of the predicted probabilities
to be more than one or less than zero. On the other hand, Logit and Probit models correct
for this limitation but introduce problems with interpretation of the marginal effects. The
last type of the limitations is about the estimation methods. Both OLS and MLE were
shown to be inconsistent to some extent due to functional form misspecification.
Unfortunately, we do not know the magnitude of the inconsistency. We can only hope
that it is not too large to completely invalidate our results.
Further research can look into including more relevant explanatory variables in
the three models to decrease the inconsistency of the estimators. Customer’s attributes
such as physical health, disability, financial skills or income would be very useful to have.
Moreover, obtaining a random sample from many UK banks would be indispensable for
generalising the results to the whole population. Finally, there is a lot of potential for classification methods such as decision trees, which would rank the determinants of vulnerability from most to least important and classify customers as either vulnerable or not, instead of assigning each of them a predicted probability of being vulnerable.
Bibliography
Age UK. (2014). Improving Later Life: Vulnerability and Resilience in Older People. Retrieved from https://www.ageuk.org.uk/globalassets/age-uk/documents/reports-and-publications/reports-and-briefings/health--wellbeing/rb_april15_vulnerability_resilience_improving_later_life.pdf

Bartoszyński, R., & Niewiadomska-Bugaj, M. (2008). Probability and Statistical Inference (2nd ed.). John Wiley & Sons, Inc.

Barclays Becomes First UK High Street Bank to Enable Customers to Stop Transactions at Chosen Retailers to Give Vulnerable Customers Greater Control Over Their Money. (2018, December 11). Targeted News Service; Washington, D.C. Retrieved from http://search.proquest.com/docview/2156738248/citation/B4F6D1E726144C4APQ/1

Cambridge Dictionary. (n.d.). Vulnerable. Retrieved from https://dictionary.cambridge.org/dictionary/english/vulnerable

Cancer Research UK. (2015). Cancer Statistics for the UK. Retrieved from https://www.cancerresea