Whitepaper - Ready Reckoner: Probability of Default Modeling

Comparison of Credit Scoring Models for Probability of Default Estimation

White Paper by Rahul Dutta

Uploaded by bridgei2i-analytics-solutions, 16-Jul-2015


Fierce competition among banks and other financial institutions, as well as the recent global financial crisis and the new regulatory environments that followed, have brought credit-scoring models for business and personal loans to prominence. Accurate estimation of the credit risk associated with customers has become paramount, as this information assists financial institutions in deciding whether to grant credit. As the demand for credit products rapidly increases and lenders consistently face potential financial losses from customers who are likely to default, it is important for lenders to identify the main risk factors contributing to default, and to predict the Probability of Default (PD) as accurately as possible. Several modelling techniques are available for this analysis. In this paper we compare different credit scoring models for Probability of Default and Loss Given Default (LGD) estimation.

Executive Summary

Probability of Default and Loss Given Default analysis:

Probability of Default/Loss Given Default analysis is a method used, generally by larger financial institutions, to calculate expected loss. A probability of default is assigned to a specific risk measure, per guidance, and represents the percentage expectation of default, most frequently measured by assessing past-due history. Loss Given Default measures the expected loss net of any recoveries, expressed as a percentage, and is unique to the industry or segment.

When combined with the variable Exposure at Default (EAD), the current balance at default, the expected loss calculation is deceptively simple:

Expected Loss (EL) = PD × LGD × EAD

While the equation itself may be simple, deriving the variables requires in-depth analysis. PD and LGD represent the past experience of a financial institution, but also what the institution expects to experience in the future. Expected loss, being a function of EAD, PD and LGD, depends on the estimates of these components. EAD, PD and LGD can be estimated using various techniques and at different levels. EAD can be estimated using simple OLS techniques or Survival Analysis, while LGD can be predicted using either a regression model or decision trees. The most important part of this calculation is the estimation of the Probability of Default. The common methodology used to estimate PD is Logistic Modelling, which predicts whether a customer will default on a particular debt. For example, if a bank provides auto loans to its customers, this method can predict the probability that a particular customer will default on that loan, and given that a customer is a defaulter, LGD models help identify the loss amount for the bank.

Now, the estimation of all these components can be done at different levels. While Probability of Default can be calculated for every customer, LGD and EAD can be calculated at an aggregated level, e.g., after calculating the probability of default for individuals, EAD and LGD can be calculated for different time periods, or for groups of individuals with the same loan amount or the same FICO score, etc. So, finally, expected loss is calculated at an aggregated level.
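As a minimal sketch of this aggregated calculation, assuming hypothetical segment-level PD, LGD and EAD figures (none of these numbers come from the paper):

```python
# Aggregated expected loss: EL = PD x LGD x EAD per segment, then summed.
# All segment figures below are hypothetical.

segments = [
    # (segment, PD, LGD, EAD in $)
    ("FICO 675-705, auto", 0.04, 0.55, 1_200_000),
    ("FICO 706-740, auto", 0.02, 0.50, 2_500_000),
]

def expected_loss(pd_, lgd, ead):
    """Expected loss for one segment: PD x LGD x EAD."""
    return pd_ * lgd * ead

portfolio_el = sum(expected_loss(pd_, lgd, ead) for _, pd_, lgd, ead in segments)
print(round(portfolio_el, 2))  # total expected loss across segments
```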


Instead of calculating Expected Loss at an aggregated level, we can also calculate it for individuals in the following way.

Let us assume we want to calculate the total expected loss over the next one year from an auto loan given by a bank to its customers. For the i-th individual, let PD_it be the probability of default within t months, EAD_it the exposure at default, and R_it the recovery rate. So, for individual i at time t, the expected loss is PD_it × EAD_it × (1 − R_it), and the expected loss within the next one year for individual i is:

EL_i = Σ_{t=1}^{12} PD_it × EAD_it × (1 − R_it)

Clearly, this method gives the expected loss for each individual. EAD_it and R_it can be estimated in the following way. EAD_it can be expressed as a function of i and t. For example, if an individual's total loan amount is $100 with an interest rate of 10% and a tenure of one year, at a monthly instalment of $11, and the individual defaults after 3 months, the exposure at default is EAD = $100 − 3 × $11 = $67.

R_it can likewise be calculated from financial quantities that are available as functions of i and t.
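The per-individual calculation above can be sketched as follows. The monthly PDs and recovery rates are hypothetical, and the exposure schedule simply nets instalments paid against the loan amount, floored at zero (an assumption), which reproduces the $67 figure from the example:

```python
# Per-individual expected loss EL_i = sum_t PD_it * EAD_it * (1 - R_it).
# Monthly PDs and recovery rates are hypothetical; the exposure schedule nets
# instalments paid against the loan amount, floored at zero (an assumption).

LOAN_AMOUNT = 100.0      # total loan amount ($)
INSTALMENT = 11.0        # monthly instalment ($)
TENURE = 12              # months

def ead(t):
    """Exposure at default after t monthly payments."""
    return max(0.0, LOAN_AMOUNT - t * INSTALMENT)

# As in the example: default after 3 months leaves $100 - 3 x $11 = $67.
assert ead(3) == 67.0

pd_it = [0.01] * TENURE  # hypothetical monthly default probabilities
r_it = [0.40] * TENURE   # hypothetical monthly recovery rates

el_i = sum(pd_it[t - 1] * ead(t) * (1.0 - r_it[t - 1])
           for t in range(1, TENURE + 1))
print(round(el_i, 2))
```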

The Probability of Default PD_it for an individual can be predicted in several ways. The following are techniques to calculate the Probability of Default (PD).

Logistic Regression Model Structure:

The Logistic Regression takes the following form:

ln( p / (1 − p) ) = β_0 + β_1 x_1 + β_2 x_2 + … + β_K x_K

where p is the probability of the event occurring and x_1, …, x_K are the K independent variables, each weighted by a coefficient β_k.

The above equation can be written as:

p = 1 / ( 1 + exp( −(β_0 + β_1 x_1 + … + β_K x_K) ) )

Interpretation: In logistic regression, a unit change in one factor multiplies the odds of the event by a fixed amount (the exponential of its coefficient), regardless of the levels of the other factors.

Data used: To identify defaulters by this method, the data used are usually account-level (customer-level) data, where at any given point in time the model predicts, given a customer's known predictor values, the chance that the customer will be a defaulter. Predictors are usually the following:

- Loan amount issued
- Asset amount
- Loan-to-asset ratio
- No. of months on books
- Down payment made
- No. of months employed
- FICO score
- Total amount of loan taken to date, etc.
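A minimal sketch of scoring with such a model follows. The coefficients and predictor values are made up for illustration, not estimates from real loan data; the final check demonstrates the multiplicative-odds interpretation:

```python
# Scoring with a fitted logistic model: p = 1 / (1 + exp(-(b0 + sum b_k x_k))).
# Coefficients and predictor values are made up for illustration only.

import math

def pd_score(x, betas, intercept):
    """Probability of default from a logistic model."""
    z = intercept + sum(b * xi for b, xi in zip(betas, x))
    return 1.0 / (1.0 + math.exp(-z))

# Hypothetical predictors: loan-to-asset ratio, months on books, scaled FICO.
betas = [1.8, -0.02, -0.9]
intercept = -2.0

p = pd_score([0.7, 24, 1.1], betas, intercept)
assert 0.0 < p < 1.0

# Interpretation check: raising the first predictor by one unit multiplies the
# odds p/(1-p) by exp(1.8), regardless of the other predictors' levels.
odds = lambda q: q / (1.0 - q)
p2 = pd_score([1.7, 24, 1.1], betas, intercept)
assert abs(odds(p2) / odds(p) - math.exp(1.8)) < 1e-9
```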


Disadvantage of this procedure:

Even though this method takes into account a significant amount of information and allows straightforward interpretation of the predictor coefficients, the downside to this approach is that it considers closed cases only, i.e., accounts where the loan has either been paid fully, not been paid, or where nothing has been paid for a certain period (e.g., 90 days). That is, some accounts are deliberately labelled as defaulters because they have not paid any amount for a certain period of time, and the model is then built to predict the probability of default. From an overview of this analysis we can say that the behaviour of accounts/customers of a particular loan product varies across two dimensions. The first dimension is the set of predictor variables mentioned above, which accounts for the variability between two different customers and tries to quantify the probability of being a defaulter. The second dimension is time, which accounts for the change in payment behaviour within a particular observation (customer). This time factor brings in the concept of a failure rate for a customer at any point in time.

So, it is quite clear that the Logistic Modelling approach may help identify/predict defaulters, but it cannot tell when someone will default: after 6 months of loan approval, or after 2 years? Having that answer would serve banking institutions considerably.

To resolve this issue the approach that we can follow is a survival analysis technique which is able to take into account the time variant factors and will be able to answer the following questions:

i) Which borrower will default?
ii) When will that borrower default?

The advantages of Survival analysis method over Logistic Regression for credit scoring are as follows:

i) Survival models naturally match the loan default process,
ii) they give a clearer approach to assessing the likely profitability of an applicant, and
iii) survival estimates provide a forecast as a function of time.

We can understand this with an example:

We expect that rises in interest rates may increase the risk of an individual failing to make payments, due to increased payment demands on loans and mortgages as well as on outstanding credit card debt. The same holds for economic indicators like the unemployment index, property prices, etc. Since these variables are time-varying, it is quite complicated to include them in a Logistic Regression setup, whereas in Survival Analysis they can be easily incorporated.

Survival analysis provides the predicted distribution of 'T' (time to default), along with a number of other advantages:

First, it provides a consistent means of predicting probability of default within many different periods of time (e.g., 12 month default rate, 24 month default rate, etc.).

Second, it possesses an inherent mechanism for taking the most recent data into consideration. By contrast, using Logistic Regression, if one wishes to predict the probability of default within 24 months, customers who joined within the past 24 months cannot be included when fitting the model.

Third, it provides comprehensive information on the predicted behaviour of 'T' via its predicted distribution.


Proportional Hazard Model:

The objective here is to model the time of default for a particular customer. Let us assume T denotes the random variable for time to default.

Let f(t) and F(t) denote the Probability Density Function (PDF) and Cumulative Distribution Function (CDF) of the time 'T' to default (T=0 corresponds to the time of approval of the loan).

The hazard function is then defined as

h(t) = f(t) / (1 − F(t))

and is interpreted as the instantaneous likelihood of defaulting at time 't', given that the customer has not defaulted prior to time 't'. From the definition of the hazard function, it can be shown that

F(t) = 1 − exp( −∫_0^t h(u) du )

Let x_1, x_2, …, x_M denote a set of M predictor variables for an applicant, and define the predictor vector x = [x_1, x_2, …, x_M]′.

In survival analysis, perhaps the most popular way to allow the distribution of 'T' to depend on a set of predictor variables is through a PH survival model, defined below. Denoting the hazard function for a customer with predictors x by h(t; x) to indicate its explicit dependence on x, a PH survival model represents:

h(t; x) = h_0(t) · exp(β′x)

The baseline hazard h_0(t) corresponds to a baseline distribution f_0(t), which can be any standard parametric lifetime distribution, e.g. Exponential, Weibull, Log-Normal, etc.

A quantity that can be extracted from the predicted distribution is the probability that an applicant will default within a specific time period, which is what Logistic Regression produces. For example, the predicted probability of default within 24 months is simply F(24; x), the predicted CDF evaluated at month t = 24. Conceptually, this is the area under the predicted PDF f(t; x) between t = 0 and t = 24.
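As a sketch, assuming a Weibull baseline (one of the parametric choices mentioned above) with entirely hypothetical parameter values, the predicted default probability F(t; x) can be computed as:

```python
# PH model with an assumed Weibull baseline: S(t; x) = exp(-H0(t) * exp(b'x)),
# F(t; x) = 1 - S(t; x), with cumulative baseline hazard H0(t) = (t/scale)**shape.
# Shape, scale, coefficients and predictors are all hypothetical.

import math

SHAPE, SCALE = 1.3, 120.0  # Weibull baseline parameters (time in months)

def default_cdf(t, x, betas):
    """F(t; x): predicted probability of default within t months."""
    h0_cum = (t / SCALE) ** SHAPE                        # cumulative baseline hazard
    risk = math.exp(sum(b * xi for b, xi in zip(betas, x)))
    return 1.0 - math.exp(-h0_cum * risk)

betas = [0.8, -0.5]
x = [0.6, 0.3]

p24 = default_cdf(24, x, betas)  # the 24-month default probability F(24; x)
assert 0.0 < p24 < 1.0
# One model yields consistent PDs over any horizon: F(12; x) <= F(24; x).
assert default_cdf(12, x, betas) <= p24
```

This illustrates the first advantage listed earlier: a single fitted model produces 12-month, 24-month, or any-horizon default rates consistently.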

But the above approach does not include the impact of different economic scenarios, i.e. what will be the change in default rate of the customers joining in different periods of time? To take into account that time variability, the Time Dependent Proportional Hazard (TDPH) model is of great use.


Time Dependent Proportional Hazard Model:

To illustrate the time factor in a default modelling scenario, consider the percentage of customers who defaulted within the first 9 months of the loan tenure for three vintages (i.e., customers joining in three different quarters: Quarter 2 of 2004, Quarter 4 of 2005, and Quarter 4 of 2007). The customers in all three vintages fell into the same FICO scoring band (between 675 and 705). Hence, if one ignored market trends and attempted to predict default probability based only on the applicants' predictor variables, one could naively conclude that customers in the three vintages all have the same default probability. In reality, the default probability is much higher for the Q4 2007 vintage, because of the severe economic downturn in 2008.

To account for such temporal effects, one potential approach is to incorporate macroeconomic variables into the PH survival model.

For a customer who joins during month τ, denote the hazard function by h(t; x, τ) to explicitly indicate its dependence not only on x but also on the time at which the customer joins. In the TDPH survival model, one represents:

h(t; x, τ) = h_0(t) · g(τ + t) · exp(β′x)

where g(·) is a function of calendar time that captures temporal (e.g. macroeconomic) effects shared across all customers.
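A sketch of evaluating such a hazard, with a hypothetical baseline, a hypothetical calendar-time factor tied to an unemployment index, and made-up coefficients:

```python
# TDPH hazard h(t; x, tau) = h0(t) * g(tau + t) * exp(b'x): the calendar-time
# factor g is tied here to a hypothetical unemployment path; every functional
# form and value below is made up for illustration.

import math

def h0(t):
    """Hypothetical Weibull-type baseline hazard."""
    return 0.01 * (t / 12.0) ** 0.3

def g(calendar_month):
    """Hypothetical calendar-time multiplier driven by an unemployment index."""
    unemployment = 0.05 + 0.02 * math.sin(calendar_month / 12.0)
    return math.exp(8.0 * (unemployment - 0.05))

def tdph_hazard(t, x, betas, tau):
    return h0(t) * g(tau + t) * math.exp(sum(b * xi for b, xi in zip(betas, x)))

# Two customers with identical predictors but different joining months face
# different hazards, because g varies with calendar time -- exactly the vintage
# effect described above.
betas, x = [0.5, -0.3], [0.4, 0.2]
h_a = tdph_hazard(6, x, betas, tau=0)
h_b = tdph_hazard(6, x, betas, tau=18)
assert h_a > 0 and h_b > 0 and h_a != h_b
```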

Parameter estimation:

The Maximum Likelihood estimation method can be used to estimate the parameters of the model.

Data Structure:

Unlike the Logistic Modelling procedure, the PH and TDPH models do not require data from a particular time window. For example, in logistic modelling, if we want to model the probability of default within 12 months, we cannot use any data from the year preceding the time of data collection; TDPH/PH models do not have this problem. The basic difference between the data structure for a Logistic Model and for a PH model is the censor variable. Suppose the data collected cover 2002 to 2008 and we want to find the probability of default for a loan with a cut-off point of 24 months. We create a binary (0/1) variable using the cut-off point: '0' for accounts where either the loan tenure is over or payment has been completed, and '1' for censored observations, i.e. open accounts. In addition, macroeconomic factors like the unemployment rate, house price index, interest rate, etc. are included. The macroeconomic factors capture the variability between customers whose loans were approved at different points in time but who have the same loan information (e.g. loan amount, tenure, credit history); this loan information corresponds to the time-dynamic part of the data.
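The censoring construction just described can be sketched as follows; the account records and field layout are hypothetical:

```python
# Building the censor flag: '0' for closed accounts (tenure over / paid),
# '1' for open (censored) accounts, with observation time truncated at the
# 24-month cut-off. Account records and layout are hypothetical.

CUTOFF = 24  # months

accounts = [
    # (account id, months observed, still open?)
    ("A1", 18, False),  # closed account: full outcome observed
    ("A2", 10, True),   # open account: censored at 10 months
    ("A3", 27, True),   # open account observed beyond the cut-off
]

rows = [(acc_id, min(months, CUTOFF), 1 if is_open else 0)
        for acc_id, months, is_open in accounts]
print(rows)
```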

Sampling Methods:

Biased sampling methods can be used to draw the training sample.


Generalized Additive Modelling Approach for Probability of default and Loss given default modelling:

There are several credit scoring methods for calculating the probability of default and loss given default, such as the Logit Model, the Divergence-Discriminant Method, Neural Networks, the Proportional Hazard model, etc. Most of the known methods, being parametric, always involve a distributional assumption to build the model, which might not always be a good choice given the dynamic nature of the economy and of customer behaviour.

To counter that, a semi-parametric approach can be taken, where the parametric part takes care of the conventional effects of the predictors and the non-parametric part takes care of the remainder, which does not follow any known functional form. A suitable modelling approach of this kind is Generalized Additive Modelling.

In statistics, a Generalized Additive Model (GAM) is a generalized linear model in which the linear predictor depends linearly on unknown smooth functions of some predictor variables, and interest focuses on inference about these smooth functions.

The model relates a univariate response variable, 'Y', to some predictor variables, x_i. An exponential family distribution is specified for 'Y' (for example the normal, binomial or Poisson distribution), along with a link function g (for example the identity or log function) relating the expected value of Y to the predictor variables via a structure such as:

g(E(Y)) = β_0 + f_1(x_1) + f_2(x_2) + … + f_m(x_m)

where the f_i are smooth functions estimated from the data.

In the widely-used parametric models, the relationships between continuous predictor variables and the response variable are assumed to have known functional forms, even though these are mostly unknown in empirical applications. By contrast, the semi-parametric GAM does not assume functional forms for these relationships; the data are allowed to determine them. A data-driven cubic B-spline (non-parametric) method is used to estimate the GAM. By doing so, the underlying true relationships between the continuous predictor variables and the binary response variable may be uncovered.
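A sketch of scoring with the additive structure under a logit link; the smooth functions below are hypothetical stand-ins for fitted spline components, chosen only to illustrate the shape of the model:

```python
# Additive structure with a logit link: logit(E[Y]) = b0 + f1(FICO) + f2(UTIL).
# The smooth functions below are hypothetical stand-ins for fitted spline
# components; no values here come from real loan data.

import math

def f_fico(score):
    """Hypothetical smooth effect of FICO score on default risk."""
    return -0.015 * (score - 650) + 0.00001 * (score - 650) ** 2

def f_util(u):
    """Hypothetical smooth effect of revolving credit-line utilization."""
    return 1.2 * u ** 2

def gam_pd(fico, util, b0=-2.0):
    eta = b0 + f_fico(fico) + f_util(util)  # additive predictor
    return 1.0 / (1.0 + math.exp(-eta))     # inverse logit link

p = gam_pd(700, 0.8)
assert 0.0 < p < 1.0
# Under this shape, higher utilization raises the predicted PD.
assert gam_pd(700, 0.9) > gam_pd(700, 0.2)
```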

Performance Measures:

Performance measures for PD and LGD models are universal. The following techniques can be used to validate the models.

Model Validation Techniques:

Usual Model validation techniques for Logistic Regression model like KS Analysis or ROC curve can also be applied in case of a Proportional Hazard model or a Time Dependent Proportional Hazard Model.

Following is a snapshot comparison of ROC curves for the following four procedures:

- Time Dependent Proportional Hazard Model
- Proportional Hazard Model
- Logistic Regression
- Logistic Regression with a dynamic time component


Receiver Operating Characteristic Curve:

ROC is commonly used to determine the overall classification power of a model, as well as to provide information on its performance at any cut-off score. A widely used simple analysis for measuring the performance of a binary classifier is a 2×2 contingency table of type I and type II errors. For a given cut-off score of 0.5, if the estimated probability is over 0.5, the account is classified as a default, or bad loan. Two further measures for validating model performance, defined in terms of the regions of the ROC chart, are:

Area Under the Curve (AUC): B + C, reflecting the total accuracy of the model

Gini Coefficient: B / (A + B)
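The AUC can equivalently be computed by the rank (Mann-Whitney) formulation: the probability that a randomly chosen defaulter scores above a randomly chosen non-defaulter, with the common identity Gini = 2·AUC − 1 relating the two measures. The scores and labels below are made up for illustration:

```python
# AUC via the rank (Mann-Whitney) formulation: the probability that a randomly
# chosen defaulter scores above a randomly chosen non-defaulter, ties counted
# half. Scores and labels are made up for illustration.

def auc(scores, labels):
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

scores = [0.9, 0.8, 0.7, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0, 0]  # 1 = defaulter

a = auc(scores, labels)
gini = 2 * a - 1  # the common identity relating Gini to AUC
print(a, gini)
```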

Data Structure:

Similar to the Logistic Modelling technique, this method also deals with closed accounts only; if one wants to predict the probability of default within 12 months, the data used should precede the time of scoring by at least one year. The following traits should be captured at a customer/account level: monthly income (INC), debt-to-equity ratio (DE), loan amount (FND), monthly payment (MPM), revolving credit line utilization (UTIL), years of employment experience (EMP), home ownership (HOM) and delinquency (DEL) reports within recent history.

Application of GAM: GAM, a semi-parametric method proposed by Hastie and Tibshirani (1990), has been applied to modelling bankruptcies (Berg 2007). It has also been applied in a comprehensive survey of the loan recovery process of Italian banks.

References:

- A Time-Dependent Proportional Hazards Survival Model for Credit Risk Analysis, May 2011 - J.-K. Im, D. W. Apley, C. Qi and X. Shan, Department of Industrial Engineering & Management Sciences, Northwestern University
- Credit Scoring With Macroeconomic Variables Using Survival Analysis, May 2007 - Tony Bellotti and Jonathan Crook, Credit Research Centre, Management School and Economics, University of Edinburgh
- Survival Analysis Methods For Personal Loan Data, April 2002 - Maria Stepanova, UBS AG, Financial Services Group, and Lyn Thomas, Department of Management, University of Southampton
- A Case Study on Using Generalized Additive Models to Fit Credit Rating Scores, 2011 - Marlene Müller, Beuth Hochschule für Technik Berlin
- Nonlinear and Semi-parametric Modelling of Personal Loan Credit Scoring, August 2013 - Nithi Sopitpongstorn, Jean-Pierre Fenech and Param Silvapulle, Department of Econometrics and Business Statistics and Department of Accounting and Finance, Monash University, Australia


Office

Bangalore: 389, 2nd Floor, 9th Main, HSR Layout, Sector – 7, Bangalore – 560 102. Phone: +91-80-42102154

US: 1013 Centre Road, ST # 403S, Wilmington, New Castle, DE 19805. Phone: +1 858 312 1075

www.bridgei2i.com | [email protected] Facebook | Twitter | Google+ | LinkedIn: BRIDGEi2i

About BRIDGEi2i

BRIDGEi2i provides Business Analytics Solutions to enterprises globally, enabling them to achieve accelerated business impact by harnessing the power of data. These analytics services and technology solutions enable business managers to consume more meaningful information from big data, generate actionable insights from complex business problems, and make data-driven decisions across pan-enterprise processes to create sustainable business impact. BRIDGEi2i has featured among the top 10 analytics and big data start-ups in several coveted publications.