machine learning applications in credit risk

Location:

ARPM Open Source Conference

8/13/2017

Machine Learning applications in Credit Risk

2017 Copyright QuantUniversity LLC.

Presented By:

Sri Krishnamurthy, CFA, CAP

www.analyticscertificate.com

Slides will be available at: www.analyticscertificate.com/MachineLearning

Sri Krishnamurthy, Founder and CEO

• Founder of QuantUniversity LLC. and www.analyticscertificate.com

• Advisory and Consultancy for Financial Analytics

• Prior experience at MathWorks, Citigroup and Endeca, and with 25+ financial services and energy customers

• Regular Columnist for the Wilmott Magazine

• Author of forthcoming book “Financial Modeling: A case study approach”, published by Wiley

• Chartered Financial Analyst and Certified Analytics Professional

• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston

Quantitative Analytics and Big Data Analytics Onboarding

• Data Science, Quant Finance and Machine Learning Advisory

• Trained more than 1000 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R

• Launching

▫ Analytics Certificate Program, Spring 2018

▫ Fintech Certification program, Fall 2017

• Building


http://www.analyticscertificate.com/fintech

http://www.qusandbox.com/

Credit risk in consumer credit

Credit-scoring models and techniques assess the risk in lending to customers.

Typical decisions:

• Grant credit or not to new applicants

• Increasing/Decreasing spending limits

• Increasing/Decreasing lending rates

• What new products can be given to existing applicants?

Credit assessment in consumer credit

History:

• Gut feel

• Social network

• Communities and influence

Traditional:

• Scoring mechanisms through credit bureaus

• Bank assessments through business rules

Newer approaches (FINTECH):

• Peer-to-Peer lending

• Lending Club, Prosper Marketplace

Types of algorithms

Machine learning

• Supervised Learning

▫ Prediction

▫ Classification

• Unsupervised Learning

▫ Clustering

Supervised Learning

Used to derive a relationship between dependent and independent variables

• Prediction

▫ Regression

▫ Decision Trees (CART)

▫ Neural Networks

• Classification

▫ Logistic Regression

▫ CART, Random Forest, SVM

▫ Neural Networks

Methodology

• Data pre-processing

• Split data into Training and Testing sets

• Train the model on Training data

• Test the model using Testing data to evaluate model performance
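The split/train/test workflow can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not code from the talk: the toy data (a debt-to-income ratio with a default flag), the `train_test_split` helper and the single-threshold "model" are all hypothetical stand-ins for a real data set and classifier, and pre-processing is skipped because the toy data is already clean.

```python
import random

def train_test_split(rows, test_fraction=0.25, seed=0):
    """Shuffle a copy of the rows and split them into training and testing sets."""
    shuffled = rows[:]
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(rows, threshold):
    """Share of rows where 'x >= threshold' agrees with the default label."""
    hits = sum((x >= threshold) == bool(y) for x, y in rows)
    return hits / len(rows)

def fit_threshold(train):
    """'Train' a one-feature classifier: keep the cut-off with the best training accuracy."""
    candidates = sorted({x for x, _ in train})
    return max(candidates, key=lambda t: accuracy(train, t))

# Hypothetical toy data: (debt-to-income ratio, defaulted? 1/0)
data = [(0.10, 0), (0.15, 0), (0.20, 0), (0.25, 0), (0.30, 0), (0.35, 1),
        (0.40, 0), (0.45, 1), (0.50, 1), (0.55, 1), (0.60, 1), (0.65, 1)]

train, test = train_test_split(data)        # split into Training and Testing sets
threshold = fit_threshold(train)            # train on Training data only
print("threshold:", threshold)
print("test accuracy:", accuracy(test, threshold))   # evaluate on held-out data
```

The point of the design is that `fit_threshold` never sees the test rows, so the reported accuracy is an estimate of performance on unseen applicants.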

Unsupervised Learning

• No distinction between independent variables and dependent variables

• No result labels to determine “correct” results

• Goals:

▫ Data Reduction

▫ Clustering

Types of Clustering

• Partitioning Clustering

▫ Starts with K, the number of clusters sought

▫ Observations are randomly divided into K groups and then reshuffled to form cohesive clusters

▫ Example: K-means

• Hierarchical Agglomerative Clustering

▫ Each observation starts as its own cluster

▫ Clusters are combined two at a time until only one cluster remains

▫ Example: Hierarchical clustering using single linkage, Ward’s method, etc.

K-means

• Tries to separate samples into K groups with the goal of maximizing between-group variance and minimizing within-group variance

• Requires K to be specified up front

• Starts with K initial centroids and iterates to minimize the criterion until it converges or the specified number of iterations is reached

• Suited for larger datasets
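A minimal pure-Python sketch of the K-means loop described on this slide (Lloyd's algorithm). The function names, the seed handling and the toy points are assumptions made for illustration; in practice a library implementation would normally be used.

```python
import random

def sq_dist(a, b):
    """Squared Euclidean distance between two numeric tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, max_iter=100, seed=0):
    """Lloyd's algorithm: returns (centroids, labels) for a list of numeric tuples."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)           # K initial centroids drawn from the data
    labels = None
    for _ in range(max_iter):                   # stop after the specified number of iterations
        new_labels = [min(range(k), key=lambda j: sq_dist(p, centroids[j]))
                      for p in points]
        if new_labels == labels:                # or earlier, once assignments stop changing
            break
        labels = new_labels
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:                         # keep the old centroid if a cluster empties
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return centroids, labels

# Hypothetical toy data: two visibly separate groups
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
centroids, labels = kmeans(points, k=2)
print(centroids)
print(labels)
```

Because the starting centroids are random, different seeds can give different (locally optimal) partitions, which is why K-means is often run several times.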

Hierarchical clustering

• Goal is to derive a dendrogram starting from each record being its own cluster

• Works well for smaller data sets

• Proximity is measured in multiple ways (more later)
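The agglomerative idea can be sketched as follows, assuming single linkage and one-dimensional records for brevity. `single_linkage` and the sample values are hypothetical; the returned merge history is a flat stand-in for the dendrogram.

```python
def single_linkage(points):
    """Agglomerative clustering on 1-D values; returns the merge history."""
    clusters = [[i] for i in range(len(points))]    # every record starts as its own cluster
    merges = []
    while len(clusters) > 1:
        best = None
        # find the pair of clusters whose closest members are nearest (single linkage)
        for a in range(len(clusters)):
            for b in range(a + 1, len(clusters)):
                d = min(abs(points[i] - points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        merges.append((sorted(clusters[a]), sorted(clusters[b]), d))
        clusters[a] = clusters[a] + clusters[b]     # combine two clusters at a time
        del clusters[b]
    return merges

points = [1.0, 1.5, 5.0, 5.2, 9.0]                  # hypothetical 1-D records
for left, right, height in single_linkage(points):
    print(left, "+", right, "at distance", round(height, 2))
```

Each pass scans all cluster pairs, which is one reason the approach is better suited to smaller data sets.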

The notion of distance

How do you measure similarity between two entities?

▫ Apples and Bananas

▫ Coke and Pepsi vs Orange juice

▫ Honda Civic vs Toyota Corolla

▫ New York and Boston

Distance measures

• Euclidean distance

• Cosine distance
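As a small illustration of the two measures named on this slide, the sketch below uses the standard definitions; the function names and sample vectors are assumptions.

```python
import math

def euclidean(a, b):
    """Straight-line distance between two numeric vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def cosine_distance(a, b):
    """1 - cosine similarity; ignores vector length and compares direction only."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)

a, b = (3.0, 4.0), (6.0, 8.0)
print(euclidean(a, b))         # 5.0
print(cosine_distance(a, b))   # same direction, so (numerically) zero
```

The example shows why the choice matters: the two vectors are far apart in Euclidean terms but identical in direction.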

Other distance measures

• Manhattan distance (Taxi-cab distance)

• Jaccard distance

▫ Used to measure similarity or dissimilarity between binary and non-binary variables

▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
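A short sketch of these two measures, using the standard definitions. The helper names and the product-holding example are hypothetical.

```python
def manhattan(a, b):
    """Sum of absolute coordinate differences (taxi-cab distance)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def jaccard_distance(a, b):
    """1 - |A ∩ B| / |A ∪ B| for two sets, e.g. the attributes that are 'present'."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0                       # two empty sets are treated as identical
    return 1.0 - len(a & b) / len(a | b)

print(manhattan((1, 2), (4, 6)))         # 3 + 4 = 7
# Hypothetical example: products held by two customers (1 shared out of 3 overall)
print(jaccard_distance({"mortgage", "credit_card"}, {"credit_card", "auto_loan"}))
```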

Working with mixed-data

• Gower distance is used for calculating distances when we have mixed types of variables (continuous and categorical)

• Variables can be:

▫ Quantitative (such as rating scale)

▫ Binary (such as present/absent)

▫ Nominal (such as worker/teacher/clerk)

• The metrics used for each data type are described below:

▫ Quantitative: range-normalized Manhattan distance

▫ Ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties

▫ Nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient)
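A simplified sketch of a Gower-style distance for records that mix numeric and nominal fields. Several things here are assumptions: the function name, equal weighting of variables, the omission of the ordinal case, and the hypothetical loan records. For a single nominal variable expanded into k binary columns, the Dice dissimilarity reduces to 0 when the categories match and 1 when they differ, which is what the code uses.

```python
def gower_distance(x, y, kinds, ranges):
    """Average per-variable dissimilarity between two mixed-type records.

    kinds[i] is "num" or "cat"; ranges[i] is (max - min) of variable i over
    the data set and is only used for numeric variables.
    """
    total = 0.0
    for xi, yi, kind, span in zip(x, y, kinds, ranges):
        if kind == "num":
            total += abs(xi - yi) / span if span else 0.0   # range-normalized Manhattan
        else:
            total += 0.0 if xi == yi else 1.0               # nominal: match / mismatch
    return total / len(kinds)

# Hypothetical loan records: (loan amount, interest rate %, home ownership)
loans = [(10000, 8.0, "RENT"), (20000, 12.0, "OWN"), (15000, 8.0, "RENT")]
kinds = ["num", "num", "cat"]
ranges = [max(r[i] for r in loans) - min(r[i] for r in loans) if kind == "num" else None
          for i, kind in enumerate(kinds)]

print(gower_distance(loans[0], loans[1], kinds, ranges))   # all variables maximally different
print(gower_distance(loans[0], loans[2], kinds, ranges))   # only the loan amount differs
```

Dividing by the range keeps every numeric contribution in [0, 1], so no single variable dominates because of its units.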

Support in R (functions from the cluster package)

• daisy: Computes all the pairwise dissimilarities (distances) between observations in the data set

• pam: Partitions (clusters) the data into k clusters “around medoids”, a more robust version of K-means

• agnes: Computes agglomerative nesting (hierarchical clustering) of the data set

Lending Club

The Data

https://www.lendingclub.com/info/download-data.action

https://www.kaggle.com/wendykan/lending-club-loan-data

Variable description

Objective

• Calculate dissimilarity between observations

• Select algorithm to group observations together

• Choose the best number of clusters

• Visualize clusters on reduced dimensions

• Partitioning around medoids (PAM) is used in this case.

• PAM is an iterative clustering procedure with the following steps:

▫ Step 1: Choose k random entities to become the medoids.

▫ Step 2: Assign every entity to its closest medoid (using the distance matrix we have calculated).

▫ Step 3: For each cluster, identify the observation that would yield the lowest average distance if it were to be re-assigned as the medoid. If it differs from the current medoid, make this observation the new medoid.

▫ Step 4: If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.
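The four steps can be sketched directly in Python. This follows the simple alternating procedure listed above, not necessarily what R's `pam` does internally; the function name, seed handling and toy values are assumptions.

```python
import random

def pam(dist, k, seed=0, max_iter=100):
    """Simplified PAM on a precomputed distance matrix; returns (medoids, labels)."""
    n = len(dist)

    def assign(medoids):
        # Step 2: index of the closest medoid for every entity
        return [min(range(k), key=lambda c: dist[i][medoids[c]]) for i in range(n)]

    medoids = random.Random(seed).sample(range(n), k)    # Step 1: k random medoids
    for _ in range(max_iter):
        labels = assign(medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if members:
                # Step 3: the member with the lowest total (hence average)
                # distance to the rest of its cluster becomes the medoid
                best = min(members, key=lambda m: sum(dist[m][j] for j in members))
            else:
                best = medoids[c]                        # empty cluster: keep the old medoid
            new_medoids.append(best)
        if new_medoids == medoids:                       # Step 4: nothing changed, stop
            break
        medoids = new_medoids
    return medoids, assign(medoids)

# Hypothetical 1-D values; in practice dist would be a Gower dissimilarity matrix
values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]
dist = [[abs(p - q) for q in values] for p in values]
medoids, labels = pam(dist, k=2)
print(medoids, labels)
```

Because PAM only needs the distance matrix, it works with any dissimilarity, including ones defined on mixed data types, and its medoids are always actual observations.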

Selecting number of clusters
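The slides do not state which criterion was used to pick the number of clusters. One common choice with PAM is the average silhouette width, choosing the k that scores highest; the sketch below assumes that criterion, and the function name and toy values are hypothetical.

```python
def avg_silhouette(dist, labels):
    """Average silhouette width for a labelling, given a distance matrix.

    Assumes at least two clusters. Points in singleton clusters score 0.
    """
    n = len(labels)
    clusters = set(labels)
    total = 0.0
    for i in range(n):
        own = [j for j in range(n) if labels[j] == labels[i] and j != i]
        if not own:
            continue                                   # singleton cluster contributes 0
        a = sum(dist[i][j] for j in own) / len(own)    # mean distance within own cluster
        b = min(                                       # mean distance to the nearest other cluster
            sum(dist[i][j] for j in range(n) if labels[j] == c) / labels.count(c)
            for c in clusters if c != labels[i]
        )
        total += (b - a) / max(a, b) if max(a, b) > 0 else 0.0
    return total / n

values = [1.0, 1.2, 0.8, 8.0, 8.3, 7.9]                # hypothetical 1-D values
dist = [[abs(p - q) for q in values] for p in values]
two = avg_silhouette(dist, [0, 0, 0, 1, 1, 1])         # a k = 2 labelling
three = avg_silhouette(dist, [0, 0, 2, 1, 1, 1])       # a k = 3 labelling
print(two, three)                                      # the higher score suggests the better k
```

In a full workflow the labellings would come from running the clustering for each candidate k rather than being written by hand.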

Visualization with reduced dimension

• One way to visualize many variables in a lower dimensional space is with t-distributed stochastic neighbor embedding (t-SNE)

• This method is a dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization.

• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding

Alternative Credit scoring in the news

Fintech being noticed by Regulators

Regulatory Sandboxes

• The regulatory sandbox allows businesses to test innovative products, services, business models and delivery mechanisms in the real market, with real consumers.

• The sandbox is a supervised space, open to both authorized and unauthorized firms, that provides firms with:

▫ reduced time-to-market at potentially lower cost

▫ appropriate consumer protection safeguards built in to new products and services

▫ better access to finance

• https://www.fca.org.uk/firms/regulatory-sandbox

US Regulators catching up

Model Validation

• “Model risk is the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports.” [1]

• “Model validation is the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses.” [1]

• Ref: [1] Supervisory Letter SR 11-7, Guidance on Model Risk Management, http://www.federalreserve.gov/bankinforeg/srletters/sr1107.htm

Popularity of Open-source software in the enterprise increasing

Cloud maturing

• Financial Services customers like Capital One, FINRA, and Pacific Life are moving critical workloads to AWS

Challenges in adopting Open-source software in the enterprise

• Versions and packages

• Difficulty in replicating and reconciling differences in environments

• Deploying models built by Data Scientists is still a problem (Data Scientists vs. Enterprise IT)

• The try-before-adopt model is difficult with unproven open-source solutions

www.QuSandbox.com


Quant/Enterprise use cases

• Create an environment that can support multiple platforms and programming languages

• Enable remote running of applications

• Ability to try out a GitHub submission or someone else’s code

• Facilitate creation of Docker images to create replicable containers

• Create prototyping environments for Data Science/Quant teams

• Enable Data scientists/Quants to deploy their solutions

• Enable running multiple experiments concurrently

• Integrate seamlessly with the cloud to scale up computations

Use cases

Fintech use cases

• To demonstrate solutions to enterprises

• Create customized enterprise trials for companies that don’t permit installation of vendor software prior to procurement

• To manage quick updates

• Enable effective integration and hosting of services (REST APIs)

Academic use cases

• Enable creation of course material and exercises that could be shared

• Enable students and workshop participants to focus on the data science experiments rather than environment setting

Creating replicable environments

Create and manage replicable environments (Code + software + data) in a single portal

• Create replicable environments (Code + software + data) through an easy point & click tool

• Publish to Dockerhub or manage internally

• Share it with target users

User portal

• Run multiple experiments in pre-created environments (Code + software + data)

• Deploy your own solutions

• Run any Docker image or GitHub submission on the cloud

Run Jupyter notebooks and prototype applications

Run RStudio and Shiny applications

Run any Docker application

Manage tasks and errors

User portal

• Dockerize and deploy applications on AWS in just a few steps

Deploy applications with ease

QU’s open source project – Project Mozaic

www.QuSandbox.com

www.analyticscertificate.com/MachineLearning

Thank you ARPM and enjoy the boot camp!

Check out our programs at: www.analyticscertificate.com/fintech

www.qusandbox.com

Sri Krishnamurthy, CFA, CAP, Founder and CEO

QuantUniversity LLC.

srikrishnamurthy

www.QuantUniversity.com

Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.


http://www.modelriskanalytics.com/
