machine learning applications in credit risk
TRANSCRIPT
Location:
ARPM Open Source Conference
8/13/2017
Machine Learning applications in Credit Risk
2017 Copyright QuantUniversity LLC.
Presented By:
Sri Krishnamurthy, CFA, CAP
www.analyticscertificate.com
Slides will be available at: www.analyticscertificate.com/MachineLearning
• Founder of QuantUniversity LLC. and www.analyticscertificate.com
• Advisory and Consultancy for Financial Analytics
• Prior Experience at MathWorks, Citigroup and Endeca and 25+ financial services and energy customers.
• Regular Columnist for the Wilmott Magazine
• Author of forthcoming book “Financial Modeling: A case study approach” published by Wiley
• Chartered Financial Analyst and Certified Analytics Professional
• Teaches Analytics in the Babson College MBA program and at Northeastern University, Boston
Sri Krishnamurthy, Founder and CEO
Quantitative Analytics and Big Data Analytics Onboarding
• Data Science, Quant Finance and Machine Learning Advisory
• Trained more than 1000 students in Quantitative methods, Data Science and Big Data Technologies using MATLAB, Python and R
• Launching
▫ Analytics Certificate Program, Spring 2018
▫ Fintech Certification program, Fall 2017
• Building
Credit risk in consumer credit
Credit-scoring models and techniques assess the risk in lending to customers.
Typical decisions:
• Grant or deny credit to new applicants
• Increase or decrease spending limits
• Increase or decrease lending rates
• Decide which new products can be offered to existing applicants
Credit assessment in consumer credit
History:
• Gut feel
• Social network
• Communities and influence
Traditional:
• Scoring mechanisms through credit bureaus
• Bank assessments through business rules
Newer approaches (FINTECH):
• Peer-to-Peer lending
• Lending Club, Prosper Marketplace
Types of algorithms
Machine learning
• Supervised Learning
▫ Prediction
▫ Classification
• Unsupervised Learning
▫ Clustering
Used to derive a relationship between dependent and independent variables
• Prediction
▫ Regression
▫ Decision Trees (CART)
▫ Neural Networks
• Classification
▫ Logistic Regression
▫ CART, Random Forest, SVM
▫ Neural Networks
Supervised Learning
Data pre-processing
Split data into Training and Testing sets
Train the model on Training data
Test the model using Testing data to evaluate model performance
Methodology
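The methodology steps above can be sketched in a few lines of Python. This is only a minimal illustration under stated assumptions: the data is synthetic (a made-up debt-to-income ratio with a default flag), the helper names `train_test_split`, `fit_threshold` and `accuracy` are hypothetical, and the "model" is a simple one-feature threshold rather than any of the algorithms named in the talk. Pre-processing is skipped because the toy data is already clean.

```python
import random

# Hypothetical toy data: (debt-to-income ratio, defaulted flag), invented for illustration.
rng = random.Random(0)
data = [(rng.gauss(0.2, 0.05), 0) for _ in range(50)] + \
       [(rng.gauss(0.5, 0.05), 1) for _ in range(50)]

def train_test_split(rows, test_fraction=0.3, seed=42):
    """Shuffle a copy of the rows, then split into (training, testing) lists."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    n_test = int(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def fit_threshold(train_rows):
    """Train a one-feature classifier: the midpoint between the two class means."""
    good = [x for x, y in train_rows if y == 0]
    bad = [x for x, y in train_rows if y == 1]
    return (sum(good) / len(good) + sum(bad) / len(bad)) / 2

def accuracy(threshold, rows):
    """Share of rows where 'ratio above threshold' matches the default flag."""
    hits = sum(1 for x, y in rows if (x > threshold) == (y == 1))
    return hits / len(rows)

train_rows, test_rows = train_test_split(data)      # split into Training and Testing sets
threshold = fit_threshold(train_rows)               # train on Training data only
test_accuracy = accuracy(threshold, test_rows)      # evaluate on held-out Testing data
print(f"threshold={threshold:.3f}, test accuracy={test_accuracy:.2f}")
```

The key point is that the threshold is fitted without ever seeing the testing rows, so the reported accuracy is an out-of-sample estimate.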
• No distinction between independent variables and dependent variables
• No result labels to determine “correct” results
• Goals:
▫ Data Reduction
▫ Clustering
Unsupervised Learning
• Partitioning Clustering
▫ Starts with K, the number of clusters sought
▫ Observations are randomly divided into groups, then reassigned to form cohesive clusters
▫ Example: K-means
• Hierarchical Agglomerative Clustering
▫ Each observation starts as its own cluster
▫ Clusters are combined two at a time until only one cluster remains
▫ Example: Hierarchical clustering using single linkage, Ward’s method, etc.
Types of Clustering
• Tries to separate samples into K groups, with the goal of maximizing between-group variance and minimizing within-group variance
• Requires K to be specified up front.
• Starts with K initial centroids and iterates to minimize the criterion, or until the specified number of iterations is reached.
• Suited for larger datasets
K-means
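A minimal sketch of the K-means loop described above, written in plain Python. It assumes the common Lloyd-style iteration (assign each point to the nearest centroid, then recompute centroids) with squared Euclidean distance; the function name `kmeans`, the optional `initial_centroids` parameter and the toy points are illustrative choices, not something taken from the talk.

```python
import random

def squared_distance(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def mean_point(points):
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))

def kmeans(points, k, initial_centroids=None, max_iterations=100, seed=0):
    """Alternate between assigning points and updating centroids until stable."""
    if initial_centroids is None:
        # K initial centroids drawn at random from the data
        centroids = random.Random(seed).sample(points, k)
    else:
        centroids = list(initial_centroids)
    labels = None
    for _ in range(max_iterations):
        # assignment step: each point joins its nearest centroid
        new_labels = [min(range(k), key=lambda c: squared_distance(p, centroids[c]))
                      for p in points]
        if new_labels == labels:
            break  # no assignment changed, so the criterion cannot improve further
        labels = new_labels
        # update step: move each centroid to the mean of its members
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:  # keep the old centroid if a cluster ends up empty
                centroids[c] = mean_point(members)
    return labels, centroids

points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.8, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels, centroids)
```

Because the result depends on the starting centroids, practical implementations usually run several random starts and keep the best one.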
• Goal is to derive a dendrogram starting from each record being its own cluster
• Works well for smaller data sets
• Proximity is measured in multiple ways (more later)
Hierarchical clustering
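To make the dendrogram idea concrete, here is a small sketch of agglomerative clustering. It assumes single linkage with Euclidean distance (one of several possible proximity choices) and records the merge history as a list, which is the information a dendrogram plots. The function name `single_linkage` and the toy points are hypothetical.

```python
def euclidean(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q)) ** 0.5

def single_linkage(points):
    """Agglomerative clustering; returns the merge history as (id_a, id_b, distance)."""
    clusters = {i: [i] for i in range(len(points))}  # every record starts as its own cluster
    merges = []
    next_id = len(points)
    while len(clusters) > 1:
        best = None
        ids = sorted(clusters)
        for pos, a in enumerate(ids):
            for b in ids[pos + 1:]:
                # single linkage: distance between the closest pair of members
                d = min(euclidean(points[i], points[j])
                        for i in clusters[a] for j in clusters[b])
                if best is None or d < best[0]:
                    best = (d, a, b)
        d, a, b = best
        # merge the two closest clusters under a new id
        clusters[next_id] = clusters.pop(a) + clusters.pop(b)
        merges.append((a, b, d))
        next_id += 1
    return merges

points = [(0.0, 0.0), (0.0, 1.0), (5.0, 5.0), (5.0, 7.0)]
merges = single_linkage(points)
for a, b, d in merges:
    print(f"merge {a} and {b} at distance {d:.3f}")
```

This brute-force version recomputes every pairwise cluster distance at each step, which is one reason the approach suits smaller data sets.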
• How do you measure similarity between two entities?
▫ Apples and Bananas
▫ Coke and Pepsi vs Orange juice
▫ Honda Civic vs Toyota Corolla
▫ New York and Boston
• The notion of distance
The notion of distance
• Euclidean distance
• Cosine distance
Distance measures
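The two measures named above can be written out directly. This sketch uses the standard definitions (straight-line distance, and one minus the cosine of the angle between two vectors); the function names are illustrative, and the cosine version assumes neither vector is all zeros.

```python
import math

def euclidean_distance(x, y):
    """Straight-line distance between two equal-length numeric vectors."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def cosine_distance(x, y):
    """1 - cosine similarity; assumes neither vector has zero length."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return 1.0 - dot / (norm_x * norm_y)

print(euclidean_distance((0, 0), (3, 4)))
print(cosine_distance((1, 0), (0, 1)))
```

Euclidean distance is sensitive to the scale of each variable, while cosine distance compares direction only, so two vectors that differ just in magnitude are treated as close.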
• Manhattan distance (Taxi-cab distance)
• Jaccard distance
▫ Used to measure similarity or dissimilarity between binary and non-binary variables
▫ http://people.revoledu.com/kardi/tutorial/Similarity/Jaccard.html
Other distance measures
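A short sketch of these two measures, using their standard definitions. The function names are illustrative, and the Jaccard version assumes each observation is represented as a set of items (for example, the attributes that are present); treating two empty sets as identical is a convention chosen here, not something stated in the talk.

```python
def manhattan_distance(x, y):
    """Sum of absolute coordinate differences (taxi-cab distance)."""
    return sum(abs(a - b) for a, b in zip(x, y))

def jaccard_distance(a, b):
    """1 - |intersection| / |union| for two collections treated as sets."""
    a, b = set(a), set(b)
    union = a | b
    if not union:
        return 0.0  # assumed convention: two empty sets are identical
    return 1.0 - len(a & b) / len(union)

print(manhattan_distance((1, 2), (4, 6)))
print(jaccard_distance({"a", "b", "c"}, {"b", "c", "d"}))
```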
• Gower distance is used for calculating distances when we have mixed types of variables (continuous and categorical)
• Variables can be:
▫ Quantitative (such as rating scale)
▫ Binary (such as present/absent)
▫ Nominal (such as worker/teacher/clerk)
• The metrics used for each data type are described below:
▫ Quantitative: range-normalized Manhattan distance
▫ Ordinal: variable is first ranked, then Manhattan distance is used with a special adjustment for ties
▫ Nominal: variables of k categories are first converted into k binary columns and then the Dice coefficient is used (https://en.wikipedia.org/wiki/S%C3%B8rensen%E2%80%93Dice_coefficient )
Working with mixed-data
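The following is a simplified sketch of a Gower-style distance for mixed data, assuming equal weights across variables and handling only quantitative and nominal columns (the ordinal ranking step is left out). The loan records, column kinds and function names are hypothetical and chosen purely for illustration.

```python
def column_ranges(rows, kinds):
    """Range (max - min) of each quantitative column; None for nominal columns."""
    ranges = []
    for col, kind in enumerate(kinds):
        if kind == "quantitative":
            values = [row[col] for row in rows]
            ranges.append(max(values) - min(values))
        else:
            ranges.append(None)
    return ranges

def gower_distance(row_a, row_b, kinds, ranges):
    """Average of per-variable dissimilarities, each scaled to lie in [0, 1]."""
    total = 0.0
    for a, b, kind, value_range in zip(row_a, row_b, kinds, ranges):
        if kind == "quantitative":
            # range-normalized Manhattan distance; a constant column contributes 0
            total += abs(a - b) / value_range if value_range else 0.0
        else:
            # nominal: for a single one-hot encoded variable the Dice dissimilarity
            # reduces to 0 when the categories match and 1 when they differ
            total += 0.0 if a == b else 1.0
    return total / len(kinds)

# Hypothetical records: (loan amount, purpose, home ownership)
loans = [(10000, "car", "RENT"), (20000, "credit_card", "RENT"), (30000, "car", "OWN")]
kinds = ["quantitative", "nominal", "nominal"]
ranges = column_ranges(loans, kinds)
d01 = gower_distance(loans[0], loans[1], kinds, ranges)
d02 = gower_distance(loans[0], loans[2], kinds, ranges)
print(d01, d02)
```

Scaling every variable to the same 0 to 1 range is what lets continuous and categorical columns be averaged into a single distance.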
• daisy: Computes all the pairwise dissimilarities (distances) between observations in the data set
• pam: Partitions (clusters) the data into k clusters “around medoids”, a more robust version of K-means.
• agnes: Computes agglomerative nesting (hierarchical clustering) of the dataset.
Support in R
Lending club
The Data
https://www.lendingclub.com/info/download-data.action
The Data
https://www.kaggle.com/wendykan/lending-club-loan-data
Variable description
• Calculate dissimilarity between observations.
• Select algorithm to group observations together
• Choose the best number of clusters
• Visualize clusters on reduced dimensions
Objective
• Partitioning around medoids (PAM) is used in this case.
• PAM is an iterative clustering procedure with the following steps:
▫ Step 1: Choose k random entities to become the medoids.
▫ Step 2: Assign every entity to its closest medoid (using the distance matrix we have calculated).
▫ Step 3: For each cluster, identify the observation that would yield the lowest average distance if it were re-assigned as the medoid. If it differs from the current medoid, make this observation the new medoid.
▫ Step 4: If at least one medoid has changed, return to step 2. Otherwise, end the algorithm.
Selecting number of clusters
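The four PAM steps listed above can be sketched as follows. This is a plain-Python illustration that follows the steps as written on the slide and works from a precomputed distance matrix; library implementations may initialise and swap medoids differently. The function names, the optional `initial_medoids` parameter and the toy one-dimensional data are assumptions made for the example.

```python
import random

def assign(dist, medoids):
    """Step 2: label each entity with the index of its closest medoid."""
    return [min(range(len(medoids)), key=lambda c: dist[i][medoids[c]])
            for i in range(len(dist))]

def pam(dist, k, initial_medoids=None, seed=0, max_iterations=100):
    n = len(dist)
    # Step 1: choose k random entities as medoids (or use the ones supplied)
    if initial_medoids is None:
        medoids = random.Random(seed).sample(range(n), k)
    else:
        medoids = list(initial_medoids)
    for _ in range(max_iterations):
        labels = assign(dist, medoids)
        new_medoids = []
        for c in range(k):
            members = [i for i in range(n) if labels[i] == c]
            if not members:  # possible with duplicate points; keep the old medoid
                new_medoids.append(medoids[c])
                continue
            # Step 3: the member with the lowest total (hence average) distance
            # to the rest of its cluster becomes the medoid
            new_medoids.append(min(members, key=lambda m: sum(dist[m][j] for j in members)))
        # Step 4: stop once no medoid has changed
        if new_medoids == medoids:
            break
        medoids = new_medoids
    return medoids, assign(dist, medoids)

values = [0, 1, 2, 10, 11, 12]                      # toy one-dimensional observations
dist = [[abs(a - b) for b in values] for a in values]
medoids, labels = pam(dist, k=2, initial_medoids=[0, 1])
print(medoids, labels)
```

Because the procedure only needs pairwise distances, the same loop works with a Gower dissimilarity matrix computed on mixed data.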
• One way to visualize many variables in a lower dimensional space is with t-distributed stochastic neighbor embedding (t-SNE)
• This method is a dimension reduction technique that tries to preserve local structure so as to make clusters visible in a 2D or 3D visualization.
• https://en.wikipedia.org/wiki/T-distributed_stochastic_neighbor_embedding
Visualization with reduced dimension
Alternative Credit scoring in the news
Fintech being noticed by Regulators
• The regulatory sandbox allows businesses to test innovative products, services, business models and delivery mechanisms in the real market, with real consumers.
• The sandbox is a supervised space, open to both authorized and unauthorized firms, that provides firms with:
▫ reduced time-to-market at potentially lower cost
▫ appropriate consumer protection safeguards built in to new products and services
▫ better access to finance
• https://www.fca.org.uk/firms/regulatory-sandbox
Regulatory Sandboxes
US Regulators catching up
Model Validation
• “Model risk is the potential for adverse consequences from decisions based on incorrect or misused model outputs and reports.” [1]
• “Model validation is the set of processes and activities intended to verify that models are performing as expected, in line with their design objectives and business uses.” [1]
• Ref: [1] Supervisory Letter SR 11-7, Guidance on Model Risk Management
Popularity of Open-source software in the enterprise increasing
• Financial Services customers like Capital One, FINRA, and Pacific Life are moving critical workloads to AWS
Cloud maturing
• Versions and packages
Challenges in adopting Open-source software in the enterprise
• Difficulty in replicating and reconciling differences in environments
Challenges in adopting Open-source software in the enterprise
• Deploying models built by Data Scientists still a problem
Challenges in adopting Open-source software in the enterprise
Data Scientists Enterprise IT
• The try-before-adopt model is difficult with unproven open-source solutions
Challenges in adopting Open-source software in the enterprise
Quant/Enterprise use cases
• Create an environment that can support multiple platforms and programming languages
• Enable remote running of applications
• Ability to try out a GitHub submission or someone else’s code
• Facilitate creation of Docker images to create replicable containers
• Create prototyping environments for Data Science/Quant teams
• Enable Data scientists/Quants to deploy their solutions
• Enable running multiple experiments concurrently
• Integrate seamlessly with the cloud to scale up computations
Use cases
Fintech use cases
• To demonstrate solutions to enterprises
• Create customized enterprise trials for companies that don’t permit installation of vendor software prior to procurement
• To manage quick updates
• Enable effective integration and hosting of services (REST APIs)
Use cases
Academic use cases
• Enable creation of course material and exercises that could be shared
• Enable students and workshop participants to focus on the data science experiments rather than environment setting
Use cases
Creating replicable environments
Create and manage replicable environments (Code + software + data) in a single portal
Creating replicable environments
Create replicable environments (Code + software + data) through an easy point & click tool and publish to Dockerhub or manage internally
Share it with target users
User portal
• Run multiple experiments in pre-created environments (Code + software + data)
• Deploy your own solutions
• Run any Docker image or Github submission on the cloud
Run Jupyter notebooks and prototype applications
Run RStudio and Shiny applications
Run any Docker application
Manage tasks and errors
User portal
• Dockerize and deploy applications on AWS in just a few steps
Deploy applications with ease
QU’s open source project – Project Mozaic
Thank you ARPM and enjoy the boot camp!
Check out our programs at: www.analyticscertificate.com/fintech
www.qusandbox.com
Sri Krishnamurthy, CFA, CAP
Founder and CEO
QuantUniversity LLC.
srikrishnamurthy
www.QuantUniversity.com
Information, data and drawings embodied in this presentation are strictly a property of QuantUniversity LLC. and shall not be distributed or used in any other publication without the prior written consent of QuantUniversity LLC.