Detecting Credit Card Fraud with Machine Learning
Aaron Rosenbaum | [email protected] | CS 229 | Spring 2019

Objectives

Motivation
• Payments fraud is a significant and growing issue
• More than $8 billion in 2015, up 37% from 2012
• A key challenge with fraud data is class imbalance

Goal
• Implement and assess ML algorithms to detect credit card fraud
• Investigate strategies to address class imbalance
Discussion and Future Work
• Oversampling and synthetic data generation, when properly tuned, can lead to superior predictive performance in the face of class imbalance
• Random forests are highly effective, easy to implement, and appear robust to class imbalance, at least for this particular dataset
• Given that payments fraud is constantly evolving, future work might apply reinforcement learning to a real-time data stream
Data

Credit card dataset from Kaggle: n = 284,807 transactions, d = 31 variables.

Variable   Description
Time       Time since first transaction
V1-V28     Non-descriptive variables (PCA-transformed to protect privacy)
Amount     Transaction amount
Class      1 = fraud; 0 otherwise

Class distribution:
             Volume    % of all tx
All tx      284,807    100
Fraud           492    0.17
Not fraud   284,315    99.83
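As a quick sanity check, the imbalance figures follow directly from the reported counts (plain Python; the counts come from the class-distribution table above):

```python
# Class counts reported for the Kaggle credit-card dataset
total_tx = 284_807          # all transactions
fraud_tx = 492              # labeled Class = 1
legit_tx = total_tx - fraud_tx

fraud_pct = 100 * fraud_tx / total_tx
print(f"{legit_tx} legitimate, {fraud_tx} fraud ({fraud_pct:.2f}% fraud)")
```

At roughly 1 fraud per 580 transactions, a classifier that always predicts "not fraud" is 99.83% accurate, which is why accuracy alone is a poor yardstick here.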
[Figure: PCA visualizations of the dataset]
Sampling Methods
• Undersampling: randomly delete observations from the majority class
• Oversampling: randomly resample from the minority class
• Both: combine under- and oversampling
• ROSE: create artificial samples of the minority class in the neighborhood of existing ones
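The random strategies can be sketched in a few lines of NumPy. This is an illustrative version only: the function names are mine, and ROSE (an R package that generates synthetic minority samples, similar in spirit to SMOTE) is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, rng):
    """Randomly drop majority-class rows until the classes are balanced."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Resample minority-class rows with replacement up to the majority size."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    extra = rng.choice(mino, size=len(maj), replace=True)
    idx = np.concatenate([maj, extra])
    return X[idx], y[idx]

# Toy imbalanced data: 95 legitimate rows, 5 fraud rows
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

Xu, yu = undersample(X, y, rng)
Xo, yo = oversample(X, y, rng)
print(yu.mean(), yo.mean())  # both resampled sets are balanced at 0.5
```

"Both" would simply apply the two functions with intermediate target sizes.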
Models

Simple logistic regression with linear boundary:
\( h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \), with log-likelihood
\( \ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(x^{(i)})\big) \right] \)
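The hypothesis and log-likelihood above translate directly into NumPy (a minimal sketch with made-up toy inputs; the function names are mine):

```python
import numpy as np

def h(theta, X):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    p = h(theta, X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.zeros(2)
X = np.array([[1.0, 2.0], [1.0, -1.0]])  # first column is the intercept
y = np.array([1.0, 0.0])
print(h(theta, X))                  # [0.5, 0.5] at theta = 0
print(log_likelihood(theta, X, y))  # 2 * log(0.5) ≈ -1.386
```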
Logistic regression with quadratic boundary, LASSO: the \( L_1 \) penalty \( \lambda \|\theta\|_1 \) reduces variance by deleting some quadratic terms.
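One way to realize this model is scikit-learn's pipeline of degree-2 polynomial features plus L1-penalized logistic regression; this is an assumed stand-in on toy data (the poster does not specify the implementation), and note that scikit-learn parameterizes the penalty as C = 1/λ, so a small C means strong shrinkage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy imbalanced data standing in for the fraud set
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9],
                           random_state=0)

# Quadratic boundary via degree-2 features; the L1 (LASSO) penalty can zero
# out some of the quadratic terms, reducing variance.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
print(f"{np.sum(coefs == 0)} of {coefs.size} coefficients driven to zero")
```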
Random forest: averages many trees grown on bootstrap samples:
\( \hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \)
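The averaging formula can be checked directly with scikit-learn's RandomForestClassifier (an assumed stand-in on toy data): the forest's probability estimate is defined as the mean of the per-tree estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data standing in for the fraud set
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95],
                           random_state=0)

# B = 200 trees, each grown on a bootstrap sample of the training data
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# The forest's probability estimate equals the average over its trees,
# matching f_avg(x) = (1/B) * sum_b f*b(x)
avg = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
print(np.allclose(avg, rf.predict_proba(X)))  # True
```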
Neural network: 1 hidden layer, fully connected, sigmoid activation.
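A network of this shape can be sketched with scikit-learn's MLPClassifier (an assumed stand-in on toy data; the hidden width of 16 is my choice, not from the poster), where activation="logistic" gives the sigmoid hidden layer:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One fully connected hidden layer with sigmoid ("logistic") activation
nn = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                   max_iter=2000, random_state=0)
nn.fit(X, y)
print(nn.score(X, y))  # training accuracy
```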
Implementation

Dataset partition: 2/3 train, 1/6 validation, 1/6 test

Tournament-style procedure:
• Stage 1: Train variants of each model using different sampling strategies
• Stage 2: The best-performing model in each category on the validation set is refit on the combined train/validation set and assessed against the test set
• Primary performance metric: AUPRC, due to class imbalance
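The choice of AUPRC can be motivated with a small scikit-learn example (toy labels, my construction): under heavy imbalance, accuracy is dominated by the majority class, while average precision (the usual AUPRC estimator) tracks ranking quality on the rare class.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 990 + [1] * 10)   # 1% positive class

# Always predicting "not fraud" scores 99% accuracy here, which is why
# accuracy is misleading; AUPRC exposes the difference between classifiers.
random_ap = average_precision_score(y_true, rng.random(1000))
perfect_ap = average_precision_score(y_true, y_true.astype(float))
print(random_ap, perfect_ap)  # random scores sit near the 1% base rate; perfect ranking gives 1.0
```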
Results: Stage 1

AUPRC on the validation set; the category winner (highest AUPRC) advanced to Stage 2:

Simple logistic: linear
  No sampling      0.6973303
  Undersampling    0.6813981
  Oversampling     0.7267684
  Both             0.7061640
  ROSE             0.7301219  ← winner

Logistic: quadratic, LASSO
  Undersampling    0.6915803
  Oversampling     0.7935913  ← winner
  Both             0.7036836
  ROSE             0.5475979
  (No sampling did not converge)

Random forest
  No sampling      0.8448476
  Undersampling    0.8177367
  Oversampling     0.8504548  ← winner
  Both             0.8381937
  ROSE             0.7703227

Neural network
  Undersampling    0.6919809
  Oversampling     0.2757658
  Both             0.2979033
  ROSE             0.7118781  ← winner
  Weighted option  0.2266319
  (No sampling did not converge)

[Figures: cross-validation for the sampling proportion (simple logistic regression); LASSO results for the quadratic logistic regression (undersampling); random-forest error rate vs. number of trees (oversampling)]

Results: Stage 2

Finalist performance on the test set (decision boundary at probability > 1/2):

                         AUROC    AUPRC    Accuracy  Sensitivity  Specificity  F1
Linear logistic (ROSE)   0.98368  0.83476  0.9995    0.797619     0.999831     0.842673
Quad logistic (Over)     0.98398  0.88409  0.9993    0.734043     0.999873     0.816568
Random forest (Over)     0.98396  0.90957  0.9997    0.929577     0.999810     0.904110
Neural net (ROSE)        0.98280  0.73169  0.9984    0.831169     0.999768     0.842105
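The per-class metrics can be reproduced from the random-forest confusion matrix (TN = 47,438, FN = 5, FP = 9, TP = 66), a useful check that the table and the matrices agree:

```python
# Random forest (oversampling) confusion matrix on the test set
tn, fn = 47_438, 5   # predicted 0: true negatives, false negatives
fp, tp = 9, 66       # predicted 1: false positives, true positives

sensitivity = tp / (tp + fn)   # recall on the fraud class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.6f} specificity={specificity:.6f} f1={f1:.6f}")
```

These match the table's 0.929577, 0.999810, and 0.904110 for the random forest.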
• Thank you to the CS229 teaching staff!
Confusion matrices on the test set (rows = predicted, columns = truth):

Linear logistic (ROSE)
          Truth 0   Truth 1
Pred 0      47426        17
Pred 1          8        67

Quadratic logistic (Over)
          Truth 0   Truth 1
Pred 0      47418        25
Pred 1          6        69

Random forest (Over)
          Truth 0   Truth 1
Pred 0      47438         5
Pred 1          9        66

Neural net (ROSE)
          Truth 0   Truth 1
Pred 0      47430        13
Pred 1         11        64