Detecting Credit Card Fraud with Machine Learning
Aaron Rosenbaum | [email protected] | CS 229 | Spring 2019

Objectives

Motivation
• Payments fraud is a significant and growing issue
• More than $8 billion in 2015, up 37% from 2012
• A key challenge with fraud data is class imbalance

Goal
• Implement and assess ML algorithms to detect credit card fraud
• Investigate strategies to address class imbalance
Discussion and Future Work
• Oversampling and synthetic data generation, when properly tuned, can lead to superior predictive performance in the face of class imbalance
• Random forests are highly effective, easy to implement, and appear robust to class imbalance, at least for this particular dataset
• Given that payments fraud is constantly evolving, future work might apply reinforcement learning to a real-time data stream
Data

Credit card dataset from Kaggle: n = 284,807 transactions, d = 31 variables.

Variable   Description
Time       Time since first transaction
V1-V28     Non-descriptive variables (PCA-transformed to protect privacy)
Amount     Transaction amount
Class      1 = fraud; 0 otherwise

Class distribution:
             Volume    % of all tx
All tx      284,807    100
Fraud           492    0.17
Not fraud   284,315    99.83
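As a quick sanity check, the imbalance figures follow directly from the reported counts (plain Python; the counts come from the class-distribution table above):

```python
# Class counts reported for the Kaggle credit-card dataset
total_tx = 284_807          # all transactions
fraud_tx = 492              # labeled Class = 1
legit_tx = total_tx - fraud_tx

fraud_pct = 100 * fraud_tx / total_tx
print(f"{legit_tx} legitimate, {fraud_tx} fraud ({fraud_pct:.2f}% fraud)")
```

At roughly 1 fraud per 580 transactions, a classifier that always predicts "not fraud" is 99.83% accurate, which is why accuracy alone is a poor yardstick here.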
[Figure: PCA visualizations of the dataset]
Sampling Methods
• Undersampling: randomly delete observations from the majority class
• Oversampling: randomly resample from the minority class
• Both: combine under- and oversampling
• ROSE: create artificial samples of the minority class in the neighborhood of existing ones
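The random strategies can be sketched in a few lines of NumPy. This is an illustrative version only: the function names are mine, and ROSE (an R package that generates synthetic minority samples, similar in spirit to SMOTE) is not implemented here.

```python
import numpy as np

rng = np.random.default_rng(0)

def undersample(X, y, rng):
    """Randomly drop majority-class rows until the classes are balanced."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    keep = rng.choice(maj, size=len(mino), replace=False)
    idx = np.concatenate([keep, mino])
    return X[idx], y[idx]

def oversample(X, y, rng):
    """Resample minority-class rows with replacement up to the majority size."""
    maj, mino = np.flatnonzero(y == 0), np.flatnonzero(y == 1)
    extra = rng.choice(mino, size=len(maj), replace=True)
    idx = np.concatenate([maj, extra])
    return X[idx], y[idx]

# Toy imbalanced data: 95 legitimate rows, 5 fraud rows
X = rng.normal(size=(100, 2))
y = np.array([0] * 95 + [1] * 5)

Xu, yu = undersample(X, y, rng)
Xo, yo = oversample(X, y, rng)
print(yu.mean(), yo.mean())  # both resampled sets are balanced at 0.5
```

"Both" would simply apply the two functions with intermediate target sizes.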
Models

Simple logistic regression with linear boundary:
\( h_\theta(x) = \frac{1}{1 + e^{-\theta^T x}} \), with log-likelihood
\( \ell(\theta) = \sum_{i=1}^{m} \left[ y^{(i)} \log h(x^{(i)}) + (1 - y^{(i)}) \log\big(1 - h(x^{(i)})\big) \right] \)
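The hypothesis and log-likelihood above translate directly into NumPy (a minimal sketch with made-up toy inputs; the function names are mine):

```python
import numpy as np

def h(theta, X):
    """Sigmoid hypothesis h_theta(x) = 1 / (1 + exp(-theta^T x))."""
    return 1.0 / (1.0 + np.exp(-X @ theta))

def log_likelihood(theta, X, y):
    """ell(theta) = sum_i [ y_i log h(x_i) + (1 - y_i) log(1 - h(x_i)) ]."""
    p = h(theta, X)
    return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

theta = np.zeros(2)
X = np.array([[1.0, 2.0], [1.0, -1.0]])  # first column is the intercept
y = np.array([1.0, 0.0])
print(h(theta, X))                  # [0.5, 0.5] at theta = 0
print(log_likelihood(theta, X, y))  # 2 * log(0.5) ≈ -1.386
```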
Logistic regression with quadratic boundary, LASSO: the \( L_1 \) penalty \( \lambda \|\theta\|_1 \) reduces variance by deleting some quadratic terms.
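One way to realize this model is scikit-learn's pipeline of degree-2 polynomial features plus L1-penalized logistic regression; this is an assumed stand-in on toy data (the poster does not specify the implementation), and note that scikit-learn parameterizes the penalty as C = 1/λ, so a small C means strong shrinkage:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures

# Toy imbalanced data standing in for the fraud set
X, y = make_classification(n_samples=500, n_features=5, weights=[0.9],
                           random_state=0)

# Quadratic boundary via degree-2 features; the L1 (LASSO) penalty can zero
# out some of the quadratic terms, reducing variance.
model = make_pipeline(
    PolynomialFeatures(degree=2, include_bias=False),
    LogisticRegression(penalty="l1", solver="liblinear", C=0.1),
)
model.fit(X, y)

coefs = model.named_steps["logisticregression"].coef_.ravel()
print(f"{np.sum(coefs == 0)} of {coefs.size} coefficients driven to zero")
```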
Random forest: averages many trees grown on bootstrap samples:
\( \hat{f}_{\text{avg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat{f}^{*b}(x) \)
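The averaging formula can be checked directly with scikit-learn's RandomForestClassifier (an assumed stand-in on toy data): the forest's probability estimate is defined as the mean of the per-tree estimates.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Toy imbalanced data standing in for the fraud set
X, y = make_classification(n_samples=1000, n_features=10, weights=[0.95],
                           random_state=0)

# B = 200 trees, each grown on a bootstrap sample of the training data
rf = RandomForestClassifier(n_estimators=200, random_state=0)
rf.fit(X, y)

# The forest's probability estimate equals the average over its trees,
# matching f_avg(x) = (1/B) * sum_b f*b(x)
avg = np.mean([tree.predict_proba(X) for tree in rf.estimators_], axis=0)
print(np.allclose(avg, rf.predict_proba(X)))  # True
```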
Neural network: 1 hidden layer, fully connected, sigmoid activation.
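A network of this shape can be sketched with scikit-learn's MLPClassifier (an assumed stand-in on toy data; the hidden width of 16 is my choice, not from the poster), where activation="logistic" gives the sigmoid hidden layer:

```python
from sklearn.datasets import make_classification
from sklearn.neural_network import MLPClassifier

X, y = make_classification(n_samples=500, n_features=10, random_state=0)

# One fully connected hidden layer with sigmoid ("logistic") activation
nn = MLPClassifier(hidden_layer_sizes=(16,), activation="logistic",
                   max_iter=2000, random_state=0)
nn.fit(X, y)
print(nn.score(X, y))  # training accuracy
```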
Implementation

Dataset partition: 2/3 train, 1/6 validation, 1/6 test

Tournament-style procedure:
• Stage 1: Train variants of each model using different sampling strategies
• Stage 2: The best-performing model in each category on the validation set is refit on the combined train/validation set and assessed against the test set
• Primary performance metric: AUPRC, due to class imbalance
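The choice of AUPRC can be motivated with a small scikit-learn example (toy labels, my construction): under heavy imbalance, accuracy is dominated by the majority class, while average precision (the usual AUPRC estimator) tracks ranking quality on the rare class.

```python
import numpy as np
from sklearn.metrics import average_precision_score

rng = np.random.default_rng(0)
y_true = np.array([0] * 990 + [1] * 10)   # 1% positive class

# Always predicting "not fraud" scores 99% accuracy here, which is why
# accuracy is misleading; AUPRC exposes the difference between classifiers.
random_ap = average_precision_score(y_true, rng.random(1000))
perfect_ap = average_precision_score(y_true, y_true.astype(float))
print(random_ap, perfect_ap)  # random scores sit near the 1% base rate; perfect ranking gives 1.0
```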
Results: Stage 1

AUPRC on the validation set; the category winner (highest AUPRC) advanced to Stage 2:

Simple logistic: linear
  No sampling      0.6973303
  Undersampling    0.6813981
  Oversampling     0.7267684
  Both             0.7061640
  ROSE             0.7301219  ← winner

Logistic: quadratic, LASSO
  Undersampling    0.6915803
  Oversampling     0.7935913  ← winner
  Both             0.7036836
  ROSE             0.5475979
  (No sampling did not converge)

Random forest
  No sampling      0.8448476
  Undersampling    0.8177367
  Oversampling     0.8504548  ← winner
  Both             0.8381937
  ROSE             0.7703227

Neural network
  Undersampling    0.6919809
  Oversampling     0.2757658
  Both             0.2979033
  ROSE             0.7118781  ← winner
  Weighted option  0.2266319
  (No sampling did not converge)

[Figures: cross-validation for the sampling proportion (simple logistic regression); LASSO results for the quadratic logistic regression (undersampling); random-forest error rate vs. number of trees (oversampling)]

Results: Stage 2

Finalist performance on the test set (decision boundary at probability > 1/2):

                         AUROC    AUPRC    Accuracy  Sensitivity  Specificity  F1
Linear logistic (ROSE)   0.98368  0.83476  0.9995    0.797619     0.999831     0.842673
Quad logistic (Over)     0.98398  0.88409  0.9993    0.734043     0.999873     0.816568
Random forest (Over)     0.98396  0.90957  0.9997    0.929577     0.999810     0.904110
Neural net (ROSE)        0.98280  0.73169  0.9984    0.831169     0.999768     0.842105
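The per-class metrics can be reproduced from the random-forest confusion matrix (TN = 47,438, FN = 5, FP = 9, TP = 66), a useful check that the table and the matrices agree:

```python
# Random forest (oversampling) confusion matrix on the test set
tn, fn = 47_438, 5   # predicted 0: true negatives, false negatives
fp, tp = 9, 66       # predicted 1: false positives, true positives

sensitivity = tp / (tp + fn)   # recall on the fraud class
specificity = tn / (tn + fp)
precision = tp / (tp + fp)
f1 = 2 * precision * sensitivity / (precision + sensitivity)

print(f"sensitivity={sensitivity:.6f} specificity={specificity:.6f} f1={f1:.6f}")
```

These match the table's 0.929577, 0.999810, and 0.904110 for the random forest.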
• Thank you to the CS229 teaching staff!
Confusion matrices on the test set (rows = predicted, columns = truth):

Linear logistic (ROSE)
          Truth 0   Truth 1
Pred 0      47426        17
Pred 1          8        67

Quadratic logistic (Over)
          Truth 0   Truth 1
Pred 0      47418        25
Pred 1          6        69

Random forest (Over)
          Truth 0   Truth 1
Pred 0      47438         5
Pred 1          9        66

Neural net (ROSE)
          Truth 0   Truth 1
Pred 0      47430        13
Pred 1         11        64