Download - A SURVEY OF RANDOM FOREST USAGE FOR FRAUD DETECTION …€¦ · USAGE FOR FRAUD DETECTION AT LLOYDS BANKING GROUP Adam Langron AUGUST 2015. CONTENTS Background Current Account Fraud

A SURVEY OF RANDOM FOREST USAGE FOR FRAUD DETECTION

AT LLOYDS BANKING GROUP

Adam Langron

AUGUST 2015

CONTENTS

Background

Current Account Fraud

Insider Fraud

Mortgage Broker Risk

Conclusions

2

BACKGROUND

3

BACKGROUND

4

2012 Masters project with Xin Huang suggested that random forests would provide improved discrimination over traditional logistic regression models.

This formed the basis of 2013 CRC conference paper by Kevin Barrett.

RANDOM FORESTSBRIEF INTRODUCTION

5

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Bootstrap Sample

Split 1a

Split 2a Split 2b

Split 1b

Hundreds of trees – a forest!

These leaves at the bottom of the trees ‘vote’

If > 50% of the training sample were good at this leaf, the vote is ‘good’

If > 50% of the training sample were bad at this leaf, the vote is ‘bad’

Each new observation will fall to one final leaf node in each tree

Votes are counted and a predicted outcome assigned

This is a ratio that can be interpreted as a score

RANDOM FORESTSPROS AND CONS

PROS CONS

IMPLEMENTATION

‘GREY BOX’

INTERPRETABILITY

MONITORING

OPTIMAL?

DISCOVERS INTERACTIONS

RAPID - NO BINNING

INCORPORATES MANY PREDICTORS

6

CURRENT ACCOUNT FRAUD

FIRST PARTY FRAUD DETECTION USING PERSONAL

CURRENT ACCOUNT TRANSATIONS

7

-7-6-5-4-3-2-10123456789

101112

01/08/2012

31/08/2012

30/09/2012

30/10/2012

29/11/2012

29/12/2012

28/01/2013

27/02/2013

29/03/2013

28/04/2013

28/05/2013

27/06/2013

27/07/2013

26/08/2013

25/09/2013

25/10/2013

24/11/2013

Bal

ance

/ Tr

ansa

ctio

n Va

lue

(£ T

hous

ands

)

Other Non-MonFalcon FlagsAccount FeatureIndicatorBalance EnquiryCredit Limit

CURRENT ACCOUNTS – THE PROBLEMWhat is the best way to use the vast amount of data that the bank holds to detect fraud and refer accounts at the right time to allow prevention?

8

Fraudulent Cheque Credit

Online Spend Charge-Off

Atypical Behaviour

Cheque Reversal

A Cheque Fraud Case Study

CHARACTERISTICSThe richness of transactional data allows over 100,000 possible predictors to be created

•Visa Debit authorisations•Accepts/Declines•Grouped by Merchant Category

Code

Point of Sale

•Branch transactions•Deposits and withdrawals

Counter

•Non–counter payments and transfers

• Faster Payments, DD•All channels

Transfers

•Non-monetary events•Balance enquiries•Account status checks•Grouped by channel

Enquiries

9

•Days from obs point•0 -10 days

Days

•Weeks from obs point•0 - 6 weeks

Weeks

•Volume•Value•Accept/Decline•Excess•Utilisation

Measure

Combining transaction type with time interval and measure yields a multitude of predictors

Using Merchant Category Code and Money Manger within the relevant transaction types enabled even more granularity

Infrastructure necessitated the reduction of these characteristics Information Value macros cannot cope with this number of chars Chars populated very sparsely were removed Chars with suspiciously high bad rates were removed if there

was evidence of fraud investigation activity Information Value and was used to select final candidate chars

X X

MODEL PERFORMANCEModel performance on the validation sample is strong

10

Performance is shown for the development validation data set Discrimination is very high: Gini is in excess of 80% Validation has been performed on three out of sample holdouts and one out of time

holdout and shows stable performance The model discriminates well for all three fraud categories: First Party Definitions, EUCs

and Mules Investigator feedback from a small sample verified that the model ranks effectively and

finds a wide variety of fraud typologies

0102030405060708090

100

0 10 20 30 40 50 60 70 80 90 100

Cum

ulat

ive

% B

ad

Cumulative % Good

Validation Sample Gini

0102030405060708090

100

0.00.10.20.30.40.50.60.70.80.91.0

Cum

ulat

ive

Pop

%

Trinity Score

Cumulative Score Distribution

BadDefsEUCMuleGood

The model was initially piloted in an Operational environment To confirm referrals were generated at the ‘right’ point That volumes were manageable That Trinity could ‘add value’ to the existing strategy suite

Pilot results were very positive c60% of reviewed cases were high risk c25% of the high risk referrals were unique to Trinity The model identified a good mix of fraud types

The model now implemented with fraud operations Scores are refreshed daily Trinity scores are crossed with a fraud propensity model to drive

referrals

IMPLEMENTATIONAccounts now being referred to Fraud Operations

11

StrategyMatrix of Trinity & Fraud Score

Low Risk Med or High Risk

Very High Risk

0 - 800 Not Referred

800 - 900

900 - 1000 Referred

Trin

ity S

core

High Risk Fraud Score

Credit Manipulation Bust-Out

Runaway Spend Refund Fraud

Fraud Typologies Identified

INSIDER FRAUD

DETECTING INSIDER FRAUD USING COLLEAGUE

INTERACTIONS

12

Working Pattern

1

Location

Branch / Telephone

Days worked

Insert

2

Account notifications

Transaction earmarks

Enquire

3

Standing order

Balance

Financial

4

Inter account transfers

Cheques

Counter withdrawals

Amend

5

Change of address

Change of telephone

number

MODEL DRIVERSThe model uses hundreds of event types as predictors, these are driven by colleagues and related to system log on identifiers

13

TransCategory

Rank

TransType

MODEL PERFORMANCEModel performance on the validation sample is strong

0%10%20%30%40%50%60%70%80%90%

100%

0% 20% 40% 60% 80% 100%

% o

f Bad

s

% of Goods

14

Performance is shown for the development validation data set Discrimination is very high: Gini c70% The model discriminates particularly well at the top of the distribution Initial use case will be to exclude all staff members from investigation where the model

shows sufficiently low risk

MORTGAGE BROKER RISK

ASSESSING THE RISK OF MORTGAGE BROKERS USING

APPLICATION MIX

15

Model developed to assess risk posed by Mortgage Broker Panel Over 55,000 characteristics assessed relating to application mix Random Forest and logistic regression models developed RF model quickly deteriorates in out of time sample

MORTGAGE BROKER RISKUNABLE TO VALIDATE RESULTS OUT OF TIME

16

Out of Time Sample

0%

20%

40%

60%

80%

100%

Jan Feb Mar Apr May Jun Jul Aug Sep Oct Nov

GIN

I

Random ForestLogistic Regression

Desire to investigate further. Is the algorithm overfitting to the development sample? Lack of resource to investigate at this time

CONCLUSION

17

Positive Results• Very high discrimination• Relatively quick to build• Many predictors used to assess risk• Hard for fraudsters to game

CONCLUSIONRandom Forests can add value to fraud prevention

18

Ongoing Challenges• Validation issues. Are the models more likely to over fit to the development period?• Implementation challenges. Currently no decisioning platforms for deployment• Monitoring. What to monitor on monthly / quarterly basis?• Desire to develop best practice for development and ongoing monitoring and maintenance

Download - A SURVEY OF RANDOM FOREST USAGE FOR FRAUD DETECTION …€¦ · USAGE FOR FRAUD DETECTION AT LLOYDS BANKING GROUP Adam Langron AUGUST 2015. CONTENTS Background Current Account Fraud

Top Related