neometrics client logo a mathematical model for fraud prediction and control planning of electrical...

33
neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università degli Studi di Firenze -Herberth Espinoza (Peru). Universidad Complutense de Madrid -Antoine Levitt (France). University of Oxford -María Loreto Luque (Spain). Universidad Complutense de Madrid -Carlos Parra (Spain). Universidad Complutense de Madrid -Aranzazu Pérez (Spain). Universidad Complutense de Madrid -Elisa Pérez (Spain) Universidad Complutense de Madrid Coordinators: -Dr. Benjamin Ivorra (Universidad Complutense de Madrid) -Dr. Juan Tejada (Universidad Complutense de Madrid) -D. Jorge Juan Suerias (Neo Metrics) -D. Fernando Fernández (Neo Metrics) -Pilar Gómez (Neo Metrics)

Post on 21-Dec-2015

212 views

Category:

Documents


0 download

TRANSCRIPT

Page 1: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

A mathematical model for fraud prediction and control planning of electrical company

clientsJune 2010

-Camilla Bruni (Italy). Università degli Studi di Firenze-Herberth Espinoza (Peru). Universidad Complutense de Madrid

-Antoine Levitt (France). University of Oxford-María Loreto Luque (Spain). Universidad Complutense de Madrid

-Carlos Parra (Spain). Universidad Complutense de Madrid-Aranzazu Pérez (Spain). Universidad Complutense de Madrid

-Elisa Pérez (Spain) Universidad Complutense de Madrid

Coordinators:-Dr. Benjamin Ivorra (Universidad Complutense de Madrid)

-Dr. Juan Tejada (Universidad Complutense de Madrid) -D. Jorge Juan Suerias (Neo Metrics)

-D. Fernando Fernández (Neo Metrics) -Pilar Gómez (Neo Metrics)

Page 2: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

Outline

• Problem Description

• Data Analysis

• Mathematical Modeling

• Numerical Validation

• Conclusions and Perspectives

Page 3: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

I – PROBLEM DESCRIPTION

Page 4: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

4neometrics

Context

• An electrical company keeps a crew of inspectors in Chile to check whether customers are manipulating their electrical meters.

• Each check has a cost and the company wishes to identify the customers with higher risk of fraud in order to maximize benefits in a possible control campaign.

Page 5: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

5neometrics

Objectives of the work

• Currently, the company policy to check randomly on customers detects 6.6% of fraud.

• Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we have: Data exploration and treatment using SAS. Fit a logistic regression model using SAS and calculate

the lift-chart/ROC model performance indicators. According to the model find the optimal control

campaign, using Matlab, and perform a validation of the results.

Page 6: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

II – DATA ANALYSIS

Page 7: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

7neometrics

DATA ANALYSIS

What does Neometrics supply us?

We received from Neometrics the database, train.csv Database characteristics:

– 79,459 records – 49 variables– 30MB

Page 8: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

8neometrics

DATA DESCRIPTION

We have 49 explanatory variables which can be divided in several groups.

We consider some variables referring to each customers characteristic (economical, geographical, technical…)

We select and simplify the more representing variables. To do so:

– We categorize non-linear continuous variables according to fraud proportion in population.

– We analyze the groups that have both categorical and quantitative variables in discrimination techniques (Binary tree).

– Groups containing only quantitative variables apply principal components (PCA) to see these correlations.

Page 9: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

9neometrics

Variable simplification

In order to identify the non-linear continuous variables and representative values, we calculate:

• Population proportion on each group:

• Fraud proportion on each group:

Number Of People In Each Group

Total Number Of People

Number Of Fraudulent People In Each Group

Number Of People In Each Group

Page 10: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

10neometrics

Variable simplification

Page 11: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

11neometrics

Variable simplification

Page 12: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

12neometrics

Variable simplification

Page 13: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

13neometrics

Variable simplification

Page 14: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

14neometrics

VARIABLES SELECTION

The principal component analysis (PCA) is applied to the Debt and Payment group variable.

The aim of PCA is to reduce the size of the observed variables for each individual, keeping the greatest variability.

PCA is based on the spectral analysis of the correlation matrix.

This process was performed using SAS.

Page 15: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

15neometrics

SELECTION OF VARIABLES

We start with the principal component analysis for groups of variables: Calculated variables of Debt and Payment.

The most relevant variables are: deuda_ult_mean, max_deuda, dif_ult_mean_deuda mean_dif_deuda.

After finishing the procedure, we select three variables, accounting for 90% of the information.

Repeating the process

Page 16: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

16neometrics

Study Of Significance

• We realize a study on the most significant variables that we might include in our model. For example: Geographic Variables, Conexion Identifiers, Customer Caracteristic Groups.

• proc logistic; selection = stepwise

• Pearson Correlation Coeficient

Page 17: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

17neometrics

Study Of Multicolinearity

• VIF = variance inflaction

• The VIF represents an increase in the variance due to presence of multicollinearity.• VIF take values from a minimum of 1 when there is no degree of multicollinearity

Other procedures: Discriminal Analysis (DISCRIM)

• max_deuda

Page 18: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

18neometrics

Classification trees

To obtain a good predictive model

Decision support models that can be applied in the identification of fraudsters

target ResultadoCd_SectorNm_zonaCODIGOAMPERIOS

Customers caracteristic num_cortesCalculated variables of debt max_deuda

cat_max_pagocat_pago_ult_meancat_dif_ult_mean_pagocat_ult_pago

Geographic variables

Conexion identifiers

Calculated variables of payments (categorize non-linear continuous )

Page 19: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

• III – MATHEMATICAL MODELING

Page 20: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

20neometrics

Regression

Predict a random variable with a set of explanatory random variables

In our case, predict fraud probability using client information Well-studied problem, numerous applications (spam filter,

image classification, insurance policy...) Simplest idea: linear regression

iX

1 2( , ),Y f X X

TY a b X

Y

Page 21: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

21neometrics

Logistic regression

is a probability : . Linear regression does not respect the constraint.

Map result of linear regression to a probability with logistic curve

We calibrate model parameters on the training set (find optimal a and b), and test on the validation set.

Use an optimisation procedure to fit model

Other possibilities: add terms in the model. Interactions ( etc.), nonlinearities ( ), etc.

Y 0 1Y TY a b X

logisti )c( TY Xa b

,i j i j kx x x x x2ix

Page 22: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

22neometrics

Model evaluation

In order to validate our model:

We compare estimated fraud probabilities to measured outcomes on the validation set.

Use different metrics: ROC curve, lift-chart, etc..

Use the results to establish optimal control policy

Page 23: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

23neometrics

ROC curve

X-axis: false positives, Y-axis : true positives For a given allowed false positive rate (specificity), determine success rate

(sensitivity) Perfect model: Y = 1, random model, Y = X, bad model, Y = 0 Goal: get above Y = X Area under curve good indication of quality of model: c-value

Page 24: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

24neometrics

LIFT CHART

We sort the client according to their decreasing fraud probabilities.

The X-axis represent the percentage of the population according to the previous arrangement.

The Y-axis represent a rate calculated as:

Rate (α) = Number of fraud clients in the modelint he first α% of the population

(Number of fraudulent clients/ population size ) * α% of the population

Page 25: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

• IV – NUMERICAL VALIDATION

Page 26: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

26neometrics

Results

• Use only a subset of available parameters• Zone, type of line, past fraud history, past payment history• Categorise continuous variables• Final model: 8 categorical variables, with or without pairwise

interactions• Model trained on 40k clients, validated on 40k others.

Page 27: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

27neometrics

LIFT-Chart graph and ROC values

Valid

Train

Pairwise interactions (ROC=0.737)No interactions (ROC=0.698)

Page 28: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

28neometrics

Cost-benefit analysis – Best model

TrainValidation

Page 29: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

29neometrics

Cost-benefit analysis – Extreme models

Best model Random model

Page 30: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

• V – CONCLUSIONS AND PERSPECTIVES

Page 31: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

Summary

• Data analysis to isolate interesting variables

• Logistic regression to predict fraud probabilities

• Evaluation of the model (ROC, lift)

• Use in a cost-benefit analysis

• Concrete results of use to the client

Page 32: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

neometrics Client logo

Future Work

• Automatic data analysis (binary trees, principal components analysis, etc.)

• Carefully chose variables to include• Other types of prediction (neural networks,

decision trees, etc.)• More complex optimization processes

Page 33: Neometrics Client logo A mathematical model for fraud prediction and control planning of electrical company clients June 2010 -Camilla Bruni (Italy). Università

33neometrics

THANK YOU