neometrics client logo a mathematical model for fraud prediction and control planning of electrical...

neometrics Client logo

A mathematical model for fraud prediction and control planning of electrical company

clientsJune 2010

-Camilla Bruni (Italy). Università degli Studi di Firenze-Herberth Espinoza (Peru). Universidad Complutense de Madrid

-Antoine Levitt (France). University of Oxford-María Loreto Luque (Spain). Universidad Complutense de Madrid

-Carlos Parra (Spain). Universidad Complutense de Madrid-Aranzazu Pérez (Spain). Universidad Complutense de Madrid

-Elisa Pérez (Spain) Universidad Complutense de Madrid

Coordinators:-Dr. Benjamin Ivorra (Universidad Complutense de Madrid)

-Dr. Juan Tejada (Universidad Complutense de Madrid) -D. Jorge Juan Suerias (Neo Metrics)

-D. Fernando Fernández (Neo Metrics) -Pilar Gómez (Neo Metrics)


Outline

• Problem Description

• Data Analysis

• Mathematical Modeling

• Numerical Validation

• Conclusions and Perspectives


I – PROBLEM DESCRIPTION

4neometrics

Context

• An electrical company keeps a crew of inspectors in Chile to check whether customers are manipulating their electrical meters.

• Each check has a cost and the company wishes to identify the customers with higher risk of fraud in order to maximize benefits in a possible control campaign.

5neometrics

Objectives of the work

• Currently, the company policy to check randomly on customers detects 6.6% of fraud.

• Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we have: Data exploration and treatment using SAS. Fit a logistic regression model using SAS and calculate

the lift-chart/ROC model performance indicators. According to the model find the optimal control

campaign, using Matlab, and perform a validation of the results.


II – DATA ANALYSIS

7neometrics

DATA ANALYSIS

What does Neometrics supply us?

We received from Neometrics the database, train.csv Database characteristics:

– 79,459 records – 49 variables– 30MB

8neometrics

DATA DESCRIPTION

We have 49 explanatory variables which can be divided in several groups.

We consider some variables referring to each customers characteristic (economical, geographical, technical…)

We select and simplify the more representing variables. To do so:

– We categorize non-linear continuous variables according to fraud proportion in population.

– We analyze the groups that have both categorical and quantitative variables in discrimination techniques (Binary tree).

– Groups containing only quantitative variables apply principal components (PCA) to see these correlations.

9neometrics

Variable simplification

In order to identify the non-linear continuous variables and representative values, we calculate:

• Population proportion on each group:

• Fraud proportion on each group:

Number Of People In Each Group

Total Number Of People

Number Of Fraudulent People In Each Group

Number Of People In Each Group

10neometrics


11neometrics


12neometrics


13neometrics


14neometrics

VARIABLES SELECTION

The principal component analysis (PCA) is applied to the Debt and Payment group variable.

The aim of PCA is to reduce the size of the observed variables for each individual, keeping the greatest variability.

PCA is based on the spectral analysis of the correlation matrix.

This process was performed using SAS.

15neometrics

SELECTION OF VARIABLES

We start with the principal component analysis for groups of variables: Calculated variables of Debt and Payment.

The most relevant variables are: deuda_ult_mean, max_deuda, dif_ult_mean_deuda mean_dif_deuda.

After finishing the procedure, we select three variables, accounting for 90% of the information.

Repeating the process

16neometrics

Study Of Significance

• We realize a study on the most significant variables that we might include in our model. For example: Geographic Variables, Conexion Identifiers, Customer Caracteristic Groups.

• proc logistic; selection = stepwise

• Pearson Correlation Coeficient

17neometrics

Study Of Multicolinearity

• VIF = variance inflaction

• The VIF represents an increase in the variance due to presence of multicollinearity.• VIF take values from a minimum of 1 when there is no degree of multicollinearity

Other procedures: Discriminal Analysis (DISCRIM)

• max_deuda

18neometrics

Classification trees

To obtain a good predictive model

Decision support models that can be applied in the identification of fraudsters

target ResultadoCd_SectorNm_zonaCODIGOAMPERIOS

Customers caracteristic num_cortesCalculated variables of debt max_deuda

cat_max_pagocat_pago_ult_meancat_dif_ult_mean_pagocat_ult_pago

Geographic variables

Conexion identifiers

Calculated variables of payments (categorize non-linear continuous )


• III – MATHEMATICAL MODELING

20neometrics

Regression

Predict a random variable with a set of explanatory random variables

In our case, predict fraud probability using client information Well-studied problem, numerous applications (spam filter,

image classification, insurance policy...) Simplest idea: linear regression

iX

1 2( , ),Y f X X

TY a b X

Y

21neometrics

Logistic regression

is a probability : . Linear regression does not respect the constraint.

Map result of linear regression to a probability with logistic curve

We calibrate model parameters on the training set (find optimal a and b), and test on the validation set.

Use an optimisation procedure to fit model

Other possibilities: add terms in the model. Interactions ( etc.), nonlinearities ( ), etc.

Y 0 1Y TY a b X

logisti )c( TY Xa b

,i j i j kx x x x x2ix

22neometrics

Model evaluation

In order to validate our model:

We compare estimated fraud probabilities to measured outcomes on the validation set.

Use different metrics: ROC curve, lift-chart, etc..

Use the results to establish optimal control policy

23neometrics

ROC curve

X-axis: false positives, Y-axis : true positives For a given allowed false positive rate (specificity), determine success rate

(sensitivity) Perfect model: Y = 1, random model, Y = X, bad model, Y = 0 Goal: get above Y = X Area under curve good indication of quality of model: c-value

24neometrics

LIFT CHART

We sort the client according to their decreasing fraud probabilities.

The X-axis represent the percentage of the population according to the previous arrangement.

The Y-axis represent a rate calculated as:

Rate (α) = Number of fraud clients in the modelint he first α% of the population

(Number of fraudulent clients/ population size ) * α% of the population


• IV – NUMERICAL VALIDATION

26neometrics

Results

• Use only a subset of available parameters• Zone, type of line, past fraud history, past payment history• Categorise continuous variables• Final model: 8 categorical variables, with or without pairwise

interactions• Model trained on 40k clients, validated on 40k others.

27neometrics

LIFT-Chart graph and ROC values

Valid

Train

Pairwise interactions (ROC=0.737)No interactions (ROC=0.698)

28neometrics

Cost-benefit analysis – Best model

TrainValidation

29neometrics

Cost-benefit analysis – Extreme models

Best model Random model


• V – CONCLUSIONS AND PERSPECTIVES


Summary

• Data analysis to isolate interesting variables

• Logistic regression to predict fraud probabilities

• Evaluation of the model (ROC, lift)

• Use in a cost-benefit analysis

• Concrete results of use to the client


Future Work

• Automatic data analysis (binary trees, principal components analysis, etc.)

• Carefully chose variables to include• Other types of prediction (neural networks,

decision trees, etc.)• More complex optimization processes

33neometrics

THANK YOU

neometrics client logo a mathematical model for fraud prediction and control planning of electrical...

Documents

neometrics variables

neometrics data analysis

neometrics context

neometrics objectives

neometrics data description

data analysis slide

mb slide

quantitative variables