neometrics client logo a mathematical model for fraud prediction and control planning of electrical...
Post on 21-Dec-2015
212 views
TRANSCRIPT
neometrics Client logo
A mathematical model for fraud prediction and control planning of electrical company
clientsJune 2010
-Camilla Bruni (Italy). Università degli Studi di Firenze-Herberth Espinoza (Peru). Universidad Complutense de Madrid
-Antoine Levitt (France). University of Oxford-María Loreto Luque (Spain). Universidad Complutense de Madrid
-Carlos Parra (Spain). Universidad Complutense de Madrid-Aranzazu Pérez (Spain). Universidad Complutense de Madrid
-Elisa Pérez (Spain) Universidad Complutense de Madrid
Coordinators:-Dr. Benjamin Ivorra (Universidad Complutense de Madrid)
-Dr. Juan Tejada (Universidad Complutense de Madrid) -D. Jorge Juan Suerias (Neo Metrics)
-D. Fernando Fernández (Neo Metrics) -Pilar Gómez (Neo Metrics)
neometrics Client logo
Outline
• Problem Description
• Data Analysis
• Mathematical Modeling
• Numerical Validation
• Conclusions and Perspectives
neometrics Client logo
I – PROBLEM DESCRIPTION
4neometrics
Context
• An electrical company keeps a crew of inspectors in Chile to check whether customers are manipulating their electrical meters.
• Each check has a cost and the company wishes to identify the customers with higher risk of fraud in order to maximize benefits in a possible control campaign.
5neometrics
Objectives of the work
• Currently, the company policy to check randomly on customers detects 6.6% of fraud.
• Using the data set provided by the company, we created a model to design an improved control campaign. To do so, we have: Data exploration and treatment using SAS. Fit a logistic regression model using SAS and calculate
the lift-chart/ROC model performance indicators. According to the model find the optimal control
campaign, using Matlab, and perform a validation of the results.
neometrics Client logo
II – DATA ANALYSIS
7neometrics
DATA ANALYSIS
What does Neometrics supply us?
We received from Neometrics the database, train.csv Database characteristics:
– 79,459 records – 49 variables– 30MB
8neometrics
DATA DESCRIPTION
We have 49 explanatory variables which can be divided in several groups.
We consider some variables referring to each customers characteristic (economical, geographical, technical…)
We select and simplify the more representing variables. To do so:
– We categorize non-linear continuous variables according to fraud proportion in population.
– We analyze the groups that have both categorical and quantitative variables in discrimination techniques (Binary tree).
– Groups containing only quantitative variables apply principal components (PCA) to see these correlations.
9neometrics
Variable simplification
In order to identify the non-linear continuous variables and representative values, we calculate:
• Population proportion on each group:
• Fraud proportion on each group:
Number Of People In Each Group
Total Number Of People
Number Of Fraudulent People In Each Group
Number Of People In Each Group
10neometrics
Variable simplification
11neometrics
Variable simplification
12neometrics
Variable simplification
13neometrics
Variable simplification
14neometrics
VARIABLES SELECTION
The principal component analysis (PCA) is applied to the Debt and Payment group variable.
The aim of PCA is to reduce the size of the observed variables for each individual, keeping the greatest variability.
PCA is based on the spectral analysis of the correlation matrix.
This process was performed using SAS.
15neometrics
SELECTION OF VARIABLES
We start with the principal component analysis for groups of variables: Calculated variables of Debt and Payment.
The most relevant variables are: deuda_ult_mean, max_deuda, dif_ult_mean_deuda mean_dif_deuda.
After finishing the procedure, we select three variables, accounting for 90% of the information.
Repeating the process
16neometrics
Study Of Significance
• We realize a study on the most significant variables that we might include in our model. For example: Geographic Variables, Conexion Identifiers, Customer Caracteristic Groups.
• proc logistic; selection = stepwise
• Pearson Correlation Coeficient
17neometrics
Study Of Multicolinearity
• VIF = variance inflaction
• The VIF represents an increase in the variance due to presence of multicollinearity.• VIF take values from a minimum of 1 when there is no degree of multicollinearity
Other procedures: Discriminal Analysis (DISCRIM)
• max_deuda
18neometrics
Classification trees
To obtain a good predictive model
Decision support models that can be applied in the identification of fraudsters
target ResultadoCd_SectorNm_zonaCODIGOAMPERIOS
Customers caracteristic num_cortesCalculated variables of debt max_deuda
cat_max_pagocat_pago_ult_meancat_dif_ult_mean_pagocat_ult_pago
Geographic variables
Conexion identifiers
Calculated variables of payments (categorize non-linear continuous )
neometrics Client logo
• III – MATHEMATICAL MODELING
20neometrics
Regression
Predict a random variable with a set of explanatory random variables
In our case, predict fraud probability using client information Well-studied problem, numerous applications (spam filter,
image classification, insurance policy...) Simplest idea: linear regression
iX
1 2( , ),Y f X X
TY a b X
Y
21neometrics
Logistic regression
is a probability : . Linear regression does not respect the constraint.
Map result of linear regression to a probability with logistic curve
We calibrate model parameters on the training set (find optimal a and b), and test on the validation set.
Use an optimisation procedure to fit model
Other possibilities: add terms in the model. Interactions ( etc.), nonlinearities ( ), etc.
Y 0 1Y TY a b X
logisti )c( TY Xa b
,i j i j kx x x x x2ix
22neometrics
Model evaluation
In order to validate our model:
We compare estimated fraud probabilities to measured outcomes on the validation set.
Use different metrics: ROC curve, lift-chart, etc..
Use the results to establish optimal control policy
23neometrics
ROC curve
X-axis: false positives, Y-axis : true positives For a given allowed false positive rate (specificity), determine success rate
(sensitivity) Perfect model: Y = 1, random model, Y = X, bad model, Y = 0 Goal: get above Y = X Area under curve good indication of quality of model: c-value
24neometrics
LIFT CHART
We sort the client according to their decreasing fraud probabilities.
The X-axis represent the percentage of the population according to the previous arrangement.
The Y-axis represent a rate calculated as:
Rate (α) = Number of fraud clients in the modelint he first α% of the population
(Number of fraudulent clients/ population size ) * α% of the population
neometrics Client logo
• IV – NUMERICAL VALIDATION
26neometrics
Results
• Use only a subset of available parameters• Zone, type of line, past fraud history, past payment history• Categorise continuous variables• Final model: 8 categorical variables, with or without pairwise
interactions• Model trained on 40k clients, validated on 40k others.
27neometrics
LIFT-Chart graph and ROC values
Valid
Train
Pairwise interactions (ROC=0.737)No interactions (ROC=0.698)
28neometrics
Cost-benefit analysis – Best model
TrainValidation
29neometrics
Cost-benefit analysis – Extreme models
Best model Random model
neometrics Client logo
• V – CONCLUSIONS AND PERSPECTIVES
neometrics Client logo
Summary
• Data analysis to isolate interesting variables
• Logistic regression to predict fraud probabilities
• Evaluation of the model (ROC, lift)
• Use in a cost-benefit analysis
• Concrete results of use to the client
neometrics Client logo
Future Work
• Automatic data analysis (binary trees, principal components analysis, etc.)
• Carefully chose variables to include• Other types of prediction (neural networks,
decision trees, etc.)• More complex optimization processes
33neometrics
THANK YOU