Weighted Outlier Variable

Data Mining for Fraud Detection

COGNIZANT ENTERPRISE ANALYTICS WHITEPAPER

Introduction

The analysis of outliers is frequently used in the detection of potential fraud in both public and private industry. Private and public institutions attempt to detect fraudulent situations while maintaining large data sets. Specifically, the analysis of outliers is used in the detection of healthcare insurance fraud. An outlier is an observation that lies outside the overall pattern of a distribution in the data. Usually, the presence of an outlier indicates some sort of problem. Nevertheless, an outlier, in and of itself, is not an indication of potential fraud. For the outlier to be considered representative of fraudulent activity, the variable must take into consideration the purpose and means used by those committing the fraud.

The purpose of fraud is self-serving enrichment through unlawful, deceptive, or illegal practices. The means are overutilization of goods and services, or deceptive business practices. Identifying fraud and abuse therefore requires the creation of variables. To abstract valid outliers of fraud, a variable formula needs to be constructed that combines the factors of finance and the means of perpetrating the fraud.

A significant outlier variable analyzes the relationship between the purpose and means used in fraud to create meaningful variables. The supposition of this formula is to create variables that maximize the differences in independent variables, while simultaneously minimizing the similarities in the data, in order to detect outliers that are indicators of potential fraud and abuse. This effect could be described as data mining for the purpose of “squeezing and pulling out” the potential fraud from the data set.

Fraud and Abuse

“According to the National Insurance Crime Bureau (NICB), a nonprofit organization supported by 1,000 property and casualty carriers, insurance fraud is perennially the second-most common white collar crime behind tax evasion, and costs the U.S. public roughly $30 billion in property and casualty claims alone.”

Healthcare fraud is estimated at over $100 billion annually. Securities fraud is estimated at over $15 billion per year. The Association of Certified Fraud Examiners put the annual cost of occupational fraud and abuse in the United States at $600 billion in 2002, up from $400 billion in 1996. In connection with cases involving securities, commodities, investment and advanced fee fraud schemes, conduct central to many corporate fraud investigations, federal prosecutors were awarded over $2.5 billion in fines, forfeitures and restitution from July 1, 2002 through March 31, 2003.

Fraud Variables

Detecting fraud through data mining is the proverbial problem of finding the “needle in a haystack.” How can relevant aberrant behavior be detected in a never-ending sea of data? The answer lies in the creation of a formula that combines the purpose and means of fraud, while sifting through the similarities and the differences in the data. If financial gain is the purpose of fraud, the effects that the means used to perpetrate a fraud have on such financial gain should give us variables that would detect fraud.

An analogy that can be used to contemplate the detection of potential fraud is a sand-filled balloon. The balloon represents the population under study. The sand represents the entities to be studied. The weighted outlier variable operates on the concept that if the balloon is squeezed by the correct variable, the entities coming out of the end of the balloon are going to represent those with a higher potential of fraud.

In statistics, the population data set can be analyzed to characterize the location and variability of a particular variable. The skewness is a measure of asymmetry in the data. Kurtosis is a measure of whether the distribution curve of the data is peaked or flat. For the purposes of creating an outlier variable that is a superior detector of potential fraud, the kurtosis signifies the degree of similarity of the variables in the data, and the skewness represents the differences.

Table 1. Kurtosis and Skewness (normal distribution: Kurtosis = 3, Skewness = 0)

For example, if we analyze the distinct number of patients within a group of medical providers and we want to know how similar some of the providers are to other providers in the number of patients, we would examine the kurtosis. Hence, the kurtosis would measure the extent of similarities in the data. On the other hand, the skewness would represent how far some providers are from the similarities in the data (i.e., distinct number of patients). Table 1 is a representation of the kurtosis and skewness of a normal distribution.
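As a rough illustration of the two measures, the sketch below computes skewness and kurtosis from the standardized third and fourth central moments. The function name `skewness_and_kurtosis` and the patient counts are hypothetical and are not taken from the whitepaper's data. The kurtosis here is the plain (non-excess) form, which equals 3 for a normal distribution as in Table 1; the negative kurtosis values that appear later in Table 6 suggest that table reports excess kurtosis (this value minus 3), though the source does not say so.

```python
def skewness_and_kurtosis(values):
    """Return (skewness, kurtosis) from population central moments.

    Kurtosis is the non-excess form: about 3 for normally distributed data.
    """
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # variance
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    m4 = sum((v - mean) ** 4 for v in values) / n  # fourth central moment
    return m3 / m2 ** 1.5, m4 / m2 ** 2


# Hypothetical distinct-patient counts for a group of providers.
similar_providers = [40, 45, 50, 55, 60]
with_outlier = similar_providers + [400]

skew_a, kurt_a = skewness_and_kurtosis(similar_providers)
skew_b, kurt_b = skewness_and_kurtosis(with_outlier)
print(f"similar providers: skewness={skew_a:.2f}, kurtosis={kurt_a:.2f}")
print(f"with one outlier:  skewness={skew_b:.2f}, kurtosis={kurt_b:.2f}")
```

Adding a single provider far from the rest raises both measures, which is the behavior the text relies on when it reads kurtosis as similarity and skewness as difference.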

There are different types of variables that can be used to conduct potential fraud analysis on a population. The variables can be classified as seed, complex, and weighted outlier variables. The seed variables are those that come directly from the data. For example, in the detection of healthcare fraud, the distinct number of claims (ICN) or the distinct number of first dates of service (FDOS) can be considered seed variables. Hence, these variables can be expressed as follows:

I = ICN, or distinct number of claims

F = FDOS, or distinct number of first dates of service

Complex variables refer to ratios or statistical formulas derived from the seed variables. For example, the number of distinct claims per distinct number of first dates of service (i.e., claims per day) is a complex variable, and can be expressed as:

$$\frac{\text{ICN}}{\text{FDOS}}$$

A complex variable can be the mean of a population. “The quantity commonly referred to as "the" mean of a set of values is the arithmetic mean

$$\bar{x} = \frac{1}{n}\sum_{i=1}^{n} x_i$$

also called the average.”

Another complex variable can be the standard deviation in a data set of a population. “The standard deviation σ of a probability distribution is defined as the square root of the variance σ². The square root of the sample variance of a set of N values is the sample standard deviation

$$s_N = \sqrt{\frac{1}{N}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$

The sample standard deviation distribution is slightly complicated, although it is a well-studied and well-understood, function. However, consistent with widespread inconsistent and ambiguous terminology, the square root of the bias-corrected variance is sometimes also known as the standard deviation

$$s_{N-1} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(x_i - \bar{x}\right)^2}$$

The standard deviation arises naturally in mathematical statistics through its definition in terms of the second central moment. However, a more natural but much less frequently encountered measure of average deviation from the mean that is used in descriptive statistics is the so-called mean deviation.”

Another complex variable can be the Z score of a variable in a population. The Z score measures the number of standard deviations an observation lies away from the mean. “The z-score associated with the ith observation of a random variable x is given by

$$z_i = \frac{x_i - \bar{x}}{\sigma}$$

where $\bar{x}$ is the mean and σ the standard deviation of all observations $x_1, \ldots, x_n$.”

The Weighted Outlier

The weighted outlier variable takes into consideration the relationship between the purpose and means of potential fraud to create an independent variable. In its simplest terms, it expresses how the means (M) affect the purposes (P) of fraud. It can be expressed as:

$$\frac{\text{Purposes}}{\text{Means}}$$

It combines seed and complex variables to create a new independent variable that maximizes the differences and minimizes the similarities in potential fraud schemes. This is done by simultaneously increasing the kurtosis and skewness in a distribution. Table 2 is a graphical representation of the kurtosis and skewness of a typical seed or complex variable.

Table 2. Seed or Complex Variable: Kurtosis and Skewness (Kurtosis = 3, Skewness = 0)
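The seed variables I and F, the claims-per-day ratio, and the z-score described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the claim rows, provider labels, and function names are hypothetical, and `statistics.pstdev` is used for the standard deviation with the 1/N divisor (`statistics.stdev` would give the bias-corrected 1/(N-1) form instead).

```python
from statistics import mean, pstdev

# Hypothetical claim rows: (provider, claim id "ICN", first date of service "FDOS").
claims = [
    ("P1", "c1", "2009-01-05"), ("P1", "c2", "2009-01-05"),
    ("P1", "c3", "2009-01-06"),
    ("P2", "c4", "2009-01-05"),
    ("P3", "c5", "2009-01-05"), ("P3", "c6", "2009-01-05"),
    ("P3", "c7", "2009-01-05"), ("P3", "c8", "2009-01-05"),
]


def seed_variables(rows):
    """Seed variables per provider: (distinct claims I, distinct service dates F)."""
    icn, fdos = {}, {}
    for provider, claim_id, service_date in rows:
        icn.setdefault(provider, set()).add(claim_id)
        fdos.setdefault(provider, set()).add(service_date)
    return {p: (len(icn[p]), len(fdos[p])) for p in icn}


def z_scores(values):
    """z_i = (x_i - mean) / sigma, with sigma the population standard deviation."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


seeds = seed_variables(claims)
# Complex variable: claims per day, I / F.
claims_per_day = {p: i / f for p, (i, f) in seeds.items()}
# Another complex variable: the z-score of claims per day across providers.
z = dict(zip(claims_per_day, z_scores(list(claims_per_day.values()))))

for provider in seeds:
    print(provider, seeds[provider], claims_per_day[provider], round(z[provider], 2))
```

Provider P3 bills four claims on a single day, so it has the highest claims-per-day ratio and the largest z-score of the three.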

Table 3 represents a comparison of the kurtosis and skewness of the weighted outlier and of the seed or complex variable. As the chart shows, the weighted outlier formula has the effect of increasing the kurtosis while simultaneously increasing the skewness, and in the process of “squeezing and pulling out,” the entity with a high potential for fraud is identified.

Table 3. Comparison of Weighted Outlier, and Seed or Complex Variable: Kurtosis and Skewness.

Seed or Complex:   Kurtosis = 3,   Skewness = 0
Weighted Outlier:  Kurtosis = 197, Skewness = 12
(Chart labels: Seed or Complex, Weighted Outlier, Fraud)

Medicare Fraud

An example of how the weighted outlier formula works in detecting fraud can be illustrated using healthcare data from Medicare. Medicare fraud varies in shape and form. Fraud schemes are difficult to detect given the number of variables involved: beneficiaries, providers, medical procedures, amounts paid to providers, days of services, number of services, and diagnoses are just some of the basic variables with millions of data points. The issue is how to create variables that assist in data mining the vast databases of Medicare claims information by separating the outliers in the data to detect potential fraud among providers.

Potential Fraud Schemes

The potential provider-based Medicare fraud schemes include, but are not limited to:

1. Services not rendered (SNR) to beneficiaries;
2. Unnecessary services (UNS) to beneficiaries;
3. Impossible days (IMD), or providers billing for more hours in a day than is probable;
4. Illegal self-referrals by a provider for unnecessary services to beneficiaries;
5. An illegal financial relationship (IFA) between the referring provider and the rendering provider; or
6. Sharing of beneficiaries between the referring and rendering providers.

The aforementioned fraud schemes are difficult to detect in a vast database, and they are even more difficult to detect if a provider uses a combination of fraud schemes to perpetrate and conceal Medicare fraud (see Table 4). Some of these schemes overlap with one another, and the challenge is to create variables that maximize the differences and minimize the similarities in the potential fraud schemes. The weighted outlier variable maximizes the differences in variables between providers providing similar services to beneficiaries and minimizes the similarities in the potential fraud schemes.

Table 4 – Interaction of Potential Fraud Schemes (diagram of overlapping schemes: unnecessary services (UNS), services not rendered (SNR), impossible days (IMD), illegal financial relationships (IFA), beneficiary sharing, and self-referrals)

The weighted outlier formula is expressed as:

[Equation giving X in terms of Z, A, and B; the typeset form is not recoverable from the source]

Z = The Z score of an independent variable which denotes the purposes of a fraud

A = Independent (seed) variable which denotes the means to perpetrate a fraud

B = Independent (seed) variable which denotes the means to perpetrate a fraud
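The idea behind a weighted outlier can be sketched as follows. This is a minimal sketch, not the author's exact formula: it assumes the simplest Purposes/Means reading, suggested by variable labels such as Z/Bene and Z/ICN in Table 6, in which the Z score of the provider paid amount (the purpose) is divided by a seed variable (the means). The function name `weighted_outlier` and all figures are hypothetical.

```python
from statistics import mean, pstdev


def weighted_outlier(purpose_values, means_values):
    """Assumed form: z-score of the purpose variable divided by the means variable.

    The means variable (e.g., distinct beneficiaries) must be non-zero.
    """
    mu, sigma = mean(purpose_values), pstdev(purpose_values)
    return [((p - mu) / sigma) / m for p, m in zip(purpose_values, means_values)]


# Hypothetical providers: paid amount (purpose) and distinct beneficiaries (means).
paid = [10_000, 12_000, 11_000, 50_000, 48_000]
benes = [100, 120, 110, 500, 20]

scores = weighted_outlier(paid, benes)
top = max(range(len(scores)), key=scores.__getitem__)
print("highest weighted outlier: provider index", top)
```

Under this assumed form, the provider paid 50,000 for 500 beneficiaries scores far lower than the provider paid 48,000 for 20 beneficiaries, even though their paid amounts, and therefore their z-scores, are nearly the same.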

Healthcare Variables

A combination of seed variables, basic ratios, and weighted outlier variables is essential to detect potential healthcare fraud. The seed variables are extracted directly from the data. The seed variables are the distinct count of the following: PIN; first day of service (FDOS); beneficiaries; diagnosis (Dx); medical procedures (CPT); place of service (POS); claims (ICN); Referring UPIN; as well as the provider paid amount and the total number of units. The distributions of seed variables, although useful in detecting healthcare fraud, are not by themselves necessarily indicative of fraud. One of the main reasons seed variables alone fail to predict fraudulent activity is that they can be affected by the size and diversity of a provider practice. For example, a provider with a large practice in an urban area might have a greater patient volume, and thus greater income, than providers practicing in rural areas. Hence, the skewness and kurtosis of seed variable distributions are not necessarily accurate indicators of potential fraud.

Another method of identifying potential fraud schemes includes analyzing the relationship, or ratio, between different variables. These variables are called complex variables. The complex variables used to detect fraud are ICN per beneficiary; ICN per FDOS; beneficiaries per FDOS; Beneficiaries to Referring UPIN; Medical Procedures to Referring UPIN; and Diagnoses to Referring UPIN. The variables listed are derived from the seed variables; they (1) are correlated; (2) sometimes are not normally distributed; and (3) based on experience, assist in detecting potential fraud. The mean, standard deviation, and Z scores are also examples of complex variables.

Variable       ICN/Bene      ICN/FDOS      Bene/FDOS     Bene/RefUPIN  CPT/RefUPIN   Dx/RefUPIN
ICN/Bene       1             0.212130786   -0.15694424   -0.02267558   0.319601875   0.049493142
ICN/FDOS       0.212130786   1             0.822651598   0.158268848   -0.2099538    -0.23760167
Bene/FDOS      -0.15694424   0.822651598   1             0.230730507   -0.32004858   -0.24604582
Bene/RefUPIN   -0.02267558   0.158268848   0.230730507   1             0.093916984   0.116711104
CPT/RefUPIN    0.319601875   -0.2099538    -0.32004858   0.093916984   1             0.791509512
Dx/RefUPIN     0.049493142   -0.23760167   -0.24604582   0.116711104   0.791509512   1

Table 5 – Complex Variables Correlation
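A correlation table of this kind can be produced with a pairwise Pearson correlation, sketched below. The whitepaper does not state which correlation coefficient or software was used, so Pearson is an assumption here, and the ratio values and function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def correlation_matrix(columns):
    """Nested dict of pairwise correlations, keyed by variable name."""
    names = list(columns)
    return {a: {b: pearson(columns[a], columns[b]) for b in names} for a in names}


# Hypothetical complex variables (ratios) for four providers.
ratios = {
    "ICN/Bene": [2.0, 1.1, 1.3, 1.6],
    "ICN/FDOS": [1.5, 1.0, 4.0, 2.0],
    "Bene/FDOS": [1.2, 1.0, 3.5, 1.8],
}

matrix = correlation_matrix(ratios)
for name, row in matrix.items():
    print(name, {other: round(value, 3) for other, value in row.items()})
```

The matrix is symmetric with ones on the diagonal, the same shape as Table 5.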

These complex variables are considered indications of fraud because they may signify the practice of one or more of the following potential fraud schemes:

a. Self-referrals by a provider for unnecessary services to beneficiaries; or
b. An illegal financial relationship between the referring provider and the rendering provider; or
c. Sharing of beneficiaries between the referring and rendering providers; or
d. Provider activity may qualify as an impossible days scenario; or
e. Provider may be billing for unnecessary services; or
f. Provider may be billing for services not rendered.

Table 6 shows the skewness and kurtosis of all the variables in our data set. It clearly shows, as expected (in order to detect potential fraud), an increase in the skewness and kurtosis between the seed and complex variables. For example, the “squeezing and pulling out” effect occurs when we measure claims per day (complex variable) vis-à-vis the number of claims (seed variable) or the number of days of service (seed variable). An important observation is that the increase in skewness and kurtosis is even more significant between the complex variables and the weighted outlier variables.

Table 6 – Skewness and Kurtosis of Seed, Complex and Weighted Outlier Variables

Category             Variable       Skewness     Kurtosis
Seed                 Units          2.174538     7.3879808
Seed                 Benes          1.9793898    5.1381806
Seed                 Diagnosis      0.8959197    -0.213163
Seed                 FDOS           0.3028637    -1.52939
Seed                 CPT            0.7194892    -0.791148
Seed                 ICN            1.8485883    4.2930173
Seed                 POS            0.3078353    -0.648475
Seed                 RefUPIN        2.1780747    5.987456
Ratios               ICN/Bene       3.7234978    18.172842
Ratios               ICN/FDOS       2.3585795    10.22756
Ratios               Bene/FDOS      2.5732777    10.931942
Ratios               Bene/RefUPIN   19.558957    396.93533
Ratios               CPT/RefUPIN    2.5717062    8.506685
Ratios               Dx/RefUPIN     2.7660013    10.059753
Dependent Variables  ProvPaid       5.4782115    56.995661
Dependent Variables  Z Score [1]    5.4735533    56.867998
Wt Outlier           Z/Bene         13.520514    195.3972
Wt Outlier           Z/Dx           12.929682    178.89963
Wt Outlier           Z/CPT          12.746727    188.7118
Wt Outlier           Z/ICN          7.5788937    90.62649
Wt Outlier           Z/BeneF        14.378765    251.57888
Wt Outlier           Z/ICNB         15.519089    281.08893

[1] This is the z score of the provider paid amount.

Table 7 shows a comparison of the mean skewness across the different categories of variables. This comparison shows a skewness increase of over 292% between the seed variables and the complex variables (ratios), and an increase of over 285% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in skewness of over 667% between the seed variables and the weighted outlier variables; an increase of over 228% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 233% between the Z score of the provider paid amount and the weighted outlier variables.

Table 8 shows a comparison of the mean kurtosis across the different categories of variables. This comparison shows a clear and substantial increase in kurtosis across the categories: an increase of over 890% between the seed variables and the complex variables (ratios), and an increase of over 691% between the seed variables and the Z score of the provider paid amount (complex variable). On the other hand, there was an increase in kurtosis of over 2,322% between the seed variables and the weighted outlier variables; an increase of over 260% between the complex variables (ratios) and the weighted outlier variables; and an increase of over 335% between the Z score of the provider paid amount and the weighted outlier variables.

Table 7. Skewness Comparison: Variables by Category

Table 8. Kurtosis Comparison by Mean

The increase in the skewness and kurtosis between the seed and complex variables may explain why complex variables incrementally augment the seed variables in the analysis for the detection of healthcare fraud. Likewise, the weighted outlier variables incrementally augment the seed and complex variables in the forecasting and detection of fraud.

Table 9 represents a comparison of the beneficiaries per day, the Z score of the provider paid amount, and the weighted outlier of the beneficiaries per day. It compares the mean of the three variables in the population with providers that score high or low in the different categories of variables.

Table 9. Variables Comparison Specialty X

A provider that has a low number of beneficiaries (or patients) and a low Z score also scores low in the weighted outlier variable. It is to be expected that a low number of beneficiaries correlates to a low Z score of the provider paid amount (i.e., lower number of beneficiaries = lesser amount paid to provider) as compared to the mean of all providers. Therefore, a relevant weighted outlier should minimize this similarity in the data.

A provider that has a high number of beneficiaries and a high Z score of the provider paid amount scores somewhat higher than the mean of all providers. This is to be expected, since a provider that has a higher number of beneficiaries than the mean of the population should be expected to be paid more than the population average (higher number of beneficiaries = higher amount paid to provider). Again, the weighted outlier rightly minimizes these similarities.

The usefulness of the weighted outlier is seen when a provider has a lower number of beneficiaries and a high Z score of the provider paid amount. In this scenario the suspected potential fraud comes to the top since, as compared to the mean of the population, a provider that has a low number of beneficiaries is not expected to have a higher provider paid amount (low number of beneficiaries ≠ higher provider paid amount). The weighted outlier correctly maximizes this difference.

Data Modeling

The purpose of data modeling in fraud detection is to develop an accurate model, or graphical representation, which has the potential to predict the likelihood of fraud among the entities within a population. Different techniques are used to model data, which include, but are not limited to: (1) classification and regression analysis (predicting a response variable); (2) clustering (grouping the rows by similarities); and (3) association (showing that the variables are related). The weighted outlier increases the data model's predictive function for fraud detection.

The rank, RSquare, and adjusted RSquare are examples of how weighted outlier variables make for a better fitting model. Table 10 shows a comparison of the rank of the provider by number of beneficiaries (bene), distinct number of first day of service (FDOS), beneficiaries per first day of service (bene/FDOS), Z score of provider paid amount, and weighted outlier. The table illustrates the ability of the weighted outlier to increase the rank of a provider who has the potential for fraud vis-à-vis seed and complex variables.

Table 10. Rank Comparison

Provider   Benes Rank   FDOS Rank   Benes/FDOS Rank   Zscore ProvPd Rank   Wt Outlier Rank
Alpha1     215          94          385               3                    1
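A rank comparison like the one in Table 10 can be sketched by ranking the same providers on each variable in descending order. The provider names other than Alpha1, the values, and the helper `rank_descending` are hypothetical; the sketch only shows how one provider can rank low on a seed-based variable and first on a weighted outlier.

```python
def rank_descending(scores):
    """Map each provider to its rank, where 1 is the highest value.

    Ties keep their original dictionary order because sorted() is stable.
    """
    ordered = sorted(scores, key=scores.get, reverse=True)
    return {provider: position for position, provider in enumerate(ordered, start=1)}


# Hypothetical values for three providers on two variables.
variables = {
    "Benes/FDOS": {"Alpha1": 2.0, "Beta2": 9.0, "Gamma3": 6.0},
    "Wt Outlier": {"Alpha1": 14.1, "Beta2": 0.4, "Gamma3": -0.2},
}

ranks = {name: rank_descending(values) for name, values in variables.items()}
for name, by_provider in ranks.items():
    print(name, "rank of Alpha1:", by_provider["Alpha1"])
```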

A regression analysis was performed of the seed variables; seed and complex variables; and seed, complex and weighted outlier variables respectively. Table 11 shows a comparative analysis of the three different regression models by RSquare and RSquare Adjusted. These results indicate that an increase occurs in the RSquare and Adjusted RSquare of the regression model when the weighted outliers are added to the model. The increase of over ten percent is significant to the predictive value of the model. Hence, it could be inferred that the addition of the weighted outlier variables (to the seed and complex variables) enhances the model, and makes it more efficient and accurate in the detection of potential fraud.

Table 11. Comparative Analysis of the RSquare and RSquare Adjusted, by model
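For reference, the two fit statistics compared in Table 11 can be computed as sketched below, using the standard textbook definitions. The whitepaper does not show its own computation, so the function names and sample values are assumptions for illustration only.

```python
def r_square(actual, predicted):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    mean_actual = sum(actual) / len(actual)
    ss_residual = sum((a - p) ** 2 for a, p in zip(actual, predicted))
    ss_total = sum((a - mean_actual) ** 2 for a in actual)
    return 1 - ss_residual / ss_total


def adjusted_r_square(r2, n_observations, n_predictors):
    """Penalize R-square for the number of predictors in the model."""
    return 1 - (1 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical actual and predicted provider paid amounts.
actual = [100.0, 150.0, 200.0, 250.0, 300.0]
predicted = [110.0, 140.0, 205.0, 240.0, 305.0]

r2 = r_square(actual, predicted)
print(round(r2, 4), round(adjusted_r_square(r2, len(actual), 2), 4))
```

Because the adjusted statistic penalizes each added predictor, an increase in it after adding weighted outlier variables indicates the new variables contribute more than their count alone would explain.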

Table 12. Regression Analysis of Seed Variables (Response: N Sum of Provider Paid Amt; Whole Model; Actual by Predicted Plot)

Table 13. Regression Analysis – Seed and Complex Variables

Table 14. Regression Analysis – Seed, Complex, and Weighted Outlier Variables

Conclusion

The weighted outlier variable formula achieves the “squeezing and pulling out” effect by minimizing the similarities and maximizing the differences in the data, simultaneously increasing the kurtosis and skewness vis-à-vis seed and complex variables. The weighted outlier is an independent variable which also has the potential for improving data modeling. The detection of fraud in private industry, as well as in government, can be improved through the utilization of weighted outlier variables.


© Copyright 2009, Cognizant. All rights reserved. No part of this document may be reproduced, stored in a retrieval system, transmitted in any form or by any means, electronic, mechanical, photocopying, recording, or

otherwise, without the express written permission from Cognizant. The information contained herein is subject to change without notice. All other trademarks mentioned herein are the property of their respective owners.

DWBI&PM Practice at Cognizant

The Data Warehousing, Business Intelligence & Performance Management Practice is a single-point Center of Excellence within Cognizant for designing and deploying full-fledged DWBI&PM solutions. With more than 5,610* consultants across the globe, Cognizant's award-winning DWBI&PM practice is at the forefront of partnering with leading companies around the world in architecting pragmatic, business-focused, enterprise-wide BI solutions. The practice has been recognized for its role in enabling BI excellence through prestigious industry awards, including three Computerworld BI Perspectives Best Practices Awards, the DM Review Innovative Solution Award, the TDWI Award, the Cognos Performance Leadership Award, the Cognos Excellence Award, and the Informatica Innovation Award.

For more information on Cognizant's DWBI&PM solutions, visit our website at http://www.cognizant.com

Note: * As of 30th Apr '09

About the Author

Alberto Roldan is a thought leader in Enterprise Analytics within the DWBI&PM Practice. He has over 20 years of experience designing analytics solutions for organizations with large and complex technology landscapes. He specializes in adapting proven analytics techniques and methods to real-world, data-intensive problems in neuroscience, medicine, physics and chemistry. He has degrees from the University of Michigan and the University of Puerto Rico.