#1 Agile Predictive Analytics Platform for Today’s Modern Analysts

RapidMiner Wisdom 2018 – New Orleans, LA, USA, October 12th, 2018

Ralf Klinkenberg, Founder & Head of Data Science Research, RapidMiner

rklinkenberg@rapidminer.com

www.RapidMiner.com

Fraud Detection and Prevention: Leveraging Machine Learning to Detect Fraud Patterns, Anomalies, and Unusual Behaviors

Creating Value from Big Data

1. Fraud – Areas & Types & Relevance

2. Machine Learning for Fraud Detection & Prevention

3. Credit Card Fraud Detection & Prevention

4. Healthcare Fraud Detection & Prevention

Fraud: Areas and Types of Fraud

• Credit Card Fraud

• Tax Fraud

– EU: Value Added Tax (VAT) Fraud in Transactions within Networks of Companies

– Income Tax Fraud / Corporate Tax Fraud

• Fraud in Supply Chains, Retail Networks, Purchase Departments, Procurement

• Insurance Fraud:

– Car Insurance (Faked Accidents)

– Fire Insurance

– Healthcare Insurance

Fraud: Healthcare Insurance Fraud

• Example: Medicaid/Medicare in the USA, one US state alone: 6 billion US$ budget per year => estimated 10-20% fraud & waste => roughly 1 billion US$ lost per year

• Fraudulent Patients (e.g. Drug Addicts/Dealers/Resellers)

• Fraudulent Doctors

• Fraudulent Pharmacies / Hospitals / Service Providers / Suppliers

• Individuals as well as Networks of Fraudsters

Fraud: Challenges for Fraud Detection

• Large Number of Potential Types and Areas of Fraud

• Intelligent and Constantly Improving Adversaries

• Changing Fraud Patterns and Types

• Large Amounts of Potentially Relevant Data

• Large Variety of Potentially Relevant Data Sources & Types

– Structured and Unstructured Data: Transactions, Time Series Data, Textual Data, Network Data, Entity Relations, etc.

• Limited Resources for Fraud Detection & Prevention

– Which cases to investigate (first / at all)?

– Prioritize & focus to maximize effectiveness & efficiency

Fraud: Known vs. Unknown Types of Fraud

• New instances of known types of fraud should be automatically identified:

=> use Machine Learning to automatically find patterns (in data from the past with known fraud cases)

=> deploy generated models to automatically identify new cases

• But what about new types of fraud?

Machine Learning for Fraud Detection

Predictive Analytics Transforms Insight into ACTION

• Descriptive: OBSERVE (What happened)

• Diagnostic: EXPLAIN (Why did it happen)

• Predictive: ANTICIPATE (What will happen)

• Prescriptive: ACT (Operationalize)

Metrics & Indicators for Fraud Risk

• Domain experts often know metrics that may be indicative of a high risk of fraud => incorporate into entity features

• Examples:

– Entity = Patient:

▪ Total Payments Received,

▪ Number of Prescriptions,

▪ Number of Doctors Visited,

▪ Number of Pills per Month, etc.

– Entity = Prescriber (e.g. Doctor):

▪ Total Payments Received, Number of Patients per Month, Amount per Patient, etc.

– Entity = Service Provider (e.g. Pharmacy, Hospital, etc.):

▪ Total Payments Received, Price per Unit, Price per Treatment of Type X, etc.
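The patient-level metrics listed above could be derived from raw claim records roughly as follows. This is a minimal sketch, not the method used in the talk: the record layout, the field names, and the sample values are all hypothetical.

```python
from collections import defaultdict

# Hypothetical claim records: (patient_id, doctor_id, amount_paid, pills)
claims = [
    ("P1", "D1", 120.0, 30),
    ("P1", "D2", 80.0, 60),
    ("P2", "D1", 40.0, 30),
    ("P1", "D3", 200.0, 90),
]


def patient_features(records):
    """Aggregate claim records into one feature row per patient."""
    acc = defaultdict(lambda: {"total": 0.0, "count": 0, "doctors": set(), "pills": 0})
    for patient, doctor, amount, pills in records:
        row = acc[patient]
        row["total"] += amount        # Total Payments Received
        row["count"] += 1             # Number of Prescriptions
        row["doctors"].add(doctor)    # distinct doctors, counted below
        row["pills"] += pills         # Number of Pills
    return {
        patient: {
            "total_payments": row["total"],
            "n_prescriptions": row["count"],
            "n_doctors": len(row["doctors"]),
            "n_pills": row["pills"],
        }
        for patient, row in acc.items()
    }


features = patient_features(claims)
```

The same aggregation pattern applies to prescribers and service providers by grouping on a different key.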

Comparison to Peer Groups

• Does a high value of “Total Amounts Prescribed” automatically mean the entity (e.g. doctor) is fraudulent?

• No, but a high total amount prescribed may indicate a high risk of fraud.

• Oncologists often need to prescribe expensive anti-cancer drugs

=> oncologists may have higher “Total Amounts Prescribed” than other types of doctors (specializations)

=> compare a doctor’s metric to the average value of his/her peers (and not to the average for all doctors) => ratio.
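The peer-group ratio described on this slide can be sketched as below. The doctor IDs, specializations, and amounts are invented for illustration, and the sketch assumes every peer group has a positive average.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical data: doctor_id -> (specialization, total amount prescribed)
doctors = {
    "D1": ("oncology", 900_000.0),
    "D2": ("oncology", 1_100_000.0),
    "D3": ("general", 100_000.0),
    "D4": ("general", 300_000.0),
}


def peer_ratios(entities):
    """Divide each entity's metric by the average of its own peer group."""
    by_group = defaultdict(list)
    for group, value in entities.values():
        by_group[group].append(value)
    peer_mean = {group: mean(values) for group, values in by_group.items()}
    return {
        entity: value / peer_mean[group]
        for entity, (group, value) in entities.items()
    }


ratios = peer_ratios(doctors)
```

In this toy data, the general practitioner D4 prescribes far less in absolute terms than either oncologist, yet has the highest ratio relative to peers, which is the effect the comparison is meant to expose.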

Leverage Fraud Risk Indicators

• Does a high value of “Total Payments Received” automatically mean the entity (e.g. doctor) is fraudulent?

• No, but a high total amount received may indicate a high risk of fraud.

• => Rank entities by value of key metrics => suspects

Combined Fraud Risk Indicators

• => Combine metrics (e.g. weighted sum): Fraud Risk Score

=> Rank entities by value of combined metric => suspects

• No machine learning yet, but an often used initial solution to rank and prioritize entities for review / audits / investigation

• => more effective & efficient use of resources (auditors)
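A weighted-sum Fraud Risk Score with ranking could look like the following sketch. The metric names, the weights, and the values are assumptions for illustration; in practice domain experts would choose them, and the metrics would first be brought onto comparable scales (e.g. as peer-group ratios).

```python
def fraud_risk_scores(metrics, weights):
    """Weighted sum of already comparable metrics, one score per entity."""
    return {
        entity: sum(weights[name] * value for name, value in row.items())
        for entity, row in metrics.items()
    }


def rank_suspects(scores, top_n):
    """Return the top_n entity IDs with the highest combined score."""
    return sorted(scores, key=scores.get, reverse=True)[:top_n]


# Hypothetical peer-group ratios per doctor
metrics = {
    "D1": {"payments_ratio": 0.9, "patients_ratio": 1.0},
    "D2": {"payments_ratio": 2.5, "patients_ratio": 1.2},
    "D3": {"payments_ratio": 1.1, "patients_ratio": 3.0},
}
weights = {"payments_ratio": 0.7, "patients_ratio": 0.3}

scores = fraud_risk_scores(metrics, weights)
suspects = rank_suspects(scores, top_n=2)
```

Limiting the list to `top_n` reflects the limited auditor capacity: only the highest-ranked entities are passed on for review.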

• Classification: Algorithms to predict classes (Fraud / No Fraud)

• Grouping: Group similar items together (Segmentation, Clustering, Item Sets, Association Rules, Sequence Analysis, Network Analysis)

• Anomaly Detection: Find outliers in your data (unusual behaviors)

• Regression: Algorithms to predict numbers (Fraud Risk Scores or Expected Values)

Automation

Optimization

Deployment

Feature Extraction & Selection

Unsupervised Learning

Supervised Learning

Machine Learning: Supervised vs. Unsupervised

• Supervised Machine Learning:

– Data from the past with known fraud and non-fraud cases (label);

– Machine Learning of Classification models or Association rules to find fraud patterns from the past and to automatically identify new instances of these fraud types in new data;

– Applicable to known fraud cases, patterns, and types.

• Unsupervised Machine Learning:

– Clustering (Segmentation): Grouping entities into clusters of similar entities (patients, doctors, service providers, etc.);

– Anomaly Detection / Outlier Detection: detect unusual behaviors;

– Both depend on selected attributes, normalization and/or weighting;

– Attribute Weighting can be used to incorporate domain knowledge and/or priorities;

– Allows finding previously unknown types of fraud.
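The dependence of clustering and outlier detection on normalization and attribute weighting can be illustrated with a small sketch. The z-score normalization and the weighted Euclidean distance are common choices assumed here, not something the slides prescribe; the feature rows are hypothetical.

```python
from math import sqrt
from statistics import mean, pstdev


def zscore_columns(rows):
    """Scale every attribute to zero mean and unit standard deviation."""
    columns = list(zip(*rows))
    # A constant column has standard deviation 0; fall back to 1.0 to avoid dividing by zero.
    stats = [(mean(col), pstdev(col) or 1.0) for col in columns]
    return [[(v - m) / s for v, (m, s) in zip(row, stats)] for row in rows]


def weighted_distance(a, b, weights):
    """Euclidean distance in which each attribute contributes according to its weight."""
    return sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


# Hypothetical entity rows: [total_payments, n_doctors]
rows = [[1000.0, 1.0], [2000.0, 2.0], [3000.0, 3.0]]
normalized = zscore_columns(rows)

# Domain knowledge as weights: here only total_payments is taken into account.
d = weighted_distance(normalized[0], normalized[2], [1.0, 0.0])
```

Without normalization the payment amounts would dominate every distance simply because of their larger numeric range.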

Fraud Detection and Prediction

Fraud Detection with Machine Learning

• Step 1: Finding Known Fraud Patterns by Embedding Domain Expert Knowledge: Fraud Risk Scoring & Ranking of Entities => From Random Checks to Systematic Automated Checks & Prioritization => Data Mining to Automate Fraud Detection

• Step 2: Identifying Known Fraud Patterns with Machine Learning and Automatically Detecting Them in the Future: Supervised Learning: Automated Classification, Risk Score Regression, Association Rule Generation

• Step 3: Identifying Previously Unknown Fraud Cases or Patterns: Unsupervised Learning, Anomaly Detection, Outlier Detection

• Step 4: Comparison with Expectations: Predict Volumes & Prices and Compare with Actual Medications

• Step 5: Adversarial Machine Learning / Text Analytics / Process Mining

• Step 6: (Semi-)Automated Audits (Auditors Remain in Control)
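Step 4 (comparison with expectations) could be sketched as follows: fit a simple model on reference data, predict the expected value for each entity, and look at the deviation of the actual value. The least-squares line, the choice of "number of patients" as predictor, and all figures are assumptions made for this illustration only.

```python
def fit_line(xs, ys):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def deviations(entities, slope, intercept):
    """Actual value minus expected value for each entity."""
    return {
        entity: actual - (slope * x + intercept)
        for entity, (x, actual) in entities.items()
    }


# Hypothetical reference data: patients per month vs. prescribed amount
slope, intercept = fit_line([10, 20, 30, 40], [1000.0, 2000.0, 3000.0, 4000.0])

# Hypothetical new observations: doctor_id -> (patients, actual amount)
observed = {"D1": (25, 2600.0), "D2": (25, 6000.0)}
gaps = deviations(observed, slope, intercept)
most_unexpected = max(gaps, key=gaps.get)
```

A large positive gap is not proof of fraud; like the risk scores of Step 1, it is a reason to prioritize the entity for review.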

Credit Card Fraud

Meta Data: Amount, Location, Receiver, TimeStamp, CardId

Three Methods to Combine

Card-Number (ID) Probability

Random, Unsupervised, (Semi-)Supervised

Challenge I

Being good at detecting known patterns vs. seeing the new and unknown

Challenge II

Detection Rate

Transforming transactional data (e.g., purchase/date) into a table (RapidMiner Example Set)

Data aggregation and enrichment

=> Creating a profile of the customer

Being good at detecting known patterns

Unsupervised

(Semi-) Supervised Learning

Detection rate is critical

Relatively few fraud cases compared to thousands of legit transactions

Now we have a profile, what do we do with it?

Rule based

Daily amount < 500€ p.d.

Supervised

Random Forest, SVM, …

What my customer profile should look like

Class balance: Relatively few fraud cases compared to thousands of legit transactions
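Why the detection rate matters more than plain accuracy under this class imbalance can be shown with a small calculation. The numbers are made up: 1,000 transactions with 10 fraud cases and a trivial "model" that labels everything as legitimate.

```python
def detection_metrics(y_true, y_pred):
    """Accuracy, detection rate (recall on fraud), and precision for 0/1 labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    return {
        "accuracy": (tp + tn) / len(y_true),
        "detection_rate": tp / (tp + fn) if tp + fn else 0.0,
        "precision": tp / (tp + fp) if tp + fp else 0.0,
    }


# Hypothetical data: 990 legit (0) and 10 fraudulent (1) transactions
y_true = [0] * 990 + [1] * 10
y_pred = [0] * 1000  # a "model" that never flags anything

result = detection_metrics(y_true, y_pred)
```

The trivial model reaches 99% accuracy while detecting no fraud at all, so models for this task are better judged by detection rate and precision.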

Local Outlier Factor (LOF)

Distance based algorithm for outlier detection

Incorporates the concept of local density

(similar to DBSCAN clustering)

Calculated scores are comparable

Source: https://en.wikipedia.org/wiki/Local_outlier_factor
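For readers who want to see the local-density idea in code, here is a compact pure-Python sketch of LOF that follows the standard definitions (k-distance, reachability distance, local reachability density). It makes simplifying assumptions: ties at the k-th distance are cut at exactly k neighbors, no more than k points are identical, and `math.dist` requires Python 3.8 or later. The sample points are hypothetical; a production setup would more likely rely on an existing implementation such as the one in RapidMiner.

```python
from math import dist


def lof_scores(points, k):
    """Local Outlier Factor for each point (simplified: exactly k neighbors)."""
    n = len(points)
    neighbors, k_distance = [], []
    for i in range(n):
        ranked = sorted(
            (dist(points[i], points[j]), j) for j in range(n) if j != i
        )[:k]
        neighbors.append([j for _, j in ranked])
        k_distance.append(ranked[-1][0])  # distance to the k-th nearest neighbor

    def reach_dist(i, j):
        # Reachability distance of point i with respect to neighbor j
        return max(k_distance[j], dist(points[i], points[j]))

    # Local reachability density: inverse of the mean reachability distance
    lrd = [
        1.0 / (sum(reach_dist(i, j) for j in neighbors[i]) / k)
        for i in range(n)
    ]
    # LOF: mean density of the neighbors relative to the point's own density
    return [sum(lrd[j] for j in neighbors[i]) / (k * lrd[i]) for i in range(n)]


# Four points forming a dense cluster and one point far away
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10)]
scores = lof_scores(points, k=2)
```

Scores close to 1 mean a point is about as dense as its neighbors; scores well above 1 mark it as a local outlier.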

Rule Based Systems

A fixed set of rules for classifying events

Classic example: Naïve Bayes for detecting spam mails
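A fixed rule such as the daily amount limit of 500€ mentioned earlier could be checked as sketched below. The transaction layout, card IDs, and timestamps are hypothetical; the rule flags every card and day on which the summed amount reaches or exceeds the limit.

```python
from collections import defaultdict
from datetime import datetime

DAILY_LIMIT_EUR = 500.0  # threshold from the example rule "Daily amount < 500€"


def flag_daily_limit(transactions, limit=DAILY_LIMIT_EUR):
    """Return (card_id, day) pairs whose daily total violates the rule."""
    totals = defaultdict(float)
    for card_id, timestamp, amount in transactions:
        day = datetime.fromisoformat(timestamp).date()
        totals[(card_id, day)] += amount
    return sorted(
        (card_id, day.isoformat())
        for (card_id, day), total in totals.items()
        if total >= limit
    )


# Hypothetical transactions: (card_id, ISO timestamp, amount in EUR)
transactions = [
    ("C1", "2018-10-12T09:00:00", 200.0),
    ("C1", "2018-10-12T18:30:00", 350.0),
    ("C1", "2018-10-13T11:15:00", 100.0),
    ("C2", "2018-10-12T12:00:00", 499.99),
]
flagged = flag_daily_limit(transactions)
```

Rules like this are transparent and cheap to run, but they only catch what they were written for, which is why the talk combines them with unsupervised and supervised methods.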

HypGraphs and HypTrails

Bayesian methods for comparing hypotheses about sequential data

Can be applied on transition networks

Healthcare Fraud Detection

RapidMiner Demo

Safeguarding Electronic Payments

The Challenge

• Protecting against fraud and anticipation of risk 7x24

• Large and diverse set of partners (merchants) – over 70,0000

• How to classify and check merchant ecommerce sites for payment system compliance?

RapidMiner Solution

• Analyze, classify and check merchants’ ecommerce sites for compliance

• Utilize text mining with NLP to auto-categorize with high sentiment accuracy

• Mash up the widest data sets: historical data on service usage, transaction history, customer profiles, usage logs, and known cases of fraudulent behavior

• Detect anomalies, misuse and fraud through operationalized classification model

Outcome

• Only 8-10% of merchant sites now screened manually at 80% confidence threshold

• Accurate automated analysis of high-risk sites: 92% correctly classified

• Elimination of false positives: no normal sites classified as high risk

• Time and cost to resolve fraud case radically reduced
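One plausible reading of the 80% confidence threshold is that classifications below it are routed to manual screening while the rest are handled automatically. The case study does not describe the mechanism, so the sketch below is an assumption; site names, labels, and confidences are invented.

```python
CONFIDENCE_THRESHOLD = 0.80  # threshold quoted in the outcome; its use here is assumed


def triage(predictions, threshold=CONFIDENCE_THRESHOLD):
    """Split model predictions into automatic decisions and manual reviews."""
    automatic, manual = [], []
    for site, label, confidence in predictions:
        target = automatic if confidence >= threshold else manual
        target.append((site, label))
    return automatic, manual


# Hypothetical classifier output: (site, predicted label, confidence)
predictions = [
    ("shop-a.example", "compliant", 0.97),
    ("shop-b.example", "high risk", 0.62),
    ("shop-c.example", "high risk", 0.91),
]
automatic, manual = triage(predictions)
manual_share = len(manual) / len(predictions)
```

Raising the threshold sends more sites to manual review; lowering it automates more decisions at the cost of acting on less certain predictions.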

Anticipating the risk of fraud: Russia’s Largest Electronic Payment Service

Process Mining & Fraud Detection

• Insurance Claims & Payments Leave Footprints and Audit Trails:

– Contracts

– Claim reports / incidents

– Payments / transactions

– Individuals & organisations involved

– IT system log files

• Use Process Mining to:

– Collect

– Normalize

– Correlate

– Analyze

• RapidMiner RapidProM Extension on the RapidMiner Marketplace

• Financial Audits

– Compliance / regulatory audits

– Operational audits

– Transactional services (M&A)

• Purchase Processes & Procurement

• IT Audits

– IT Service management

– Cyber security

– Systems compliance

– IT forensic services

• Manufacturing

– Identifying assembly bottlenecks

Process Mining – RapidMiner with RapidProM

Example process: Process Task 1 => Process Task 2 => IF/THEN => Process Task 3a or Process Task 3b => Process Task 4 => Process Task 5, with the tasks executed in applications A, B, and C

Log File App A:

…

200612 10:30 User0015 Task1 Case0099

260612 23:01 User4801 Task1 Case0223

Log File App B:

…

200612 10:31 User0015 Task2 Case0099

200612 10:35 User0015 Task3b Case0099

…

Log File App C:

…

200612 10:37 System Task4 Case0099

200612 10:38 System Task5 Case0099

Log File Normalization and Merge

Process Log Data Lake
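The normalization and merge step could be sketched as follows, using the sample log lines from the slide. Two things are assumptions: the date field is read as DDMMYY, and the assignment of lines to applications A, B, and C follows their order on the slide. The function and field names are hypothetical, and this is not the RapidProM implementation.

```python
from collections import defaultdict
from datetime import datetime

# Sample lines from the slide, grouped by (assumed) source application
LOGS = {
    "A": ["200612 10:30 User0015 Task1 Case0099", "260612 23:01 User4801 Task1 Case0223"],
    "B": ["200612 10:31 User0015 Task2 Case0099", "200612 10:35 User0015 Task3b Case0099"],
    "C": ["200612 10:37 System Task4 Case0099", "200612 10:38 System Task5 Case0099"],
}


def parse_line(line, source):
    """Turn one raw log line into a normalized event record."""
    date, time, user, task, case = line.split()
    timestamp = datetime.strptime(f"{date} {time}", "%d%m%y %H:%M")  # assumes DDMMYY
    return {"timestamp": timestamp, "user": user, "task": task, "case": case, "source": source}


def merge_logs(logs):
    """Parse all logs and merge them into one time-ordered event list."""
    events = [parse_line(line, src) for src, lines in logs.items() for line in lines]
    return sorted(events, key=lambda event: event["timestamp"])


def traces_by_case(events):
    """Group the ordered events into one task sequence (trace) per case."""
    traces = defaultdict(list)
    for event in events:
        traces[event["case"]].append(event["task"])
    return dict(traces)


events = merge_logs(LOGS)
traces = traces_by_case(events)
```

The resulting per-case traces are the kind of input a process mining tool needs to reconstruct the actual process and compare it with the intended one.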

RapidMiner with

Process Documentation (Bottom up model generation, determination of reference processes)

Social Collaboration (Social Graphs Analysis)

Process Harmonization (Compare against to-be processes and show deltas)

Process Optimization (Runtime Analysis, late runners, waiting times, unexpected stops, congestion)

http://www.rapidprom.org

Thanks for your Attention!

Ralf Klinkenberg

www.RapidMiner.com
