fraud analysis and other applications of unsupervised learning in property and casualty insurance

Applications of Unsupervised Learning in Property and Casualty Insurance with emphasis on fraud analysis Louise Francis, FCAS, MAAA Francis Analytics and Actuarial Data Mining, Inc. www.data-mines.com [email protected]

Upload: salford-systems

Post on 22-May-2015

Page 1: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Applications of Unsupervised Learning in Property and Casualty Insurance with emphasis on fraud analysis

Louise Francis, FCAS, MAAA

Francis Analytics and Actuarial Data Mining, Inc.

www.data-mines.com

[email protected]

Page 2: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Objectives

• Review classic unsupervised learning techniques
• Introduce 2 new unsupervised learning techniques
  • RandomForest
  • PRIDIT
• Apply the techniques to insurance data
  • Automobile fraud data set
  • A publicly available automobile insurance database

Page 3: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Motivation for Topic

• New book: Predictive Modeling in Actuarial Science
  • An introduction to predictive modeling for actuaries and other insurance professionals
  • Publisher: Cambridge University Press
  • Hope to publish: Fall 2012
• Chapter on Unsupervised Learning
  • Li Yang and Louise Francis
  • Li Yang: variable grouping (PCA)
  • Louise Francis: record grouping (clustering)

Page 4: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Book Project

Predictive Modeling 2 Volume Book Project

• A joint project leading to a two volume pair of books on Predictive Modeling in Actuarial Science.
• Volume 1 would be on Theory and Methods and Volume 2 would be on Property and Casualty Applications.
• The first volume will be introductory with basic concepts and a wide range of techniques designed to acquaint actuaries with this sector of problem solving techniques.
• The second volume would be a collection of applications to P&C problems, written by authors who are well aware of the advantages and disadvantages of the first volume techniques but who can explore relevant applications in detail with positive results.

Page 5: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

04/12/2023 Francis Analytics and Actuarial Data Mining, Inc.

The Fraud Study Data

• 1993 AIB closed PIP claims
• Dependent Variables
  • Suspicion Score
  • Expert assessment of likelihood of fraud or abuse
• Predictor Variables
  • Red flag indicators
  • Claim file variables

Page 6: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

The Fraud Problem

from: www.agentinsure.com

Page 7: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

The Fraud Problem (2)

from Coalition Against Insurance Fraud

Page 8: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Fraud and Abuse

• Planned fraud
  • Staged accidents
• Abuse
  • Opportunistic
  • Exaggerate claim

Page 9: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

The Fraud Red Flags

Binary variables that capture characteristics of claims associated with fraud and abuse

• Accident variables (acc01 – acc19)
• Injury variables (inj01 – inj12)
• Claimant variables (ch01 – ch11)
• Insured variables (ins01 – ins06)
• Treatment variables (trt01 – trt09)
• Lost wages variables (lw01 – lw07)

Page 10: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

The Red Flag Variables

Indicator Subject  Variable  Description
Accident           ACC01     No report by police officer at scene
                   ACC04     Single vehicle accident
                   ACC09     No plausible explanation for accident
                   ACC10     Claimant in old, low valued vehicle
                   ACC11     Rental vehicle involved in accident
                   ACC14     Property damage was inconsistent with accident
                   ACC15     Very minor impact collision
                   ACC16     Claimant vehicle stopped short
                   ACC19     Insured felt set up, denied fault
Claimant           CLT02     Had a history of previous claims
                   CLT04     Was an out of state accident
                   CLT07     Was one of three or more claimants in vehicle
Injury             INJ01     Injury consisted of strain or sprain only
                   INJ02     No objective evidence of injury
                   INJ03     Police report showed no injury or pain
                   INJ05     No emergency treatment was given
                   INJ06     Non-emergency treatment was delayed
                   INJ11     Unusual injury for auto accident
Insured            INS01     Had history of previous claims
                   INS03     Readily accepted fault for accident
                   INS06     Was difficult to contact/uncooperative
                   INS07     Accident occurred soon after effective date
Lost Wages         LW01      Claimant worked for self or a family member
                   LW03      Claimant recently started employment

Page 11: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Dependent Variable Problem

• Insurance companies frequently do not collect information as to whether a claim is suspected of fraud or abuse
  • Even when claims are referred for special investigation
• Solution: unsupervised learning

Page 12: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Supervised Learning

Page 13: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Dimension Reduction

ZipCode  FrequencyBI  FrequencyPD  FrequencyComb  PolicyCountNonBusinessUse  VehicleCountNonBusinessUse  SeverityBI  SeverityPD
90095              -        54.50           0.03                       2.00                        3.00                1,973.50
93741              -            -              -                       1.00                        1.00
90015          22.65        43.93           0.04                       1.00                        2.00   10,181.16    2,442.36
90067          15.53        44.41           0.04                       3.00                        6.00   13,146.57    2,565.56
90004          26.71        48.45           0.04                      11.00                       17.00    8,538.56    2,354.08

Page 14: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

The CAARP Data

• This assigned risk automobile data was made available to researchers in 2005 for the purpose of studying the effect of a change in regulation on territorial variables
• The data contain exposure information (car counts, premium) and claim and loss information (Bodily Injury (BI) counts, BI ultimate losses, Property Damage (PD) claim counts, PD ultimate losses)
• Each record is a zip code
• Good example of using unsupervised learning for territory construction

Page 15: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

R Cluster Library

• The “cluster” library from R was used
• Many of the functions in the library are described in Kaufman and Rousseeuw’s (1990) classic book on clustering, Finding Groups in Data.

Page 16: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Grouping Records

Page 17: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Dissimilarity

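Euclidean distance: for a pair of records, take the squared difference between the two records' values on each variable, sum the squared differences over all variables, and take the square root. It is computed record by record, for each record against every record it is being compared to.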
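To make the definition concrete, here is a minimal Python sketch of a pairwise Euclidean dissimilarity matrix. The three `claims` records are made-up 0/1 red-flag vectors used only for illustration; they are not taken from the fraud study data, and the function names are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences across variables."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(records):
    """Distance of every record to every other record."""
    n = len(records)
    return [[euclidean_distance(records[i], records[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical claims, each a vector of binary red-flag indicators.
claims = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
d = dissimilarity_matrix(claims)
```

For binary flags the squared difference is 1 when two claims disagree on a flag and 0 otherwise, so the distance is the square root of the number of flags on which they differ.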

Page 18: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

RF Similarity

• Varies between 0 and 1
• Proximity matrix is an output of RF
  • After a tree is fit, all records are run through the model
  • If 2 records land in the same terminal node, their proximity is increased by 1
  • Counts are divided by the number of trees, which keeps proximity between 0 and 1
  • 1 – proximity forms a distance
• Can be used as an input to clustering and other unsupervised learning procedures
• See “Unsupervised Learning with Random Forest Predictors” by Shi and Horvath
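The proximity bookkeeping can be sketched in a few lines of Python. This is an illustration only, not the Salford or R implementation: it assumes the terminal node id of every record in every tree is already available (for example from scikit-learn's `RandomForestClassifier.apply`), and the `leaves` values below are invented.

```python
def proximity_matrix(leaves):
    """leaves[i][t] is the terminal node id of record i in tree t.

    Proximity of two records is the share of trees in which they
    land in the same terminal node.
    """
    n = len(leaves)
    n_trees = len(leaves[0])
    prox = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(n):
            shared = sum(1 for t in range(n_trees)
                         if leaves[i][t] == leaves[j][t])
            prox[i][j] = shared / n_trees
    return prox

def proximity_to_distance(prox):
    """1 - proximity, usable as a dissimilarity for clustering."""
    return [[1.0 - p for p in row] for row in prox]

# Hypothetical terminal node ids: 3 records, 4 trees.
leaves = [
    [3, 5, 2, 7],
    [3, 5, 4, 7],
    [1, 6, 4, 8],
]
prox = proximity_matrix(leaves)
dist = proximity_to_distance(prox)
```

Records 0 and 1 share a terminal node in three of the four trees, so their proximity is 0.75 and their distance 0.25. The resulting `dist` matrix can be passed to any clustering routine that accepts a precomputed dissimilarity.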

Page 19: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Clustering

• Hierarchical clustering
• K-Means clustering
• This analysis uses k-means

Page 20: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

K-means Clustering

• The number of clusters, k, is specified by the user.
• An iterative procedure is used to assign each record in the data to one of the k clusters.
• The iteration begins with the initial centers or medoids for the k groups.
• The procedure uses a dissimilarity measure to assign records to a group and to iterate to a final grouping.
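The iteration described above can be sketched as follows. This is a minimal Python illustration under stated assumptions: it uses cluster means as centers and squared Euclidean distance as the dissimilarity (the medoid-based routines in the R cluster library work differently), and the `points` and starting centers are made up.

```python
def squared_distance(a, b):
    """Squared Euclidean distance; enough for finding the nearest center."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(records, initial_centers, max_iter=100):
    """Alternate assignment and center updates until assignments stop changing."""
    centers = [list(c) for c in initial_centers]
    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins its nearest center.
        new_labels = [
            min(range(len(centers)),
                key=lambda k: squared_distance(r, centers[k]))
            for r in records
        ]
        if new_labels == labels:
            break  # assignments are stable, so the grouping is final
        labels = new_labels
        # Update step: move each center to the mean of its members.
        for k in range(len(centers)):
            members = [r for r, lab in zip(records, labels) if lab == k]
            if members:  # an empty cluster keeps its previous center
                centers[k] = [sum(col) / len(members) for col in zip(*members)]
    return labels, centers

# Hypothetical two-variable records and user-chosen starting centers (k = 2).
points = [[0, 0], [0, 1], [1, 0], [9, 9], [9, 10], [10, 9]]
labels, centers = kmeans(points, initial_centers=[[0, 0], [10, 10]])
```

The result depends on the starting centers, so in practice the procedure is usually run from several starts and the best grouping kept.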

Page 21: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

R Cluster Output

Page 22: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Cluster Plot

Page 23: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Silhouette Plot

Page 24: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Silhouette Plot – RF Proximity

Page 25: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Silhouette Plot – Euclidean Distance Clustering

Page 26: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Testing using Expert Scores: Fit a Tree to Suspicion Score for Importance Ranking

Page 27: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Importance Ranking of the Clusters

Page 28: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Fit Tree to Binary Fraud Indicator

Page 29: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Importance Ranking (2)

Page 30: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

RF Ranking of the “Predictors”: Top 10 of 44

Variable  MeanDecreaseGini  Description
acc10                10.50  Claimant in old, low value vehicle
trt01                 9.05  large # visits to chiro
inj01                 8.64  strain or sprain
inj02                 8.64  readily accepted fault
inj05                 8.62  non emergency treatment given for injury
acc01                 8.55  no police report
clt07                 7.47  one of 3 or more claimants in vehicle
inj06                 7.44  non emergency trt delayed
acc15                 7.36  very minor collision
trt03                 6.82  large # visits to PT

Page 31: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Problem: Categorical Variables

• It is not clear how to best perform Principal Components/Factor Analysis on categorical variables
  • The categories may be coded as a series of binary dummy variables
  • If the categories are ordered categories, you may lose important information
• This is the problem that PRIDIT addresses

Page 32: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

RIDIT

Variables are ordered so that lowest value is associated with highest probability of fraud

Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for variable t, value i

\[ R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj} \]

Page 33: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Example: RIDIT for Legal Representation

Legal Representation

Value  Code  Number  Proportion  Proportion Below  Proportion Above   RIDIT
Yes       1     706       0.504             0.000             0.496  -0.496
No        2     694       0.496             0.504             0.000   0.504
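The legal representation example can be reproduced with a short Python sketch. The function name `ridit_scores` is hypothetical; the only inputs are the two counts shown in the table (706 and 694), and the score for each category is the proportion of claims below it minus the proportion above it.

```python
def ridit_scores(counts):
    """counts[i] is the number of claims in ordered category i (lowest code first).

    Returns, for each category, the proportion of claims in lower
    categories minus the proportion in higher categories.
    """
    total = sum(counts)
    props = [c / total for c in counts]
    scores = []
    for i in range(len(props)):
        below = sum(props[:i])
        above = sum(props[i + 1:])
        scores.append(below - above)
    return scores

# Legal representation: Yes (code 1) = 706 claims, No (code 2) = 694 claims.
legal_rep_counts = [706, 694]
legal_rep_ridit = ridit_scores(legal_rep_counts)
```

Rounded to three decimals, the two scores match the RIDIT column of the table. Weighted by the category proportions, the scores average to zero, which centers each variable before it enters the principal components step.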

Page 34: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

PRIDIT

Use RIDIT statistics in Principal Components Analysis

Component Matrix(a)

               Component 1
SIU                   .248
Police Report         .220
At Fault              .709
Legal Rep             .752
Medical Audit         .341
Prior Claim           .406

Extraction Method: Principal Component Analysis.
a. 1 components extracted.
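A rough Python sketch of the PRIDIT step follows: RIDIT-scored variables go into a principal components analysis and the first component supplies the loadings and a score per claim. Everything here is an assumption made for illustration. The `ridit_rows` values are invented, the function name is hypothetical, and the first component is found by simple power iteration on the correlation matrix rather than by the statistical package that produced the output above.

```python
import math

def first_component(rows, n_iter=500):
    """First principal component of the correlation matrix of `rows`.

    Returns (loadings, scores). Assumes no column is constant.
    """
    n, p = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(p)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(p)]
    # Standardize each variable, then form the correlation matrix.
    z = [[(r[j] - means[j]) / sds[j] for j in range(p)] for r in rows]
    corr = [[sum(zr[a] * zr[b] for zr in z) / (n - 1) for b in range(p)]
            for a in range(p)]
    # Power iteration converges to the dominant eigenvector.
    v = [1.0] * p
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(p)) for a in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(p))
                     for a in range(p))
    # Loadings are correlations between each variable and the component.
    loadings = [math.sqrt(eigenvalue) * x for x in v]
    scores = [sum(zr[j] * v[j] for j in range(p)) for zr in z]
    return loadings, scores

# Hypothetical RIDIT scores for 6 claims on 3 binary indicators.
ridit_rows = [
    [-0.5, -0.5, -0.5],
    [-0.5, -0.5,  0.5],
    [ 0.5,  0.5,  0.5],
    [ 0.5,  0.5, -0.5],
    [-0.5, -0.5, -0.5],
    [ 0.5,  0.5,  0.5],
]
loadings, scores = first_component(ridit_rows)
```

In this toy data the first two indicators always agree, so they receive equal, high loadings, while the third agrees with them less often and loads lower. That is the sense in which the component matrix ranks indicators such as Legal Rep and At Fault above the others.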

Page 35: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

PRIDITS of Accident Flags

Page 36: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Fit Tree with PRIDITS for Each Type of Flag

Page 37: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Importance Ranking of Pridits

Page 38: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Importance Ranking of Factors

Page 39: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Add RF and Euclid Clusters to PRIDIT Factors

Page 40: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Use Salford RF MDS

• Top variable in importance (acc10) used as binary dependent
• Run forest with 1,000 trees
• Output proximities and MDS
• Use MDS scales as inputs to cluster (k=3)
• Run Tree to get importance ranking

Page 41: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

MDS Graph

Page 42: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Rank of cluster procedures to Tree Prediction

Page 43: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Labeling Clusters

Page 44: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Relation Between PRIDIT Factor and Suspicion

Page 45: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Next Steps

• Add claim file variables
  • Rerun clusters
  • Rerun PRIDITS
• Do Random Forest proximities on the RIDITS
• Apply the procedures to other fraud databases

Page 46: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

PRIDIT REFERENCES

Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), Assessing Consumer Fraud Risk in Insurance Claims with Discrete and Continuous Data, North American Actuarial Journal, 13: 438-458.

Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and Alpert, Mark, (2002), Fraud Classification Using Principal Component Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.

Brockett, Patrick L., Xiaohua, Xia and Derrig, Richard A., (1998), Using Kohonen’s Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance, 65: 245-274.

Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics, 14: 18-38.

Chipman, H., George, E.I. and McCulloch, R.E., (2006), Bayesian Ensemble Learning, Neural Information Processing Systems.

Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health Services Research, 43:3, 988–1005.

Page 47: Fraud Analysis and Other Applications of Unsupervised Learning in Property and Casualty Insurance

Questions?