fraud analysis and other applications of unsupervised learning in property and casualty insurance

Applications of Unsupervised Learning in Property and Casualty Insurance, with emphasis on fraud analysis

Louise Francis, FCAS, MAAA

Francis Analytics and Actuarial Data Mining, Inc.

www.data-mines.com


Objectives

• Review classic unsupervised learning techniques
• Introduce 2 new unsupervised learning techniques
  • Random Forest
  • PRIDIT
• Apply the techniques to insurance data
  • Automobile fraud data set
  • A publicly available automobile insurance database

Motivation for Topic

• New book: Predictive Modeling in Actuarial Science
  • An introduction to predictive modeling for actuaries and other insurance professionals
  • Publisher: Cambridge University Press
  • Hope to publish: Fall 2012
• Chapter on Unsupervised Learning
  • Li Yang and Louise Francis
  • Li Yang: variable grouping (PCA)
  • Louise Francis: record grouping (clustering)

Book Project

• Predictive Modeling 2 Volume Book Project
• A joint project leading to a two-volume pair of books on Predictive Modeling in Actuarial Science
  • Volume 1 would be on Theory and Methods
  • Volume 2 would be on Property and Casualty Applications
• The first volume will be introductory, with basic concepts and a wide range of techniques designed to acquaint actuaries with this sector of problem-solving techniques.
• The second volume would be a collection of applications to P&C problems, written by authors who are well aware of the advantages and disadvantages of the first-volume techniques but who can explore relevant applications in detail with positive results.

04/12/2023 Francis Analytics and Actuarial Data Mining, Inc.


The Fraud Study Data

• 1993 AIB closed PIP claims
• Dependent Variables
  • Suspicion Score
  • Expert assessment of likelihood of fraud or abuse
• Predictor Variables
  • Red flag indicators
  • Claim file variables

The Fraud Problem (from: www.agentinsure.com)

The Fraud Problem (2) (from Coalition Against Insurance Fraud)

Fraud and Abuse

• Planned fraud
  • Staged accidents
• Abuse
  • Opportunistic
  • Exaggerate claim

The Fraud Red Flags

• Binary variables that capture characteristics of claims associated with fraud and abuse
  • Accident variables (acc01 – acc19)
  • Injury variables (inj01 – inj12)
  • Claimant variables (ch01 – ch11)
  • Insured variables (ins01 – ins06)
  • Treatment variables (trt01 – trt09)
  • Lost wages variables (lw01 – lw07)

The Red Flag Variables

Indicator Subject  Variable  Description
Accident           ACC01     No report by police officer at scene
                   ACC04     Single vehicle accident
                   ACC09     No plausible explanation for accident
                   ACC10     Claimant in old, low valued vehicle
                   ACC11     Rental vehicle involved in accident
                   ACC14     Property damage was inconsistent with accident
                   ACC15     Very minor impact collision
                   ACC16     Claimant vehicle stopped short
                   ACC19     Insured felt set up, denied fault
Claimant           CLT02     Had a history of previous claims
                   CLT04     Was an out of state accident
                   CLT07     Was one of three or more claimants in vehicle
Injury             INJ01     Injury consisted of strain or sprain only
                   INJ02     No objective evidence of injury
                   INJ03     Police report showed no injury or pain
                   INJ05     No emergency treatment was given
                   INJ06     Non-emergency treatment was delayed
                   INJ11     Unusual injury for auto accident
Insured            INS01     Had history of previous claims
                   INS03     Readily accepted fault for accident
                   INS06     Was difficult to contact/uncooperative
                   INS07     Accident occurred soon after effective date
Lost Wages         LW01      Claimant worked for self or a family member
                   LW03      Claimant recently started employment

Dependent Variable Problem

• Insurance companies frequently do not collect information as to whether a claim is suspected of fraud or abuse
  • Even when claims are referred for special investigation
• Solution: unsupervised learning

Supervised Learning



Dimension Reduction

ZipCode  Frequency BI  Frequency PD  Frequency Comb  PolicyCountNonBusinessUse  VehicleCountNonBusinessUse  SeverityBI  SeverityPD
90095    -             54.50         0.03            2.00                       3.00                                    1,973.50
93741    -             -             -               1.00                       1.00
90015    22.65         43.93         0.04            1.00                       2.00                        10,181.16   2,442.36
90067    15.53         44.41         0.04            3.00                       6.00                        13,146.57   2,565.56
90004    26.71         48.45         0.04            11.00                      17.00                       8,538.56    2,354.08

The CAARP Data

• This assigned risk automobile data was made available to researchers in 2005 for the purpose of studying the effect of a change in regulation on territorial variables.
• The data contain exposure information (car counts, premium) and claim and loss information (Bodily Injury (BI) counts, BI ultimate losses, Property Damage (PD) claim counts, PD ultimate losses).
• Each record is a zip code.
• Good example of using unsupervised learning for territory construction.

R Cluster Library

• The “cluster” library from R is used.
• Many of the functions in the library are described in Kaufman and Rousseeuw’s (1990) classic book on clustering, Finding Groups in Data.

Grouping Records



Dissimilarity

Euclidean distance: the square root of the sum, over all variables, of the squared differences between a record's values and the values of the record it is being compared to.
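As a minimal sketch of the Euclidean dissimilarity described above, the following computes the distance between pairs of claims. The three red-flag vectors are hypothetical and made up for illustration; they are not records from the fraud study data.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences across variables."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical red-flag vectors (1 = flag present, 0 = flag absent).
claim_a = [1, 0, 1, 1, 0]
claim_b = [1, 1, 0, 1, 0]
claim_c = [0, 1, 0, 0, 1]

d_ab = euclidean_distance(claim_a, claim_b)  # claims differ on 2 flags
d_ac = euclidean_distance(claim_a, claim_c)  # claims differ on all 5 flags
print(d_ab, d_ac)
```

For binary flags the squared distance is simply the number of flags on which two claims disagree, so claims with similar flag patterns end up close together.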



RF Similarity

• Varies between 0 and 1
• Proximity matrix is an output of RF
  • After a tree is fit, all records are run through the model
  • If 2 records are in the same terminal node, their proximity is increased by 1 (the counts are then divided by the number of trees)
  • 1 – proximity forms a distance
• Can be used as an input to clustering and other unsupervised learning procedures
• See “Unsupervised Learning with Random Forest Predictors” by Shi and Horvath
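The proximity calculation can be sketched as follows. This is an illustration only: the `leaf_ids` table of terminal-node assignments is hypothetical and stands in for the output of a fitted forest, and dividing the counts by the number of trees is the usual normalisation that keeps proximities between 0 and 1.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][i] is the terminal node that record i lands in for tree t."""
    n_trees = len(leaf_ids)
    n_records = len(leaf_ids[0])
    prox = [[0.0] * n_records for _ in range(n_records)]
    for tree in leaf_ids:
        for i in range(n_records):
            for j in range(n_records):
                if tree[i] == tree[j]:
                    prox[i][j] += 1.0  # same terminal node: proximity + 1
    # Normalise by the number of trees so values lie between 0 and 1.
    return [[count / n_trees for count in row] for row in prox]

# Hypothetical terminal-node assignments: 4 trees, 4 records.
leaf_ids = [
    [1, 1, 2, 2],
    [3, 3, 3, 4],
    [5, 6, 6, 6],
    [7, 7, 8, 8],
]

prox = proximity_matrix(leaf_ids)
dist = [[1.0 - p for p in row] for row in prox]  # 1 - proximity as distance
```

The resulting `dist` matrix can be passed to any clustering routine that accepts a precomputed dissimilarity matrix.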



Clustering

• Hierarchical clustering
• K-Means clustering
• This analysis uses k-means

K-means Clustering

• The number of clusters, k, is specified by the user.
• An iterative procedure is used to assign each record in the data to one of the k clusters.
• The iteration begins with the initial centers or medoids for the k groups.
• The procedure uses a dissimilarity measure to assign records to a group and to iterate to a final grouping.
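The iteration described above can be sketched in a few lines. This is a simplified illustration, not the R cluster library's implementation: it uses cluster means as centers (a medoid-based method would use actual records instead), seeds the centers with the first k records as an assumption, and runs on made-up two-variable data.

```python
def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def k_means(records, k, max_iter=100):
    # Assumption: the first k records serve as the initial centers.
    centers = [list(r) for r in records[:k]]
    assignment = [None] * len(records)
    for _ in range(max_iter):
        # Assign each record to its nearest center.
        new_assignment = [
            min(range(k), key=lambda c: squared_distance(r, centers[c]))
            for r in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster: final grouping reached
        assignment = new_assignment
        # Recompute each center as the mean of its assigned records.
        for c in range(k):
            members = [r for r, a in zip(records, assignment) if a == c]
            if members:
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical data with two well-separated groups.
records = [(0.0, 0.1), (5.0, 5.2), (0.2, 0.0), (5.1, 4.9), (0.1, 0.2), (4.8, 5.0)]
assignment, centers = k_means(records, k=2)
print(assignment)
```

In practice the result depends on the starting centers, so it is common to rerun the procedure from several starting points.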



R Cluster Output



Cluster Plot



Silhouette Plot

Silhouette Plot – RF Proximity

Silhouette Plot – Euclidean Distance Clustering

Testing using Expert Scores: Fit a Tree to Suspicion Score for Importance Ranking



Importance Ranking of the Clusters



Fit Tree to Binary Fraud Indicator



Importance Ranking (2)



RF Ranking of the “Predictors”: Top 10 of 44

Variable  MeanDecreaseGini  Description
acc10     10.50             Claimant in old, low value vehicle
trt01     9.05              large # visits to chiro
inj01     8.64              strain or sprain
inj02     8.64              readily accepted fault
inj05     8.62              non emergency treatment given for injury
acc01     8.55              no police report
clt07     7.47              one of 3 or more claimants in vehicle
inj06     7.44              non emergency trt delayed
acc15     7.36              very minor collision
trt03     6.82              large # visits to PT

Problem: Categorical Variables

• It is not clear how to best perform Principal Components/Factor Analysis on categorical variables
  • The categories may be coded as a series of binary dummy variables
  • If the categories are ordered categories, you may lose important information
• This is the problem that PRIDIT addresses

RIDIT

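• Variables are ordered so that the lowest value is associated with the highest probability of fraud
• Use the cumulative distribution of claims at each value, i, to create the RIDIT statistic for variable t, value i, where \hat{p}_{tj} is the observed proportion of claims with value j on variable t:

R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}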

Example: RIDIT for Legal Representation

Legal Representation

Value  Code  Number  Proportion  Proportion Below  Proportion Above  RIDIT
Yes    1     706     0.504       0.000             0.496             -0.496
No     2     694     0.496       0.504             0.000             0.504
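The Legal Representation example can be reproduced with a short sketch: for each ordered value, the RIDIT is the proportion of claims below that value minus the proportion above it. The function name `ridit_scores` is illustrative and not taken from any library.

```python
def ridit_scores(counts):
    """counts: number of claims at each value, listed in the variable's order."""
    total = sum(counts)
    proportions = [c / total for c in counts]
    scores = []
    for i in range(len(proportions)):
        below = sum(proportions[:i])      # proportion of claims below value i
        above = sum(proportions[i + 1:])  # proportion of claims above value i
        scores.append(below - above)
    return scores

# Counts from the Legal Representation example: Yes = 706, No = 694.
legal_rep = ridit_scores([706, 694])
print([round(s, 3) for s in legal_rep])  # [-0.496, 0.504]
```

Because "Yes" is coded as the lowest value, it receives the negative score, in line with the convention that the lowest value is the one most associated with fraud.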

PRIDIT

Use RIDIT statistics in Principal Components Analysis

Component Matrix (a)

               Component 1
SIU            .248
Police Report  .220
At Fault       .709
Legal Rep      .752
Medical Audit  .341
Prior Claim    .406

Extraction Method: Principal Component Analysis.
a. 1 components extracted.
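A rough sketch of the PRIDIT idea, under stated assumptions: RIDIT scores for several variables are standardized, and the first principal component of their correlation matrix supplies the weights for a single score per claim. The RIDIT matrix below is hypothetical, and plain power iteration stands in for a full PCA routine; it assumes the largest eigenvalue is distinct.

```python
import math

def standardize(rows):
    n, m = len(rows), len(rows[0])
    means = [sum(r[j] for r in rows) / n for j in range(m)]
    sds = [math.sqrt(sum((r[j] - means[j]) ** 2 for r in rows) / (n - 1))
           for j in range(m)]
    return [[(r[j] - means[j]) / sds[j] for j in range(m)] for r in rows]

def correlation_matrix(z):
    n, m = len(z), len(z[0])
    return [[sum(r[a] * r[b] for r in z) / (n - 1) for b in range(m)]
            for a in range(m)]

def first_component(corr, n_iter=500):
    """Power iteration for the leading eigenvector of the correlation matrix."""
    m = len(corr)
    v = [1.0 / math.sqrt(m)] * m
    for _ in range(n_iter):
        w = [sum(corr[a][b] * v[b] for b in range(m)) for a in range(m)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[a] * sum(corr[a][b] * v[b] for b in range(m))
                     for a in range(m))
    return v, eigenvalue

# Hypothetical RIDIT scores: 8 claims, 3 binary variables split 50/50.
ridits = [
    [-0.5, -0.5, -0.5], [-0.5, -0.5, -0.5], [-0.5, -0.5, 0.5], [-0.5, 0.5, -0.5],
    [0.5, 0.5, 0.5], [0.5, 0.5, 0.5], [0.5, 0.5, -0.5], [0.5, -0.5, 0.5],
]

z = standardize(ridits)
weights, eigenvalue = first_component(correlation_matrix(z))
# Loadings, as reported in a component matrix: weight * sqrt(eigenvalue).
loadings = [w * math.sqrt(eigenvalue) for w in weights]
# One PRIDIT-style score per claim: weighted sum of standardized RIDITs.
pridit_scores = [sum(w * x for w, x in zip(weights, row)) for row in z]
```

Variables with larger loadings contribute more to the score, which is how the component matrix above can be read.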



PRIDITS of Accident Flags



Fit Tree with PRIDITS for Each Type of Flag



Importance Ranking of Pridits



Importance Ranking of Factors



Add RF and Euclid Clusters to PRIDIT Factors



Use Salford RF MDS

• Top variable in importance (acc10) used as binary dependent
• Run random forest with 1,000 trees
• Output proximities and MDS
• Use MDS scales as input to cluster (k=3)
• Run Tree to get importance ranking

MDS Graph



Rank of cluster procedures to Tree Prediction



Labeling Clusters



Relation Between PRIDIT Factor and Suspicion



Next Steps

• Add claim file variables
• Rerun clusters
• Rerun PRIDITS
• Do Random Forest proximities on the RIDITS
• Apply the procedures to other fraud databases

PRIDIT REFERENCES

Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), Assessing Consumer Fraud Risk in Insurance Claims with Discrete and Continuous Data, North American Actuarial Journal, 13: 438-458.

Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and Alpert, Mark, (2002), Fraud Classification Using Principal Component Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.

Brockett, Patrick L., Xia, Xiaohua and Derrig, Richard A., (1998), Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance, 65: 245-274.

Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics, 14: 18-38.

Chipman, H., George, E.I., and McCulloch, R.E., (2006), Bayesian Ensemble Learning, Neural Information Processing Systems.

Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health Services Research, 43:3, 988–1005.



Questions?
