fraud analysis and other applications of unsupervised learning in property and casualty insurance
TRANSCRIPT
Applications of Unsupervised Learning in Property and Casualty Insurance, with emphasis on fraud analysis
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Objectives
Review classic unsupervised learning techniques
Introduce two newer unsupervised learning techniques:
  Random Forest
  PRIDIT
Apply the techniques to insurance data:
  Automobile fraud data set
  A publicly available automobile insurance database
Motivation for Topic
New book: Predictive Modeling in Actuarial Science
An introduction to predictive modeling for actuaries and other insurance professionals
Publisher: Cambridge University Press
Hope to publish: Fall 2012
Chapter on Unsupervised Learning by Li Yang and Louise Francis:
  Li Yang: variable grouping (PCA)
  Louise Francis: record grouping (clustering)
Book Project
Predictive Modeling 2 Volume Book Project
A joint project leading to a two-volume pair of books on Predictive Modeling in Actuarial Science. Volume 1 will cover theory and methods, and Volume 2 will cover property and casualty applications.
The first volume will be introductory, presenting basic concepts and a wide range of techniques designed to acquaint actuaries with this class of problem-solving techniques. The second volume will be a collection of applications to P&C problems, written by authors who are well aware of the advantages and disadvantages of the first-volume techniques and who can explore relevant applications in detail with positive results.
04/12/2023 Francis Analytics and Actuarial Data Mining, Inc.
The Fraud Study Data
• 1993 AIB closed PIP claims
• Dependent variables
  • Suspicion score
  • Expert assessment of likelihood of fraud or abuse
• Predictor variables
  • Red flag indicators
  • Claim file variables
The Fraud Problem (from: www.agentinsure.com)
The Fraud Problem (2) (from Coalition Against Insurance Fraud)
Fraud and Abuse
Planned fraud
  Staged accidents
Abuse
  Opportunistic
  Exaggerated claim
The Fraud Red Flags
Binary variables that capture characteristics of claims associated with fraud and abuse
  Accident variables (acc01 - acc19)
  Injury variables (inj01 – inj12)
  Claimant variables (ch01 – ch11)
  Insured variables (ins01 – ins06)
  Treatment variables (trt01 – trt09)
  Lost wages variables (lw01 – lw07)
The Red Flag Variables

Indicator Subject  Variable  Description
Accident           ACC01     No report by police officer at scene
                   ACC04     Single vehicle accident
                   ACC09     No plausible explanation for accident
                   ACC10     Claimant in old, low valued vehicle
                   ACC11     Rental vehicle involved in accident
                   ACC14     Property damage was inconsistent with accident
                   ACC15     Very minor impact collision
                   ACC16     Claimant vehicle stopped short
                   ACC19     Insured felt set up, denied fault
Claimant           CLT02     Had a history of previous claims
                   CLT04     Was an out of state accident
                   CLT07     Was one of three or more claimants in vehicle
Injury             INJ01     Injury consisted of strain or sprain only
                   INJ02     No objective evidence of injury
                   INJ03     Police report showed no injury or pain
                   INJ05     No emergency treatment was given
                   INJ06     Non-emergency treatment was delayed
                   INJ11     Unusual injury for auto accident
Insured            INS01     Had history of previous claims
                   INS03     Readily accepted fault for accident
                   INS06     Was difficult to contact/uncooperative
                   INS07     Accident occurred soon after effective date
Lost Wages         LW01      Claimant worked for self or a family member
                   LW03      Claimant recently started employment
Dependent Variable Problem
Insurance companies frequently do not collect information as to whether a claim is suspected of fraud or abuse
  Even when claims are referred for special investigation
Solution: unsupervised learning
Supervised Learning
Dimension Reduction
ZipCode  FrequencyBI  FrequencyPD  FrequencyComb  PolicyCountNonBusinessUse  VehicleCountNonBusinessUse  SeverityBI  SeverityPD
90095    -            54.50        0.03           2.00                       3.00                                    1,973.50
93741    -            -            -              1.00                       1.00
90015    22.65        43.93        0.04           1.00                       2.00                        10,181.16   2,442.36
90067    15.53        44.41        0.04           3.00                       6.00                        13,146.57   2,565.56
90004    26.71        48.45        0.04           11.00                      17.00                       8,538.56    2,354.08
The CAARP Data
This assigned risk automobile data was made available to researchers in 2005 for the purpose of studying the effect of a change in regulation on territorial variables
The data contain exposure information (car counts, premium) and claim and loss information (Bodily Injury (BI) counts, BI ultimate losses, Property Damage (PD) claim counts, PD ultimate losses)
Each record is a zip code
Good example of using unsupervised learning for territory construction
R Cluster Library
The “cluster” library from R was used
Many of the functions in the library are described in Kaufman and Rousseeuw’s (1990) classic book on clustering, Finding Groups in Data
Grouping Records
Dissimilarity
Euclidean distance: the square root of the sum, over all variables, of the squared differences between a record's values and the values of the record it is being compared to.
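The distance calculation can be sketched in a few lines of Python. This is an illustration only, not the code used in the analysis (which relied on R); the function names and the three example records of binary red flags are hypothetical.

```python
import math

def euclidean_distance(a, b):
    """Square root of the summed squared differences between two records."""
    if len(a) != len(b):
        raise ValueError("records must have the same number of variables")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dissimilarity_matrix(records):
    """Pairwise Euclidean distances between all records."""
    n = len(records)
    return [[euclidean_distance(records[i], records[j]) for j in range(n)]
            for i in range(n)]

# Hypothetical claims, each described by four binary red-flag indicators.
claims = [
    [1, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 0, 1],
]
d = dissimilarity_matrix(claims)
```

With binary indicators, the squared distance between two claims is simply the number of flags on which they disagree.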
RF Similarity
Varies between 0 and 1
Proximity matrix is an output of RF
  After a tree is fit, all records are run through the model
  If 2 records are in the same terminal node, their proximity is increased by 1
  1 - proximity forms a distance
Can be used as an input to clustering and other unsupervised learning procedures
See “Unsupervised Learning with Random Forest Predictors” by Shi and Horvath
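A minimal sketch of the proximity calculation, assuming the terminal-node assignments from each tree are already available. The `leaf_ids` values below are made up for illustration, and dividing by the number of trees is the assumption used here to keep proximity between 0 and 1; this is not the output of any particular Random Forest package.

```python
def proximity_matrix(leaf_ids):
    """leaf_ids[t][r] is the terminal node that record r lands in for tree t."""
    n_trees = len(leaf_ids)
    n_records = len(leaf_ids[0])
    prox = [[0.0] * n_records for _ in range(n_records)]
    for tree in leaf_ids:
        for i in range(n_records):
            for j in range(n_records):
                # Two records in the same terminal node: proximity goes up by 1.
                if tree[i] == tree[j]:
                    prox[i][j] += 1.0
    # Assumption: divide by the number of trees so values lie between 0 and 1.
    return [[v / n_trees for v in row] for row in prox]

def proximity_to_distance(prox):
    """1 - proximity, usable as a dissimilarity for clustering."""
    return [[1.0 - v for v in row] for row in prox]

# Hypothetical terminal-node ids: 4 trees (rows) by 4 records (columns).
leaf_ids = [
    [3, 3, 5, 5],
    [2, 2, 2, 7],
    [4, 6, 6, 6],
    [1, 1, 8, 8],
]
prox = proximity_matrix(leaf_ids)
dist = proximity_to_distance(prox)
```

The resulting `dist` matrix can be passed to any clustering routine that accepts a precomputed dissimilarity.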
Clustering
Hierarchical clustering
K-means clustering
This analysis uses k-means
K-means Clustering
An iterative procedure is used to assign each record in the data to one of k clusters, where k is specified by the user.
The iteration begins with the initial centers or medoids for the k groups.
The procedure uses a dissimilarity measure to assign records to a group and to iterate to a final grouping.
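The iteration can be sketched as follows. This is a plain k-means with cluster means as centers and Euclidean distance as the dissimilarity, written for illustration; it is not the R "cluster" library routine used in the analysis, and the data and starting centers are hypothetical.

```python
import math

def kmeans(records, initial_centers, max_iter=100):
    """Assign each record to its nearest center, then recompute the centers."""
    centers = [list(c) for c in initial_centers]
    assignment = None
    for _ in range(max_iter):
        new_assignment = [
            min(range(len(centers)), key=lambda c: math.dist(rec, centers[c]))
            for rec in records
        ]
        if new_assignment == assignment:
            break  # no record changed cluster: final grouping reached
        assignment = new_assignment
        for c in range(len(centers)):
            members = [rec for rec, a in zip(records, assignment) if a == c]
            if members:  # keep the old center if a cluster is empty
                centers[c] = [sum(col) / len(members) for col in zip(*members)]
    return assignment, centers

# Hypothetical two-variable records forming two obvious groups.
records = [
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
    [5.0, 5.0], [5.2, 4.9], [4.8, 5.1],
]
# k = 2, with the user-supplied starting centers taken from the data.
assignment, centers = kmeans(records, [records[0], records[3]])
```

The result depends on the starting centers, so in practice the procedure is often rerun from several starting points.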
R Cluster Output
Cluster Plot
Silhouette Plot
Silhouette Plot RF Proximity
Silhouette Plot – Euclidean Distance Clustering
Testing using Expert Scores: Fit a Tree to Suspicion Score for Importance Ranking
Importance Ranking of the Clusters
Fit Tree to Binary Fraud Indicator
Importance Ranking (2)
RF Ranking of the “Predictors”: Top 10 of 44
Variable  MeanDecreaseGini  Description
acc10     10.50             Claimant in old low value vehicle
trt01     9.05              large # visits to chiro
inj01     8.64              strain or sprain
inj02     8.64              readily accepted fault
inj05     8.62              non emergency treatment given for injury
acc01     8.55              no police report
clt07     7.47              one of 3 or more claimants in vehicle
inj06     7.44              non emergency trt delayed
acc15     7.36              very minor collision
trt03     6.82              large # visits to PT
Problem: Categorical Variables
It is not clear how to best perform Principal Components/Factor Analysis on categorical variables
The categories may be coded as a series of binary dummy variables
If the categories are ordered categories, you may lose important information
This is the problem that PRIDIT addresses
RIDIT
Variables are ordered so that lowest value is associated with highest probability of fraud
Use Cumulative distribution of claims at each value, i, to create RIDIT statistic for claim t, value i
$$R_{ti} = \sum_{j<i} \hat{p}_{tj} - \sum_{j>i} \hat{p}_{tj}$$
Example: RIDIT for Legal Representation
Legal Representation

Value  Code  Number  Proportion  Proportion Below  Proportion Above  RIDIT
Yes    1     706     0.504       0.000             0.496             -0.496
No     2     694     0.496       0.504             0.000              0.504
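The Legal Representation figures can be reproduced with a short Python sketch. The function name is hypothetical; the calculation is the proportion of claims below a category minus the proportion above it, using the counts 706 and 694 from the example.

```python
def ridit_scores(counts):
    """RIDIT for each ordered category: proportion below minus proportion above."""
    total = sum(counts)
    props = [c / total for c in counts]
    scores = []
    for i in range(len(props)):
        below = sum(props[:i])      # categories with a lower code
        above = sum(props[i + 1:])  # categories with a higher code
        scores.append(below - above)
    return scores

# Counts from the Legal Representation example: Yes (code 1), No (code 2).
legal_rep = ridit_scores([706, 694])
rounded = [round(s, 3) for s in legal_rep]  # [-0.496, 0.504]
```

Because the categories are ordered so that the lowest value is most associated with fraud, the more suspicious category receives the negative score.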
PRIDIT
Use RIDIT statistics in Principal Components Analysis
Component Matrix (a)

               Component 1
SIU            .248
Police Report  .220
At Fault       .709
Legal Rep      .752
Medical Audit  .341
Prior Claim    .406

Extraction Method: Principal Component Analysis.
a. 1 components extracted.
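A rough sketch of how a first principal component and its loadings can be extracted from RIDIT scores. It uses power iteration on the correlation matrix, which is one simple way to get the leading component and is not necessarily how the statistical package that produced the table above computes it. The RIDIT values and all names below are hypothetical.

```python
import math

def correlation_matrix(columns):
    """Pearson correlation matrix of equally long columns of scores."""
    n = len(columns[0])
    centered = []
    for col in columns:
        mean = sum(col) / n
        centered.append([x - mean for x in col])
    norms = [math.sqrt(sum(x * x for x in col)) for col in centered]
    return [
        [sum(a * b for a, b in zip(ci, cj)) / (ni * nj)
         for cj, nj in zip(centered, norms)]
        for ci, ni in zip(centered, norms)
    ]

def first_component(matrix, iterations=500):
    """Largest eigenvalue and unit eigenvector by power iteration."""
    v = [1.0] * len(matrix)
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(
        vi * sum(m * x for m, x in zip(row, v)) for vi, row in zip(v, matrix)
    )
    return eigenvalue, v

# Hypothetical RIDIT scores: one inner list per variable, one entry per claim.
ridit_columns = [
    [-0.5, -0.5, 0.5, 0.5, -0.5, 0.5],
    [-0.4, -0.4, 0.6, 0.6, 0.6, 0.6],
    [-0.3, 0.7, 0.7, 0.7, -0.3, 0.7],
]
corr = correlation_matrix(ridit_columns)
eigenvalue, weights = first_component(corr)
# Loadings in the style of a component matrix: eigenvector times sqrt(eigenvalue).
loadings = [w * math.sqrt(eigenvalue) for w in weights]
```

Variables with larger loadings, such as At Fault and Legal Rep in the component matrix above, contribute more to the resulting score.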
PRIDITS of Accident Flags
Fit Tree with PRIDITS for Each Type of Flag
Importance Ranking of PRIDITS
Importance Ranking of Factors
Add RF and Euclid Clusters to PRIDIT Factors
Use Salford RF MDS
Top variable in importance (acc10) used as binary dependent
Run forest with 1,000 trees
Output proximities and MDS
Use MDS scales as inputs to cluster (k=3)
Run tree to get importance ranking
MDS Graph
Rank of cluster procedures to Tree Prediction
Labeling Clusters
Relation Between PRIDIT Factor and Suspicion
Next Steps
Add claim file variables
  Rerun clusters
  Rerun PRIDITS
Do Random Forest proximities on the RIDITS
Apply the procedures to other fraud databases
PRIDIT REFERENCES
Ai, J., Brockett, Patrick L., and Golden, Linda L., (2009), Assessing Consumer Fraud Risk in Insurance Claims with Discrete and Continuous Data, North American Actuarial Journal, 13: 438-458.
Brockett, Patrick L., Derrig, Richard A., Golden, Linda L., Levine, Albert and Alpert, Mark, (2002), Fraud Classification Using Principal Component Analysis of RIDITs, Journal of Risk and Insurance, 69:3, 341-373.
Brockett, Patrick L., Xia, Xiaohua and Derrig, Richard A., (1998), Using Kohonen's Self-Organizing Feature Map to Uncover Automobile Bodily Injury Claims Fraud, Journal of Risk and Insurance, 65:245-274.
Bross, Irwin D.J., (1958), How To Use RIDIT Analysis, Biometrics, 14:18-38.
Chipman, H., George, E.I. and McCulloch, R.E., (2006), Bayesian Ensemble Learning, Neural Information Processing Systems.
Lieberthal, Robert D., (2008), Hospital Quality: A PRIDIT Approach, Health Services Research, 43:3, 988–1005.
Questions?