DESCRIPTION
Distinguishing the Forest from the Trees, 2006 CAS Ratemaking Seminar. Richard Derrig, PhD, Opal Consulting (www.opalconsulting.com); Louise Francis, FCAS, MAAA, Francis Analytics and Actuarial Data Mining, Inc. (www.data-mines.com).

TRANSCRIPT
Distinguishing the Forest from the Trees
2006 CAS Ratemaking Seminar
Richard Derrig, PhD, Opal Consulting
www.opalconsulting.com
Louise Francis, FCAS, MAAA
Francis Analytics and Actuarial Data Mining, Inc.
www.data-mines.com
Data Mining
Data Mining, also known as Knowledge-Discovery in Databases (KDD), is the process of automatically searching large volumes of data for patterns. In order to achieve this, data mining uses computational techniques from statistics, machine learning and pattern recognition.
• www.wikipedia.org
The Problem: Nonlinear Functions
An Insurance Nonlinear Function: Provider Bill vs. Probability of Independent Medical Exam
[Chart: probability of IME (y-axis, 0.30–0.90) plotted against Provider 2 Bill (x-axis); the relationship is nonlinear]
Common Links for GLMs
The identity link: h(Y) = Y
The log link: h(Y) = ln(Y)
The inverse link: h(Y) = 1/Y
The logit link: h(Y) = ln(Y / (1 − Y))
The probit link: h(Y) = Φ⁻¹(Y), where Φ denotes the normal CDF
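The five links above can each be written as a one-line function. A minimal sketch using only the Python standard library (the function names are mine, not from the presentation):

```python
from math import log
from statistics import NormalDist

# The five common GLM links listed above, as plain functions.
def identity_link(y): return y
def log_link(y): return log(y)
def inverse_link(y): return 1.0 / y
def logit_link(y): return log(y / (1.0 - y))          # ln(Y / (1 - Y))
def probit_link(y): return NormalDist().inv_cdf(y)    # inverse standard normal CDF
```

Note that the logit and probit of 0.5 are both zero, which is why the two links often give similar fits near the middle of the probability range.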
Nonlinear Example Data

| Provider 2 Bill (Banded) | Avg Provider 2 Bill | Avg Total Paid | Percent IME |
|---|---|---|---|
| Zero | – | 9,063 | 6% |
| 1–250 | 154 | 8,761 | 8% |
| 251–500 | 375 | 9,726 | 9% |
| 501–1,000 | 731 | 11,469 | 10% |
| 1,001–1,500 | 1,243 | 14,998 | 13% |
| 1,501–2,500 | 1,915 | 17,289 | 14% |
| 2,501–5,000 | 3,300 | 23,994 | 15% |
| 5,001–10,000 | 6,720 | 47,728 | 15% |
| 10,001+ | 21,350 | 83,261 | 15% |
| All Claims | 545 | 11,224 | 8% |
Desirable Features of a Data Mining Method:
Any nonlinear relationship can be approximated
A method that works when the form of the nonlinearity is unknown
The effect of interactions can be easily determined and incorporated into the model
The method generalizes well on out-of-sample data
Decision Trees
In decision theory (for example, risk management), a decision tree is a graph of decisions and their possible consequences (including resource costs and risks) used to create a plan to reach a goal. Decision trees are constructed to help with making decisions. A decision tree is a special form of tree structure.
• www.wikipedia.org
Different Kinds of Decision Trees
Single Trees (CART, CHAID)
Ensemble Trees, a more recent development (TREENET, RANDOM FOREST)
• A composite or weighted average of many trees (perhaps 100 or more)
• There are many methods to fit the trees and prevent overfitting
• Boosting: Iminer Ensemble and Treenet
• Bagging: Random Forest
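The bagging idea behind Random Forest can be sketched in a few lines: fit many simple trees on bootstrap resamples of the data and average their predictions. In this toy sketch the "tree" is a one-split regression stump, and the data, thresholds, and function names are illustrative, not from the presentation:

```python
import random

def fit_stump(xs, ys):
    """One-split regression 'tree': pick the threshold that minimizes
    the squared error around the two leaf means."""
    best = None
    for t in sorted(set(xs)):
        left = [y for x, y in zip(xs, ys) if x < t]
        right = [y for x, y in zip(xs, ys) if x >= t]
        if not left or not right:
            continue
        ml, mr = sum(left) / len(left), sum(right) / len(right)
        sse = sum((y - ml) ** 2 for y in left) + sum((y - mr) ** 2 for y in right)
        if best is None or sse < best[0]:
            best = (sse, t, ml, mr)
    if best is None:  # degenerate resample: every x identical
        m = sum(ys) / len(ys)
        return lambda x: m
    _, t, ml, mr = best
    return lambda x: ml if x < t else mr

def bag_trees(xs, ys, n_trees=100, seed=0):
    """Bagging: fit each stump on a bootstrap resample, predict the average."""
    rng = random.Random(seed)
    n = len(xs)
    stumps = []
    for _ in range(n_trees):
        idx = [rng.randrange(n) for _ in range(n)]   # sample n rows with replacement
        stumps.append(fit_stump([xs[i] for i in idx], [ys[i] for i in idx]))
    return lambda x: sum(s(x) for s in stumps) / n_trees

# Illustrative data shaped like the IME-rate pattern in the slides
xs = [0, 100, 300, 700, 1500, 3000, 7000, 20000]
ys = [0.06, 0.08, 0.09, 0.10, 0.13, 0.15, 0.15, 0.15]
model = bag_trees(xs, ys)
```

Because each stump is a step function at a different bootstrap-chosen threshold, the average of 100 of them is a much smoother curve than any single tree, which is the practical appeal of ensembles.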
The Methods and Software Evaluated
1) TREENET
2) Iminer Tree
3) S-PLUS Tree
4) CART
5) Iminer Ensemble
6) Random Forest
7) Naïve Bayes (Baseline)
8) Logistic (Baseline)
CART Example of Parent and Children Nodes: Total Paid as a Function of Provider 2 Bill
1st Split
All Data: Mean = 11,224
Bill < 5,021: Mean = 10,770
Bill >= 5,021: Mean = 59,250
CART Example with Seven Nodes: IME Proportion as a Function of Provider 2 Bill
CART Example with Seven Step Functions: IME Proportions as a Function of Provider 2 Bill
Ensemble Prediction of Total Paid
[Chart: Treenet predicted Total Paid (y-axis, 0–60,000) plotted against Provider 2 Bill (x-axis)]
Ensemble Prediction of IME Requested
[Chart: predicted probability of IME (y-axis, 0.30–0.90) plotted against Provider 2 Bill (x-axis)]
Naive Bayes
Naive Bayes assumes conditional independence: the probability that an observation will have a specific set of values for the independent variables is the product of the conditional probabilities of observing each of the values given category cⱼ:

P(X | cⱼ) = ∏ᵢ P(xᵢ | cⱼ)
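The conditional-independence product can be sketched as a tiny frequency-counting classifier. The claim features, data, and function names below are illustrative, not from the study:

```python
from collections import defaultdict

def train_naive_bayes(rows, labels):
    """rows: list of feature tuples; labels: class for each row.
    Returns a posterior(x, c) proportional to P(c) * prod_i P(x_i | c)."""
    prior = defaultdict(int)   # class counts
    cond = defaultdict(int)    # (class, feature index, value) counts
    for x, c in zip(rows, labels):
        prior[c] += 1
        for i, v in enumerate(x):
            cond[(c, i, v)] += 1
    n = len(labels)

    def posterior(x, c):
        p = prior[c] / n                       # class prior P(c)
        for i, v in enumerate(x):
            p *= cond[(c, i, v)] / prior[c]    # P(x_i | c), assuming independence
        return p

    return posterior

# Toy claims: (bill band, attorney involved) -> was an IME requested?
rows = [("low", "no"), ("low", "no"), ("high", "yes"), ("high", "no"), ("high", "yes")]
labels = ["no", "no", "yes", "no", "yes"]
post = train_naive_bayes(rows, labels)
```

A production implementation would also smooth the counts (e.g. Laplace smoothing) so that an unseen feature value does not zero out the whole product.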
Bayes Predicted Probability IME Requested vs. Quintile of Provider 2 Bill
Naïve Bayes Predicted IME vs. Provider 2 Bill
[Chart: mean predicted probability of IME (y-axis, 0.06–0.14) plotted against Provider 2 Bill (x-axis)]
The Fraud Surrogates Used as Dependent Variables

Independent Medical Exam (IME) requested
Special Investigation Unit (SIU) referral
IME successful
SIU successful

DATA: Detailed Auto Injury Claim Database for Massachusetts, Accident Years 1995–1997
Results for IME Requested
Area Under the ROC Curve – IME Decision

| | CART Tree | S-PLUS Tree | Iminer Tree | TREENET |
|---|---|---|---|---|
| AUROC | 0.669 | 0.688 | 0.629 | 0.701 |
| Lower Bound | 0.661 | 0.680 | 0.620 | 0.693 |
| Upper Bound | 0.678 | 0.696 | 0.637 | 0.708 |

| | Iminer Ensemble | Random Forest | Iminer Naïve Bayes | Logistic |
|---|---|---|---|---|
| AUROC | 0.649 | 0.703 | 0.676 | 0.677 |
| Lower Bound | 0.641 | 0.695 | 0.669 | 0.669 |
| Upper Bound | 0.657 | 0.711 | 0.684 | 0.685 |
TREENET ROC Curve – IME (AUROC = 0.701)
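The AUROC values reported in these slides can be computed directly from model scores and observed outcomes via the rank-sum identity: AUROC equals the probability that a randomly chosen positive outscores a randomly chosen negative, with ties counting one half. A minimal sketch with made-up scores (not the study's data):

```python
def auroc(scores, labels):
    """Area under the ROC curve via the rank-sum (Mann-Whitney) identity."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    # Count pairwise wins of positives over negatives; ties count 0.5.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Illustrative scores and outcomes (1 = IME requested)
scores = [0.9, 0.8, 0.4, 0.3, 0.2]
labels = [1, 1, 0, 1, 0]
```

For large samples this O(n²) pairwise loop is usually replaced by a sort-based rank computation, but the value is identical.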
Ranking of Methods/Software – 1st Two Surrogates
Ranking of Methods by AUROC – Decision

| Method | SIU AUROC | SIU Rank | IME Rank | IME AUROC |
|---|---|---|---|---|
| Random Forest | 0.645 | 1 | 1 | 0.703 |
| TREENET | 0.643 | 2 | 2 | 0.701 |
| S-PLUS Tree | 0.616 | 3 | 3 | 0.688 |
| Iminer Naïve Bayes | 0.615 | 4 | 5 | 0.676 |
| Logistic | 0.612 | 5 | 4 | 0.677 |
| CART Tree | 0.607 | 6 | 6 | 0.669 |
| Iminer Tree | 0.565 | 7 | 8 | 0.629 |
| Iminer Ensemble | 0.539 | 8 | 7 | 0.649 |
Ranking of Methods/Software – Last Two Surrogates
Ranking of Methods by AUROC – Favorable

| Method | SIU AUROC | SIU Rank | IME Rank | IME AUROC |
|---|---|---|---|---|
| TREENET | 0.678 | 1 | 2 | 0.683 |
| Random Forest | 0.645 | 2 | 1 | 0.692 |
| S-PLUS Tree | 0.616 | 3 | 5 | 0.664 |
| Logistic | 0.610 | 4 | 3 | 0.677 |
| Iminer Naïve Bayes | 0.607 | 5 | 4 | 0.670 |
| CART Tree | 0.598 | 6 | 7 | 0.651 |
| Iminer Ensemble | 0.575 | 7 | 6 | 0.654 |
| Iminer Tree | 0.547 | 8 | 8 | 0.591 |
Plot of AUROC for SIU vs. IME Decision
[Scatter plot: AUROC for IME Decision (y-axis, 0.50–0.70) vs. AUROC for SIU Decision (x-axis, 0.50–0.65); points labeled CART, IM Ensemble, IM Tree, Logistic/Naive Bayes, Random Forest, S-PLUS Tree, Treenet]
Plot of AUROC for SIU vs. IME Favorable
[Scatter plot: AUROC for IME Favorable (y-axis, 0.55–0.70) vs. AUROC for SIU Favorable (x-axis, 0.55–0.65); points labeled IM Tree, IM Ensemble, CART, Naive Bayes, Logistic Regression, S-PLUS Tree, Random Forest, Treenet]
References

Breiman, L., J. Friedman, R. Olshen, and C. Stone, Classification and Regression Trees, Chapman & Hall, 1993.
Friedman, J., "Greedy Function Approximation: A Gradient Boosting Machine," Annals of Statistics, 2001.
Hastie, T., R. Tibshirani, and J. Friedman, The Elements of Statistical Learning, Springer, New York, 2001.
Kantardzic, M., Data Mining, John Wiley and Sons, 2003.