Data Science in Non-Life Insurance Pricing
Predicting Claims Frequencies using Tree-Based Models
Master’s Thesis
Submitted for the degree of Master of Science ETH in Mathematics
Author
Patrick Zochbauer
Supervisor
Prof. Dr. Mario V. Wuthrich
Department of Mathematics
ETH Zurich
Co-Supervisor
Dr. Christoph Buser
AXA Winterthur
Date of Submission: 22. July 2016
Abstract
Today, generalized linear models (GLM) are the standard methods in the pricing of non-life insurance products. However, they require detailed knowledge about the structure of the data, which needs to be provided a priori by the pricing actuary. In this thesis, we present more flexible tree-based methods that do not suffer from this drawback. In the first part, we discuss their theoretical properties. In the second part, we apply the methods to the problem of predicting the claims frequency of a policyholder. The empirical analysis is based on the Motor Insurance Collision Data Set of AXA Winterthur. We conclude that the regression tree methodology provides a useful tool for building homogeneous subportfolios with similar risk characteristics. The more advanced tree ensemble methods such as bagging and random forest provide competitive alternatives to GLMs in terms of predictive capacity.
Acknowledgement
I would like to express my profound gratitude to Prof. Dr. Mario V. Wuthrich for supervising this master's thesis. Our numerous and inspiring discussions truly deepened my understanding of mathematics as well as of its applications to actuarial science. I am also very thankful to Dr. Christoph Buser for co-supervising this thesis and for providing the opportunity to write this work in cooperation with AXA Winterthur. In that spirit, my gratitude also extends to AXA Winterthur, in particular to Peter Reinhard and the whole pricing team, for allowing me to be their guest and for providing a delightful atmosphere as well as invaluable support.
Finally, I thank my family and friends for their confidence and constant support during all my time at ETH Zurich.
Contents
1 Introduction 1
I Theory 3
2 Statistical Learning Theory 4
2.1 Mathematical Framework and Terminology . . . . . . . . . . . . . . . . . . . . . . . . 4
2.2 Prediction and Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.1 Prediction . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 5
2.2.2 Inference . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3 Generalization Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 7
2.3.1 Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 8
2.3.2 Bootstrap . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 9
2.4 Maximum Likelihood Method . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 10
3 Tree-Based Models 14
3.1 Basics of Tree-Based Models . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 14
3.1.1 Tree Terminology . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15
3.1.2 Growing a Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 16
3.1.3 Cost-Complexity Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 17
3.2 Modifications . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.1 Poisson Regression Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 22
3.2.2 Weighted Least Squares . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 25
3.2.3 Classification Tree . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 27
3.3 Additional Features . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.1 Categorical Covariates . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
3.3.2 Missing Data and Surrogate Splits . . . . . . . . . . . . . . . . . . . . . . . . . 29
3.3.3 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 30
3.4 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.1 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 32
3.4.2 Bagged Tree Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 34
3.4.3 Random Forest . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 35
3.4.4 Model Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 38
II Data Analysis 42
4 Motor Insurance Collision Data 43
4.1 Non-Life Insurance Pricing . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 43
4.2 Motor Insurance Collision Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 45
4.2.1 Data Management . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 46
4.2.2 Descriptive Statistics . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 47
4.2.3 Multivariate Analysis . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 55
4.3 Simulated Insurance Claims Count Data . . . . . . . . . . . . . . . . . . . . . . . . . . 56
4.3.1 Negative-Binomial Distribution . . . . . . . . . . . . . . . . . . . . . . . . . . . 57
4.3.2 Choice of Parameter . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 58
5 Data Analysis: Tree Model 60
5.1 Fitting a Tree Model in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 60
5.2 Regression Tree for Claims Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . . 62
5.3 Regression Tree for Simulated Claims Frequency . . . . . . . . . . . . . . . . . . . . . 63
5.3.1 Choice of Parameter for Poisson Tree Predictor . . . . . . . . . . . . . . . . . . 64
5.3.2 Stratified Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 66
5.3.3 Optimal Tree Size . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 68
5.3.4 Tree Size Selection via Statistical Significance Pruning . . . . . . . . . . . . . . 73
5.3.5 Variability of Data . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 75
5.3.6 Variable Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.3.7 Stability . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 79
5.4 Univariate Regression Tree for Real Data Claims Frequency . . . . . . . . . . . . . . . 81
5.5 Bivariate Regression Tree for Real Data Claims Frequency . . . . . . . . . . . . . . . . 89
5.5.1 Tree Size Selection . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 89
5.5.2 Comparison: Tree and GLM . . . . . . . . . . . . . . . . . . . . . . . . . . . . 90
5.5.3 Out of Sample Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 92
5.6 Multivariate Regression Tree for Real Data Claims Frequency . . . . . . . . . . . . . . 94
5.6.1 Out of Sample Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
5.7 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 97
6 Data Analysis: Bagging 99
6.1 Fitting a Bagged Tree Predictor in R . . . . . . . . . . . . . . . . . . . . . . . . . . . . 99
6.2 Bagging for Simulated Claims Frequency . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.1 Tree Size of Individual Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . 100
6.2.2 Number of Trees . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.2.3 Out of Sample Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 102
6.3 Bagging for Real Data Claims Frequency . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.1 Bivariate Bagged Tree Predictor . . . . . . . . . . . . . . . . . . . . . . . . . . 103
6.3.2 Multivariate Bagged Tree Predictor . . . . . . . . . . . . . . . . . . . . . . . . 104
6.3.3 Out of Sample Performance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 105
6.3.4 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 106
6.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 109
7 Data Analysis: Random Forest 110
7.1 Fitting a Random Forest Predictor in R . . . . . . . . . . . . . . . . . . . . . . . . . . 110
7.2 Random Forest for Simulated Claims Frequency . . . . . . . . . . . . . . . . . . . . . . 110
7.3 Random Forest for Real Data Claims Frequency . . . . . . . . . . . . . . . . . . . . . 111
7.3.1 Restriction to Full Durations . . . . . . . . . . . . . . . . . . . . . . . . . . . . 113
7.3.2 Visualization . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 114
7.4 Summary . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 115
8 Summary 117
8.1 Future Work . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 118
Appendices
A Proofs 120
A.1 Proof of Equation (3.2) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.2 Proof of Proposition 3.7 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 120
A.3 Proof of Equation (3.7) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 121
A.4 Proof of Equation (3.12) . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.5 Proof of Remark 4.6 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 122
A.6 Proof of Lemma 5.3 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 123
B Correlation Plot for Additional Covariates 124
C R Code 125
C.1 Variable Importance . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 125
C.2 Stratified Cross-Validation . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 127
C.3 Significance Test Pruning . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 128
C.4 Bagging . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 130
D Variable Description 131
Chapter 1
Introduction
The amount of data available in the world has exploded over the last decade. Being able to process this flood of data is one of the major challenges people and businesses face today. The potential benefits of data analytics are present in any industry, and so they are in the insurance sector. The heart of an insurance company lies in underwriting risks. Thereby actuaries need to evaluate the potential losses transferred to the insurance company and set a premium accordingly. The data used to validate these decisions is growing both in the number of explanatory variables and in the number of data points available. The standard tools to process this data are generalized linear models (GLM). The GLM methodology requires detailed a priori knowledge about the structure of the data. If the structure is known, GLMs provide high predictive capacity. However, the recent development towards collecting not only more granular data but also more and more variables raises the need for more flexible methods. In this thesis, alternative approaches, called tree models, are discussed in detail. Tree models do not suffer from the limitation of requiring a priori knowledge of the structure and can also be used when a large number of potential variables is available. We use the Motor Insurance Collision Data Set of AXA Winterthur to analyze and compare the models. Since the data has already been extensively studied and is well understood by AXA Winterthur, their GLM model provides an ideal benchmark. This thesis is organized in two parts:
Theory and Data Analysis. We begin with a discussion on the basic notions of statistical learning in
Chapter 2. Chapter 3 introduces the tree-based models: regression tree, bagging and random forest.
The second part starts with an introduction to the fundamentals of non-life insurance pricing and a
description of the Motor Insurance Collision Data Set in Chapter 4. We define the pure premium of
an insurance product as
Pure Premium = Expected Claims Frequency · Expected Claim Severity.
Based on this equation, actuaries model the two quantities claims frequency and claim severity
separately. In this thesis, we only focus on the claims frequency models. Chapters 5 to 7 are
dedicated to analyzing and applying the models introduced in the first part to insurance claims
data. Thereby a strong focus is put on understanding the individual tree predictor, since it serves
as the key building block for both bagging and random forest. To fully understand the benefits and
drawbacks of the various models, the data analysis is conducted both on simulated data and on the data set of AXA Winterthur.
Part I
Theory
Chapter 2
Statistical Learning Theory
We start with a short introduction to the main statistical concepts that we will encounter in this
thesis. The chapter is based on Buhlmann and Machler (2014), James et al. (2014) as well as
Wuthrich (2015).
2.1 Mathematical Framework and Terminology
Let $(\Omega, \mathcal{F}, P)$ be a probability space containing the random variables $X \in \mathbb{R}^p$ and $\varepsilon \in \mathbb{R}$ with $\mathbb{E}(\varepsilon) = 0$. Suppose the data point $(X, Y) \in \mathbb{R}^p \times \mathbb{R}$ is modeled by the relationship
$$Y = f(X) + \varepsilon, \qquad (2.1)$$
where $f : \mathbb{R}^p \to \mathbb{R}$ is a deterministic, but unknown function of $X = (X_1, \ldots, X_p)$. The variables $X_1, \ldots, X_p$ are called covariates, predictor or explanatory variables. The function $f$ is interpreted to capture all systematic effects of the covariates $X_1, \ldots, X_p$ on the response $Y$. The term $\varepsilon$ can be understood as random noise accounting for the measurement error or the inability of the model to capture all underlying non-systematic effects. Using $\mathbb{E}(\varepsilon) = 0$, it follows that $f(x) = \mathbb{E}(Y \mid X = x)$.
Given a training/learning set of i.i.d. previously observed data points $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, n\}$ drawn from $(X, Y)$, our goal is to find an estimate $\hat{f}(x, \mathcal{L})$ of the unknown function $f$ based on $\mathcal{L}$. We drop the argument $\mathcal{L}$ and write $\hat{f}_n(x)$ if there is no ambiguity regarding the training set $\mathcal{L}$. If the response $Y$ is continuous, we refer to the problem of determining an optimal estimate $\hat{f}_n$ as regression.
We use the convention that random variables are denoted by capital letters and their realizations by small letters. To simplify notation, vectors $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ are frequently written in bold face.
Example 2.1. The best known example of regression is linear regression, where we assume that
$$f(x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p,$$
for unknown parameters $\beta_0, \ldots, \beta_p \in \mathbb{R}$. Given the training set $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, n\}$ with $x_i = (x_{i1}, \ldots, x_{ip})$, an estimate $\hat{f}_n(x_1, \ldots, x_p) = \hat\beta_0 + \hat\beta_1 x_1 + \ldots + \hat\beta_p x_p$ can be obtained by minimizing the squared error
$$\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \big(y_i - \hat{f}_n(x_{i1}, \ldots, x_{ip})\big)^2. \qquad (2.2)$$
Figure 2.1 shows an example of linear regression for $p = 1$. The training data set $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, 25\}$ was generated from
$$Y = 1 + \tfrac{1}{2} X + \varepsilon,$$
where $X \sim \mathcal{U}[0, 100]$ and $\varepsilon \sim \mathcal{N}(0, 100)$.
Figure 2.1 – An example for linear regression with $p = 1$. The true underlying function $f(x) = 1 + 0.5x$ is indicated by the red line. The blue line shows the linear regression estimate $\hat{f}_{25}$ determined by (2.2) from $n = 25$ i.i.d. observations $(x_i, y_i)$.
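The simulation underlying Figure 2.1 can be reproduced with a few lines of R. The following sketch is illustrative; only the data-generating process is taken from the example, while seed and plotting details are our own.

# Illustrative sketch: simulate n = 25 observations from Y = 1 + 0.5*X + eps
# with X ~ U[0, 100] and eps ~ N(0, 100), i.e. standard deviation 10,
# and fit the linear regression estimate of Example 2.1.
set.seed(1)
n   <- 25
x   <- runif(n, min = 0, max = 100)
y   <- 1 + 0.5 * x + rnorm(n, mean = 0, sd = 10)
fit <- lm(y ~ x)                       # minimizes the squared error (2.2)
coef(fit)                              # estimates of beta_0 and beta_1
plot(x, y)
abline(a = 1, b = 0.5, col = "red")    # true function f(x) = 1 + 0.5x
abline(fit, col = "blue")              # fitted regression line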
2.2 Prediction and Inference
2.2.1 Prediction
In prediction we face the problem of forecasting the outcome of a random response variable $Y \sim P$ for given covariates $x_1, \ldots, x_p$. For this problem, we specify a predictor $\hat{Y}$ of $Y = f(x_1, \ldots, x_p) + \varepsilon$, e.g. by setting $\hat{Y} = \hat{f}_n(x_1, \ldots, x_p)$. In order to assess the quality of our prediction, we need a measure
of goodness. The most commonly and widely used measure, called mean squared error, is defined by
$$\mathbb{E}\big((Y - \hat{Y})^2\big).$$
Proposition 2.2. Let $(\Omega, \mathcal{F}, P)$ be a probability space containing the i.i.d. training data set $\mathcal{L} = \{(X_i, Y_i) : i = 1, \ldots, n\}$ as well as an i.i.d. data point $(X, Y)$. Assume both the data point $(X, Y)$ and the learning set $\mathcal{L}$ follow the regression model (2.1). For any predictor $\hat{Y} = \hat{f}(X, \mathcal{L})$, we have the following decomposition of the conditional mean squared error
$$\mathbb{E}\Big(\big(Y - \hat{Y}\big)^2 \,\Big|\, X = x\Big) = \mathrm{Bias}\big(\hat{f}(x, \mathcal{L})\big)^2 + \mathrm{Var}\big(\hat{f}(x, \mathcal{L})\big) + \mathrm{Var}(\varepsilon), \qquad (2.3)$$
where $\mathrm{Bias}\big(\hat{f}(x, \mathcal{L})\big) := f(x) - \mathbb{E}\big(\hat{f}(x, \mathcal{L})\big)$.
Proof. Using the regression model equation (2.1), we get that $\mathrm{Var}(Y \mid X = x) = \mathrm{Var}(\varepsilon)$. Furthermore, conditionally given $X$, $Y$ and $\hat{f}(X, \mathcal{L})$ are independent. Using this, the proof follows by a direct calculation,
$$\begin{aligned}
\mathbb{E}\Big(\big(Y - \hat{Y}\big)^2 \,\Big|\, X = x\Big) &= \mathbb{E}\Big(Y^2 + \hat{f}(X, \mathcal{L})^2 - 2\, Y \hat{f}(X, \mathcal{L}) \,\Big|\, X = x\Big) \\
&= \mathbb{E}\big(Y^2 \mid X = x\big) - \mathbb{E}(Y \mid X = x)^2 + \mathbb{E}\big(\hat{f}(x, \mathcal{L})^2\big) - \mathbb{E}\big(\hat{f}(x, \mathcal{L})\big)^2 \\
&\quad + \mathbb{E}(Y \mid X = x)^2 + \mathbb{E}\big(\hat{f}(x, \mathcal{L})\big)^2 - 2\, \mathbb{E}(Y \mid X = x)\, \mathbb{E}\big(\hat{f}(x, \mathcal{L})\big) \\
&= \mathrm{Var}(Y \mid X = x) + \mathrm{Var}\big(\hat{f}(x, \mathcal{L})\big) + \Big(\mathbb{E}(Y \mid X = x) - \mathbb{E}\big(\hat{f}(x, \mathcal{L})\big)\Big)^2 \\
&= \mathrm{Bias}\big(\hat{f}(x, \mathcal{L})\big)^2 + \mathrm{Var}\big(\hat{f}(x, \mathcal{L})\big) + \mathrm{Var}(\varepsilon),
\end{aligned}$$
where we used the conditional independence of $Y$ and $\hat{f}(X, \mathcal{L})$ in the second and $\mathrm{Var}(Y \mid X = x) = \mathrm{Var}(\varepsilon)$ in the fourth equality.
The mean squared error is obtained by taking the expectation with respect to the covariates $X$, i.e.
$$\mathbb{E}\Big(\big(Y - \hat{Y}\big)^2\Big) = \mathbb{E}\Big(\mathbb{E}\Big(\big(Y - \hat{Y}\big)^2 \,\Big|\, X\Big)\Big) = \mathbb{E}\Big(\mathrm{Bias}\big(\hat{f}(X, \mathcal{L}) \mid X\big)^2\Big) + \mathbb{E}\Big(\mathrm{Var}\big(\hat{f}(X, \mathcal{L}) \mid X\big)\Big) + \mathrm{Var}(\varepsilon), \qquad (2.4)$$
where $\mathrm{Bias}\big(\hat{f}(X, \mathcal{L}) \mid X\big) = f(X) - \mathbb{E}\big(\hat{f}(X, \mathcal{L}) \mid X\big)$. Proposition 2.2 gives more detailed insight into the behavior of the mean squared error. Unfortunately, we cannot control the variance $\mathrm{Var}(\varepsilon)$ of the noise term. Hence, our goal will be to minimize the remaining two terms. Very often it is not possible to minimize both simultaneously. This problem is commonly known as the bias-variance trade-off. Since in practice we neither know the distribution of $Y$ nor the distribution of $X$, the mean squared error cannot be calculated directly. Instead, we need to estimate it. Suppose we are
given $n$ observations $(x_i, y_i)$ for $i = 1, \ldots, n$. We use the following estimator for the (in-sample) mean squared error
$$\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}_n(x_i)\big)^2. \qquad (2.5)$$
Note that we should carefully distinguish between prediction error on data that was used during the
estimation procedure and previously unseen data. For more details, see Section 2.3 below.
2.2.2 Inference
Whereas the objective of prediction was accuracy, the aim of inference is to understand the distribution $P$ of the random response $Y$. Very often $P$ involves unknown parameters which need to be estimated. For example, given the covariates $X = x$ we can estimate the (unknown) mean of the response $Y$ in model (2.1) by an estimator $\hat\mu_Y$. If we choose $\hat\mu_Y = \hat{f}_n(x)$, then $\hat{f}_n(x)$ serves both as a predictor for $Y$ and as an estimate for $\mathbb{E}(Y \mid X = x)$, the expectation of $Y|_{X=x}$. In the regression problem (2.1), the unknown distribution $P$ of $Y$ depends on the covariates $X$. In practice, often only a few of the $p$ predictor variables are relevant for the response $Y$. Understanding the structure and dependencies, we can reduce the complexity of the model and therefore the variance of the predictor $\hat{f}(X, \mathcal{L})$. By (2.4) this eventually also improves the mean squared error.
2.3 Generalization Performance
The generalization performance is a measure for the predictive capacity of a statistical method on new, previously unseen test data. In general, to fit a method to a data set, one aims to minimize the expected error for a given loss function, e.g.
$$\mathbb{E}\big(L(Y, \hat{Y})\big) = \mathbb{E}\Big(\big(Y - \hat{Y}\big)^2\Big),$$
with loss function $L(Y, \hat{Y}) = (Y - \hat{Y})^2$, previously called mean squared error. For the sequel, we use the squared error loss to illustrate all concepts. In general, the loss function $L$ can be chosen according to the problem, see e.g. Section 2.4. Since most methods are designed to minimize the in-sample error, it is overoptimistic to use the in-sample error as an estimate for the generalization performance as well. Caring about the performance on previously unseen data, we should use a training set $\mathcal{L}$ to fit the predictor and evaluate the performance on new i.i.d. test data $(X_{\mathrm{new}}, Y_{\mathrm{new}})$. Based on that, we are interested in the expected loss given by
$$GE := \mathbb{E}\Big(\big(Y_{\mathrm{new}} - \hat{f}(X_{\mathrm{new}}, \mathcal{L})\big)^2\Big) = \mathbb{E}\Big(\mathbb{E}\Big(\big(Y_{\mathrm{new}} - \hat{f}(X_{\mathrm{new}}, \mathcal{L})\big)^2 \,\Big|\, \mathcal{L}\Big)\Big), \qquad (2.6)$$
where we use $\hat{f}(X_{\mathrm{new}}, \mathcal{L})$ as a predictor for $Y_{\mathrm{new}}$. The right-hand side of equation (2.6) states that we first fix the training set $\mathcal{L}$ and consider the expected quadratic error of $Y_{\mathrm{new}}$ for the given predictor $\hat{f}(X_{\mathrm{new}}, \mathcal{L})$. In a second step, we average over all training sets $\mathcal{L}$, i.e. over all predictors $\hat{f}(X_{\mathrm{new}}, \mathcal{L})$.
2.3.1 Cross-Validation
In practice, we need to estimate the generalization error since we usually neither know the distribution of $Y$ nor the distribution of $X$. Motivated by (2.6) we may partition the data into two independent, i.e. disjoint, sets: a training set $\mathcal{L}$ as well as a test set $\mathcal{T}$. However, there is usually only a limited amount of data available, such that it is not possible to split the data into two reasonably large disjoint sets. Cross-validation (CV) is a general method to estimate the generalization performance whilst making maximal use of the available data. For our purpose, we are going to describe $K$-fold CV: Assume we are given i.i.d. observations $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $(X, Y)$. In order to get both training and test data sets, we randomly partition the observations $\mathcal{L}$ into $K$ roughly equally sized subsets $B_k \subset \mathcal{L}$ such that $B_i \cap B_j = \emptyset$ for $i \neq j$. We then use the $k$-th set for testing and the remaining $K - 1$ sets to train the predictor, denoted by $\hat{f}(x, \mathcal{L} \setminus B_k)$. We do this for all $k = 1, \ldots, K$ and evaluate the performance on the set $B_k$. This leads to the generalization error estimate
$$\widehat{GE} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{\{i : (x_i, y_i) \in B_k\}} \big(y_i - \hat{f}(x_i, \mathcal{L} \setminus B_k)\big)^2. \qquad (2.7)$$
Equation (2.7) can be seen as an estimate of the right-hand side of equation (2.6). There are several possible choices to obtain the partition $(B_k)_{k=1,\ldots,K}$. In the remainder, we will use the following two:
Classical: Most commonly, the partition $(B_k)_{k=1,\ldots,K}$ is obtained by drawing without replacement from the data $\mathcal{L}$.
Stratified: For stratified cross-validation, we first order the data according to a stratification variable (e.g. the response). Then we form bins with the $K$ smallest observations (according to the stratification variable), then the $K$ next smallest, and so on. In the end, we draw from these bins without replacement, resulting in a partition of the data for which all subsets $B_k$ have a similar distribution of the stratification variable. For more details see Algorithm 5.1 below and the illustrative sketch that follows.
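A minimal R sketch of a stratified K-fold partition and the resulting cross-validation estimate (2.7) could look as follows; the function names (stratified_folds, cv_estimate, fit_fun) are our own illustration and not the thesis code of Appendix C.2.

# Hypothetical sketch of stratified K-fold cross-validation; fit_fun(data)
# is any fitting routine returning an object with a predict() method.
stratified_folds <- function(y, K) {
  fold   <- integer(length(y))
  ord    <- order(y)                           # order by the stratification variable
  blocks <- split(ord, ceiling(seq_along(ord) / K))
  # within each block of K consecutive ordered observations,
  # distribute the fold labels 1..K at random
  fold[unlist(blocks)] <- unlist(lapply(blocks, function(b) sample(1:K)[seq_along(b)]))
  fold
}

cv_estimate <- function(data, y_name, K, fit_fun) {
  fold <- stratified_folds(data[[y_name]], K)
  cv   <- sapply(1:K, function(k) {
    fit  <- fit_fun(data[fold != k, ])                 # train on L \ B_k
    pred <- predict(fit, newdata = data[fold == k, ])  # test on B_k
    mean((data[fold == k, y_name] - pred)^2)           # CV_k
  })
  c(GE = mean(cv), sd = sd(cv) / sqrt(K))              # estimate (2.7) and its standard deviation, cf. (2.8) below
}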
Note that in general the resulting estimators have a different distribution. Consider for example the situation where the response variable $y$ is used for stratification and the data $\mathcal{L}$ contains two outliers, i.e. for all $i \in \{1, \ldots, n\} \setminus \{k_1, k_2\}$, $y_i < y_{k_1}$ and $y_i < y_{k_2}$. In stratified cross-validation, these outliers are assigned to two different sets $B_k$ with probability one. On the other hand, in classical cross-validation, the outliers may fall into the same subset with positive probability. Although it is difficult to assess the estimate (2.7) in terms of bias and variance, empirical results by Kohavi (1995) based on simulated as well as real data suggest that stratified cross-validation is a better scheme, both in terms of bias and variance, when compared to classical cross-validation. For describing the
uncertainty of the estimate of the generalization error, we consider the standard deviation of $\widehat{GE}$. Define for $k = 1, \ldots, K$
$$CV_k := \frac{1}{|B_k|} \sum_{\{i : (x_i, y_i) \in B_k\}} \big(y_i - \hat{f}(x_i, \mathcal{L} \setminus B_k)\big)^2,$$
the performance on the $k$-th set. Assuming that $|B_1| = \ldots = |B_K|$, i.e. the $B_k$ are equally sized, and using that the data points $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d., the $CV_k$ are identically distributed as well. Unfortunately, they are in general not independent. However, assuming that the $CV_k$ are uncorrelated, it follows for the standard deviation of $\widehat{GE}$,
$$\mathrm{sd}(\widehat{GE}) = \sqrt{\mathrm{Var}\left(\frac{1}{K} \sum_{k=1}^{K} CV_k\right)} = \frac{1}{K} \sqrt{\sum_{k=1}^{K} \mathrm{Var}(CV_k)} = \frac{\mathrm{sd}(CV_1)}{\sqrt{K}}.$$
This motivates the estimator
$$\widehat{\mathrm{sd}}(\widehat{GE}) = \frac{\widehat{\mathrm{sd}}(CV_1, \ldots, CV_K)}{\sqrt{K}}, \qquad (2.8)$$
where $\widehat{\mathrm{sd}}$ denotes the sample standard deviation. For observations $z_1, \ldots, z_n$, the sample standard deviation is given by
$$\widehat{\mathrm{sd}}(z_1, \ldots, z_n) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (z_i - \hat\mu)^2}, \quad \text{with } \hat\mu = \frac{1}{n} \sum_{i=1}^{n} z_i.$$
2.3.2 Bootstrap
Introduced by Efron (1979), the bootstrap is another general tool to assess the accuracy of a statistical method. Consider the situation where the data $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ are i.i.d. realizations of an unknown probability distribution $P$. Let $\hat{f}(x, \mathcal{L})$ denote a predictor obtained from the data $\mathcal{L}$. If we want to compute any statistics of $\hat{f}(x, \mathcal{L})$, we need to know its distribution. Deriving the exact distribution is typically impossible, unless $\hat{f}$ is a simple function and $P$ a convenient distribution. The idea of Efron (1979) is as follows: If we knew the true $P$, we could simulate the distribution of $\hat{f}$ with arbitrary accuracy. However, since we do not know the true underlying $P$, we use the empirical distribution $P_n$ instead. The empirical distribution $P_n$ gives probability mass $1/n$ to every data point $(x_i, y_i)$ for $i = 1, \ldots, n$. We can simulate i.i.d. data $\mathcal{L}^* = \{(x^*_1, y^*_1), \ldots, (x^*_n, y^*_n)\}$ from $P_n$ and compute the bootstrapped predictor $\hat{f}(x, \mathcal{L}^*)$. Repeating this many times approximates the distribution of $\hat{f}(x, \mathcal{L}^*)$, which we use as an estimate of the distribution of $\hat{f}(x, \mathcal{L})$. Algorithm 2.1 shows how bootstrapping can be used to estimate the generalization error. Note that the quality of the bootstrap approximation heavily depends on the quality and richness of the data $\mathcal{L}$.
Algorithm 2.1 Out-of-bag bootstrap algorithm to estimate the generalization error
for $b = 1, \ldots, B$ do
1. Generate a bootstrap sample $\mathcal{L}^*_b := \{(x^*_{1,b}, y^*_{1,b}), \ldots, (x^*_{n,b}, y^*_{n,b})\}$ by drawing with replacement from the data $\mathcal{L}$. Define the out-of-bootstrap sample
$$\mathcal{L}^*_{b,\mathrm{out}} = \bigcup_{i=1}^{n} \big\{(x_i, y_i) : (x_i, y_i) \notin \mathcal{L}^*_b\big\}$$
of all observations not included in the bootstrap sample.
2. Compute the bootstrapped predictor $\hat{f}(x, \mathcal{L}^*_b)$.
end for
Calculate the generalization error of the bootstrap distribution by
$$\mathbb{E}_{P_n}\big((Y^* - \hat{f}(X^*, \mathcal{L}^*))^2\big) \approx \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|\mathcal{L}^*_{b,\mathrm{out}}|} \sum_{\{i : (x_i, y_i) \in \mathcal{L}^*_{b,\mathrm{out}}\}} \big(y_i - \hat{f}(x_i, \mathcal{L}^*_b)\big)^2, \qquad (2.9)$$
where $(X^*, Y^*) \sim P_n$ and $|\mathcal{L}^*_{b,\mathrm{out}}| = 1$ if $\mathcal{L}^*_{b,\mathrm{out}} = \emptyset$. Note that for $B \to \infty$, equation (2.9) becomes an equality. Furthermore, using the out-of-bootstrap sample for testing reduces over-optimism due to data points included in both training and testing.
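A minimal R sketch of Algorithm 2.1 for the squared error loss might look as follows; fit_fun and the function name oob_bootstrap are illustrative placeholders, not the thesis implementation.

# Hypothetical out-of-bag bootstrap estimate of the generalization error.
oob_bootstrap <- function(data, y_name, B, fit_fun) {
  n   <- nrow(data)
  err <- rep(NA_real_, B)
  for (b in 1:B) {
    idx <- sample(n, replace = TRUE)        # bootstrap sample L*_b
    out <- setdiff(1:n, idx)                # out-of-bootstrap observations
    if (length(out) == 0) next              # skip the degenerate case of an empty L*_{b,out}
    fit    <- fit_fun(data[idx, ])
    pred   <- predict(fit, newdata = data[out, ])
    err[b] <- mean((data[out, y_name] - pred)^2)
  }
  mean(err, na.rm = TRUE)                   # approximation of (2.9)
}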
2.4 Maximum Likelihood Method
A common procedure to obtain a predictor from previously observed data is by minimization of the in-sample sum of squares $\sum_{i=1}^{n} (y_i - \hat{f}_n(x_i))^2$ or, equivalently, of the mean squared error (2.5), see Example 2.1. However, if we knew more about the underlying distribution of the response variable, we might be able to derive a better minimization criterion. This idea leads to a generalization of the squared error loss function.
In this section, our goal is to find the model maximizing the probability that the observed data originates from that particular model. This principle is called the maximum likelihood (ML) method. Assume that we have independent data $y_1, \ldots, y_n$, where the $y_i$ are drawn from $Y_i$ described by density functions $g_{\theta_i}$ that depend on a common unknown parameter $\theta \in \mathbb{R}^k$ for some $k \in \mathbb{N}$. Due to independence, the joint density, called likelihood, of our observations is given by
$$L(\theta) = \prod_{i=1}^{n} g_{\theta_i}(y_i),$$
and the log-likelihood by
$$\log L(\theta) = \sum_{i=1}^{n} \log g_{\theta_i}(y_i).$$
In the sequel we will use the log-likelihood since the logarithm is an increasing transformation and
therefore does not change the location of the maxima, but it is more convenient to work with. The classical approach is to choose the parameter $\hat\theta$ such that the probability of observing the data $y_1, \ldots, y_n$ is maximized, i.e.
$$\hat\theta = \underset{\theta \in \mathbb{R}^k}{\arg\max}\ L(\theta) = \underset{\theta \in \mathbb{R}^k}{\arg\max}\ \log L(\theta), \qquad (2.10)$$
assuming that this maximum is well-defined.
Example 2.3. Let $N_1, \ldots, N_n$ be independent and $\mathrm{Poi}(\lambda v_i)$-distributed for given $v_i > 0$; then $\theta = \lambda$ and
$$g_{\theta_i}(k) = \frac{(\lambda v_i)^k}{k!}\, e^{-\lambda v_i}.$$
The maximum likelihood estimator of $\lambda$ is given by
$$\hat\lambda = \frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i.$$
Proof. Using the independence and Poisson assumption, we get the log-likelihood function
$$\log L(\lambda) = \sum_{i=1}^{n} \log\left(\frac{(\lambda v_i)^{N_i}}{N_i!} \exp(-\lambda v_i)\right).$$
Differentiating with respect to $\lambda$ and setting the derivative equal to zero gives
$$\frac{\partial}{\partial \lambda} \log L(\lambda) = \sum_{i=1}^{n} \left(\frac{N_i}{\lambda} - v_i\right) = 0.$$
Solving for $\lambda$ ends the proof. $\square$
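The closed-form estimator of Example 2.3 is easy to verify numerically. The following R sketch uses illustrative simulated data and also shows the well-known equivalence to an intercept-only Poisson GLM with log-exposure offset.

# Illustrative check of lambda_hat = sum(N_i) / sum(v_i)
set.seed(1)
v <- runif(50, 0.5, 1)                # volumes / exposures v_i > 0
N <- rpois(50, lambda = 0.1 * v)      # Poisson counts with true frequency 0.1
sum(N) / sum(v)                       # ML estimator of Example 2.3
# the same estimate via a Poisson GLM with log-exposure offset
exp(coef(glm(N ~ 1 + offset(log(v)), family = poisson)))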
In order to apply this methodology to the regression problem (2.1), define the conditional response $Y|_{X=x}$. Assume there exists a density function $g_{\theta(x)}$ depending on an unknown parameter function $\theta : \mathbb{R}^p \to \mathbb{R}^k$ of the covariates $x$ with $Y|_{X=x} \sim g_{\theta(x)}$. Given i.i.d. observations $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from $(X, Y)$ following the structure of (2.1), it follows that the $y_i$ are independently drawn from $Y|_{X=x_i}$. Therefore the likelihood of the observations $y_1, \ldots, y_n$ for given covariates $x_1, \ldots, x_n$ is given by
$$L(\theta) = \prod_{i=1}^{n} g_{\theta(x_i)}(y_i).$$
Analogously to equation (2.10), we try to find $\hat\theta(\cdot)$ such that
$$\hat\theta = \underset{\theta \in \mathcal{A}}{\arg\max}\ L(\theta) = \underset{\theta \in \mathcal{A}}{\arg\max}\ \log L(\theta), \qquad (2.11)$$
where $\mathcal{A} \subset C(\mathbb{R}^p, \mathbb{R}^k)$ denotes a subset of all admissible estimators for $\theta(\cdot)$ and $C(\mathbb{R}^p, \mathbb{R}^k)$ the set of all functions from $\mathbb{R}^p$ to $\mathbb{R}^k$. In the regression problem, we are not primarily interested in the function $\theta(\cdot)$ but rather in $f(\cdot)$. Since $Y|_{X=x} \sim g_{\theta(x)}$, we can write the expectation of $Y|_{X=x}$ as a function of $\theta$,
$$f(x) = \mathbb{E}(Y \mid X = x) = \int y\, g_{\theta(x)}(y)\, dy =: h(\theta(x)),$$
where $h : \mathbb{R}^k \to \mathbb{R}$ maps the unknown parameter $\theta(x)$ to the expectation of $Y|_{X=x}$.
Example 2.4. We extend Example 2.3 by assuming that, for given $v > 0$, $N|_{X=x} \sim \mathrm{Poi}(\lambda(x) v)$ and define the response
$$Y = \frac{N}{v} = f(X) + \varepsilon.$$
Then the conditional response is given by
$$Y|_{X=x} = \frac{N}{v}\Big|_{X=x} = f(x) + \varepsilon,$$
where the law of $\varepsilon$ is determined by $N$. Therefore, the parameter function $\theta(\cdot)$ is given by $\theta(x) = \lambda(x) v$ and the regression function $f(\cdot)$ by
$$f(x) = \mathbb{E}(Y \mid X = x) = \frac{1}{v}\, \mathbb{E}(N \mid X = x) = \frac{1}{v}\, \theta(x).$$
Denote by $\ell : \mathbb{R}^n \to \mathbb{R}$ the log-likelihood as a function of the expected values of the conditional responses $Y_i = Y|_{X=x_i}$ for $i = 1, \ldots, n$, given by
$$\ell(\mu) := \log L(f(x_1), \ldots, f(x_n)) = \log L(h(\theta(x_1)), \ldots, h(\theta(x_n))),$$
where $\mu := (\mathbb{E}(Y_1), \ldots, \mathbb{E}(Y_n)) = (f(x_1), \ldots, f(x_n))$. Then we can rewrite the optimization problem (2.11) with respect to the unknown function $f(\cdot)$ as
$$\hat{f} = \underset{f \in \mathcal{A}}{\arg\max}\ \ell(\mu) = \underset{f \in \mathcal{A}}{\arg\max}\ \ell(f(x_1), \ldots, f(x_n)), \qquad (2.12)$$
where $\mathcal{A}$ denotes a set of admissible estimators for $f(\cdot)$. Alternatively to maximizing $\ell(\mu)$, it is equivalent to minimize the deviance statistics
$$D(y, \mu) := 2\big(\ell(y) - \ell(\mu)\big),$$
where $y = (y_1, \ldots, y_n)$ and $\mu := (f(x_1), \ldots, f(x_n))$. The log-likelihood $\ell(y)$ denotes the log-likelihood of the saturated model, in which we choose an own parameter for every observation.
Remark 2.5. Note that the set $\mathcal{A}$ is usually determined by the method considered to fit a model, e.g. the set of all tree predictors (see Chapter 3).
Example 2.6. Assume that the data $(x_i, y_i)$ for $i = 1, \ldots, n$ are drawn from the regression model (2.1) with conditional response
$$Y|_{X=x} = f(x) + \varepsilon \sim \mathcal{N}(f(x), 1).$$
Therefore, given $x_i$, the $y_i$ are independent and $\mathcal{N}(f(x_i), 1)$-distributed for $i = 1, \ldots, n$. The log-likelihood as a function of $\mu = (f(x_1), \ldots, f(x_n))$ for the observations $y_1, \ldots, y_n$ is given by
$$\ell(\mu) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}(y_i - f(x_i))^2\right\}\right) = c(n) - \frac{1}{2} \sum_{i=1}^{n} (y_i - f(x_i))^2,$$
where $c(n)$ is a constant with respect to $\mu$. Hence the deviance statistics is given by
$$D(y, \mu) = \sum_{i=1}^{n} (y_i - f(x_i))^2.$$
In Example 2.6 the deviance statistics is proportional to the mean squared error, implying that for a normally distributed response $Y$ minimizing the mean squared error is equivalent to minimizing the deviance statistics and hence to obtaining the maximum likelihood estimator. In general, maximum likelihood estimation provides a new procedure to obtain an optimal estimate for $f$. However, the method requires additional knowledge about the distribution of the response such that the deviance statistics can be calculated. In the context of insurance data, the Poisson distribution is a common model for describing insurance claims counts. In Section 3.2.1, we will use the deviance statistics derived from the Poisson distribution to improve the optimization criterion of the tree algorithm, which we will study in the next chapter.
Chapter 3
Tree-Based Models
The considerations made in the previous section are fully general in the sense that they can be
applied to any set A of attainable predictors. However, the No Free Lunch Theorem states that
there is not one model that works best for every problem. In this thesis, we explore so-called tree
models as the set of attainable predictors. In this chapter, we introduce the concept of tree-based
models by describing the Classification and Regression Tree (CART) algorithm, which is the most
popular method to construct tree predictors. We then discuss extensions of the tree model such as
Bagging and Random Forest. The literature used in this chapter is largely based on Hastie et al.
(2001), Breiman et al. (1984), Breiman (1996), Breiman (2001) as well as Buhlmann and Machler
(2014).
3.1 Basics of Tree-Based Models
Let us assume that the data consists of $p$ predictor variables $X \in \mathbb{R}^p$ and a response $Y$. Suppose we are given $n$ i.i.d. observations $(x_i, y_i)$ for $i = 1, \ldots, n$ drawn from $(X, Y)$. The main idea of a tree-based model is to partition the covariate space $\mathcal{R}$ (assume for simplicity that $\mathcal{R} = \mathbb{R}^p$; a treatment of categorical covariates can be found in Section 3.3.1) into a set of regions $R_1, \ldots, R_M$ using recursive binary splitting. That is, we start with the whole predictor space $\mathcal{R}$ and choose a splitting variable $j$ and split-point $s \in \mathbb{R}$ such that $\mathcal{R} = R_1(j, s) \cup R_2(j, s)$ with
$$R_1(j, s) = \{x \in \mathbb{R}^p : x_j \le s\} = \mathbb{R} \times \mathbb{R} \times \cdots \times (-\infty, s] \times \mathbb{R} \times \cdots \times \mathbb{R},$$
$$R_2(j, s) = \{x \in \mathbb{R}^p : x_j > s\} = \mathbb{R} \times \mathbb{R} \times \cdots \times (s, \infty) \times \mathbb{R} \times \cdots \times \mathbb{R}.$$
Then one or both regions are again split into two more regions. The process is repeated until a stopping rule is applied. Hence, we end up with a partition $\mathcal{R} = \{R_1, \ldots, R_M\}$ of $\mathcal{R}$. A crucial implication of the recursive binary partitioning is the fact that one can identify the partition $\mathcal{R}$ with a binary tree $T$, see Section 3.1.1 as well as Figure 3.1 for an illustrative example. The tree-based
model is then given by the piecewise constant function
$$\hat{f}(x) = \sum_{m=1}^{M} \hat{c}_m\, \mathbb{1}_{\{x \in R_m\}}. \qquad (3.1)$$
It remains to give an estimator $\hat{c}_m$ for $c_m$ and an algorithm to find the splitting points. In Section 3.1.2 we describe the classical algorithm based on minimizing the sum of squares. In Section 3.2 we discuss how this can be adapted to classification problems or to the use of the deviance statistics as a measure of goodness.
Figure 3.1 – A tree-based predictor fit to the response variable claims frequency (number of claims per year) using the predictor variables age of driver and price of car. The left panel shows the partition of the two-dimensional covariate space obtained by recursive binary splitting. The blue dots show the covariates $(x_{i1}, x_{i2})_{i=1,\ldots,n}$ and the numbers within each region represent the fitted constants $\hat{c}_m$. The right panel shows the corresponding tree representation.
3.1.1 Tree Terminology
In Figure 3.1 we have identified the partition $\mathcal{R}$ obtained by binary partitioning with a binary tree. Before going into details on the tree growing procedure, we give a formal definition of a tree $T$ and introduce the common terminology used in the context of tree models.
Definition 3.1 (Node). A node $t$ is a triple $(N_t, j_t, s_t)$ with $N_t \subset \{1, \ldots, n\}$, $j_t \in \{1, \ldots, p\}$ and $s_t \in \mathbb{R}$.
Therefore each node consists of a splitting variable $j_t$ and a split-point $s_t$ as well as a subset $N_t$ of all observations. Given a node $t$, we can define the function left mapping
$$t \mapsto \mathrm{left}(t) = \{i \in N_t : x_{i j_t} \le s_t\} \subset N_t,$$
as well as the function right mapping
$$t \mapsto \mathrm{right}(t) = \{i \in N_t : x_{i j_t} > s_t\} = N_t \setminus \mathrm{left}(t) \subset N_t.$$
The functions left and right define the two regions into which the observations at node $t$ are split. The node containing all observations $i = 1, \ldots, n$ will be of special importance and hence is given a separate name.
Definition 3.2 (Root). A node $t$ is called root if $N_t = \{1, \ldots, n\}$.
Finally, the index sets of the regions $R_m \in \mathcal{R} = \{R_1, \ldots, R_M\}$ are called leaves in the tree terminology.
Definition 3.3 (Leaf). A leaf $L$ is a subset of $\{1, \ldots, n\}$.
Based on the definitions of a root, nodes and leaves, we can define a tree as a set of nodes containing a root node and leaves.
Definition 3.4 (Tree). A set of nodes and leaves $T = \{t_1, \ldots, t_K, L_1, \ldots, L_M\}$ is called a tree if
1. there exists a root node $t \in \{t_1, \ldots, t_K\}$, write $t = \mathrm{root}(T)$,
2. for every non-root node $t \in \{t_1, \ldots, t_K\} \setminus \{\mathrm{root}(T)\}$ there exists a node $s \in \{t_1, \ldots, t_K\}$ such that $N_t = \mathrm{left}(s)$ or $N_t = \mathrm{right}(s)$,
3. for every leaf $L \in \{L_1, \ldots, L_M\}$ there exists a node $s \in \{t_1, \ldots, t_K\}$ such that $L = \mathrm{left}(s)$ or $L = \mathrm{right}(s)$.
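For concreteness, a node in the sense of Definition 3.1 and the functions left and right could be represented in R as follows; this is a toy illustration of the terminology, not how rpart stores trees internally.

# Toy representation of a node t = (N_t, j_t, s_t) acting on a covariate matrix x.
node  <- function(N, j, s) list(N = N, j = j, s = s)
left  <- function(t, x) t$N[x[t$N, t$j] <= t$s]       # observations sent left
right <- function(t, x) setdiff(t$N, left(t, x))      # observations sent right

# Example: the root node (N_t = {1,...,6}) split on covariate 2 at s = 0.5
x  <- matrix(runif(12), nrow = 6, ncol = 2)
t1 <- node(N = 1:6, j = 2, s = 0.5)
left(t1, x); right(t1, x)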
3.1.2 Growing a Regression Tree
Having introduced the common terminology, we next focus on how to obtain the partition $\mathcal{R}$. For regression the most common minimization criterion is the sum of squares $\sum_{i=1}^{n} (y_i - \hat{f}_n(x_i))^2$, which is equivalent to minimizing the mean squared error (2.5). Given a partition $\mathcal{R}$, the estimate $\hat{c}_m$ minimizing the sum of squares in each region $R_m$ is given by
$$\hat{c}_m = \frac{1}{n_m} \sum_{\{i : x_i \in R_m\}} y_i, \qquad (3.2)$$
where $n_m := |\{i : x_i \in R_m\}|$ is the number of observations in region $R_m$ (see the proof in Appendix A.1). Unfortunately, finding the best binary partition $\mathcal{R}$ which minimizes the overall sum of squares is computationally infeasible; therefore we are going to describe a greedy procedure, called the CART algorithm:
Algorithm 3.1 Classification and Regression Tree (CART) Algorithm
Step 1 - Initialization: Initialize the pairs of half-planes
$$R_1(j, s) = \{x : x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x : x_j > s\}.$$
Find the splitting variable $j$ and split-point $s$ of these half-planes such that
$$\min_{j, s} \left[ \min_{c_1} \sum_{\{i : x_i \in R_1(j, s)\}} (y_i - c_1)^2 + \min_{c_2} \sum_{\{i : x_i \in R_2(j, s)\}} (y_i - c_2)^2 \right]. \qquad (3.3)$$
Fortunately, the inner minimizations are simply given by
$$\hat{c}_1 = \frac{1}{|\{i : x_i \in R_1(j, s)\}|} \sum_{\{i : x_i \in R_1(j, s)\}} y_i \quad \text{and} \quad \hat{c}_2 = \frac{1}{|\{i : x_i \in R_2(j, s)\}|} \sum_{\{i : x_i \in R_2(j, s)\}} y_i.$$
Hence the minimization problem (3.3) can be solved very quickly, leading to a partition $\mathcal{R} = R_1 \cup R_2$ with regions $R_1 = R_1(\hat{j}, \hat{s})$ and $R_2 = R_2(\hat{j}, \hat{s})$, where $(\hat{j}, \hat{s})$ denotes the argmin of equation (3.3).
Step 2 - Recursion: To get the next splitting variable as well as split-point, repeat the above procedure on each region contained in $\mathcal{R}$, resulting in $|\mathcal{R}|$ different partitions $\mathcal{R}^k$ for $k = 1, \ldots, |\mathcal{R}|$. In order to get one new region, choose the one that minimizes the overall sum of squares
$$K^+ = \underset{k \in \{1, \ldots, |\mathcal{R}|\}}{\arg\min} \sum_{i=1}^{n} \big(y_i - \hat{f}^k_n(x_i)\big)^2,$$
where $\hat{f}^k_n$ denotes the tree predictor based on the partition $\mathcal{R}^k$. Update the partition $\mathcal{R} = \mathcal{R}^{K^+}$.
Step 3 - Stop: Repeat Step 2 until a stopping rule is fulfilled, see Remark 3.9.
A tree predictor obtained by the above procedure will be called an anova regression tree. In R, the package rpart provides an efficient implementation, see Section 5.1. A generalization of Algorithm 3.1 to classification can be found in Section 3.2.3.
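For illustration, an anova regression tree as in Algorithm 3.1 can be grown with rpart as follows; the data and variable names below are made up for this sketch, while the actual model fits are discussed in Chapter 5.

# Illustrative rpart call for an anova regression tree (squared error loss).
library(rpart)
set.seed(1)
d <- data.frame(age = runif(500, 18, 80), price = runif(500, 5000, 80000))
d$freq <- ifelse(d$age < 25, 0.12, 0.06) + rnorm(500, sd = 0.02)   # toy response

tree <- rpart(freq ~ age + price, data = d, method = "anova",
              control = rpart.control(cp = 0.01, minbucket = 20))
printcp(tree)           # cost-complexity table, see Section 3.1.3
plot(tree); text(tree)  # tree representation as in Figure 3.1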
3.1.3 Cost-Complexity Pruning
Repeating Step 2 of Algorithm 3.1 may produce a very large tree. In an extreme case, we have as many regions as observations, i.e. $M = n$, and the (in-sample) mean squared error is equal to zero. However, it is highly likely that the predictor overfits the observations, implying poor performance for predicting unseen data. On the other hand, if the tree size is too small, we may not be able to capture the true underlying structure. Hence the tree size is the crucial tuning parameter, which
needs to be chosen from the data. In order to control the tree size, we describe cost-complexity pruning. Assume that we grew a large tree $T_{\max}$ such that no further splits were possible. We call $T \subset T_{\max}$ a subtree if it can be obtained by pruning $T_{\max}$, i.e. by collapsing any number of nodes.
Definition 3.5 (Subtree). Let $T_1 = \{t^1_1, \ldots, t^1_{K_1}, L^1_1, \ldots, L^1_{M_1}\}$ and $T_2 = \{t^2_1, \ldots, t^2_{K_2}, L^2_1, \ldots, L^2_{M_2}\}$ be trees. Then $T_1$ is a subtree of $T_2$, write $T_1 \subset T_2$, if
1. $\mathrm{root}(T_1) = \mathrm{root}(T_2)$,
2. $\{t^1_1, \ldots, t^1_{K_1}\} \subset \{t^2_1, \ldots, t^2_{K_2}\}$.
See Figure 3.2 for an illustrative example. Let $|T| = M$ denote the number of leaves in $T$. For given $\alpha \ge 0$, we define the cost-complexity criterion
$$C_\alpha(T) = R(T) + \alpha |T|, \qquad (3.4)$$
where
$$R(T) = \sum_{m=1}^{|T|} \sum_{\{i : x_i \in R_m\}} (y_i - \hat{c}_m)^2$$
is the total sum of squares of the tree $T$. The goal is to find the subtree $T(\alpha) \subset T_{\max}$ that minimizes $C_\alpha(T)$. The tuning parameter $\alpha$ controls how strongly we penalize large trees. For $\alpha = 0$, we obtain $T(\alpha) = T_{\max}$, whereas if $\alpha \to \infty$, the resulting subtree is only based on the root, which fits the overall mean
$$\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} y_i$$
to all observations. Therefore minimizing $C_\alpha(T)$ leads to a subtree that can be seen as an optimal trade-off between a good fit and keeping the tree as small as possible.
Definition 3.6 (Optimally Pruned Subtree). Given $\alpha \ge 0$, a subtree $T(\alpha)$ of $T$ is called optimally pruned subtree if the following holds true:
1. $C_\alpha(T(\alpha)) = \min_{T' \subset T} C_\alpha(T')$,
2. if $C_\alpha(T') = C_\alpha(T(\alpha))$, then $T(\alpha) \subset T'$.
By definition, if such a subtree exists it is unique, since we choose it to be the smallest one. The question arising is whether or not such an optimally pruned subtree exists. Note that there are only finitely many subtrees of $T$, hence we can always find at least one which fulfills the first condition. What if there exist two minimizers $T_1$ and $T_2$ of (3.4) such that neither is a subtree of the other? We show in the next proposition that this situation cannot occur, i.e. we can establish both uniqueness and existence of the optimally pruned subtree.
Proposition 3.7. Let $T$ be a tree. For every $\alpha \ge 0$, there exists a unique optimally pruned subtree $T(\alpha)$. Additionally, if $\alpha_1 \le \alpha_2$, it holds that $T(\alpha_2) \subset T(\alpha_1)$.
The proof is based on induction and can be found in Appendix A.2. It remains to give an efficient algorithm to determine the optimally pruned subtrees. Let $\#T$ be the number of subtrees of $T$. If $T$ is a root tree, then $\#T = 1$. Otherwise, we have
$$\#T = \#T_L \cdot \#T_R + 1,$$
where $T_L$ and $T_R$ are the two primary branches of $T$, see Figure A.1. This indicates that the maximum number of subtrees grows rapidly in the number of leaves. However, there is no need to search through all possible subtrees to find $T(\alpha)$. When increasing $\alpha$ the optimal subtree $T(\alpha)$ becomes smaller until it only consists of the root tree, see Figure 3.3. Since there are only finitely many subtrees, we obtain an increasing sequence of $\alpha_k$ such that for $\alpha_k \le \alpha < \alpha_{k+1}$, $T(\alpha) = T(\alpha_k)$. By Proposition 3.7, we know that for $\alpha_k \le \alpha < \alpha_{k+1}$, $T(\alpha_{k+1}) \subset T(\alpha_k) = T(\alpha)$. In order to find
Figure 3.2 – The figure shows the tree $T_1$ and all its subtrees. The response variable predicted by $T_1$ is the claims frequency based on the covariates nationality group (nat gr) and age of driver (age). The dotted arrows correspond to the pruning process described in Figure 3.3.
the sequence $\alpha_k$ with the corresponding optimally pruned subtrees, we start with $\alpha_1 = 0$ and by construction $T(\alpha_1) = T_{\max}$. To find $\alpha_2$ and $T(\alpha_2)$, we denote for any node $t \in T_{\max}$ by $T_t$ the subtree with root $t$ and by $\{t\}$ the tree consisting of the single node $t$. As long as
$$C_\alpha(T_t) < C_\alpha(\{t\}) \qquad (3.5)$$
the branch $T_t$ has a smaller cost-complexity than the single node $\{t\}$. Solving this inequality for $\alpha$ leads to a critical value $\bar\alpha$ with
$$\alpha < \frac{R(\{t\}) - R(T_t)}{|T_t| - 1} =: \bar\alpha;$$
for $\alpha = \bar\alpha$ the inequality (3.5) becomes an equality and therefore $\{t\}$ is preferable. Next we define the function $g(t)$ by
$$g(t) := \begin{cases} \dfrac{R(\{t\}) - R(T_t)}{|T_t| - 1} & \text{if } t \text{ is a node of } T_{\max}, \\ +\infty & \text{else.} \end{cases}$$
Then the weakest link $\bar{t}$ of $T_{\max}$ is given by
$$\bar{t} = \underset{t \in T_{\max}}{\arg\min}\ g(t),$$
meaning that, when increasing $\alpha$, $\bar{t}$ is the first node such that $C_\alpha(\{\bar{t}\}) = C_\alpha(T_{\bar{t}})$. Define $\alpha_2 := g(\bar{t}) = \min_{t \in T_{\max}} g(t)$ and $\bar{T}$ as the tree obtained by cutting away the branch $T_{\bar{t}}$. Then $\bar{T}$ minimizes $C_{\alpha_2}(T)$ over all subtrees of $T_{\max}$. Hence by uniqueness, we have that $\bar{T} = T(\alpha_2)$. Since we know that $T(\alpha_3) \subset T(\alpha_2)$, we can simply set $T_{\max} = T(\alpha_2)$ and repeat the above procedure to obtain $\alpha_3$ as well as $T(\alpha_3)$. Therefore the problem of finding the optimally pruned subtree can be implemented
Figure 3.3 – Each line shows the cost-complexity criterion $C_\alpha(T)$ for a subtree $T$ of $T_1$ of Figure 3.2. The plot illustrates the pruning process as a function of $\alpha$. For a fixed $\alpha$ there always exists a unique subtree minimizing $C_\alpha(T)$. On the x-axis we see the sequence $\alpha_k$ of cost-complexity parameters. Note that $T_2$ and $T_3$ have the same slope, i.e. the same number of leaves. Since $R(T_2) < R(T_3)$, we always prefer $T_2$ over $T_3$. In the figure, $C_\alpha(T_3)$ is shown by the dotted line.
as a computationally rapid recursive pruning. Figure 3.3 shows the cost-complexity criteria for the subtrees shown in Figure 3.2. The slopes represent the tree sizes and the y-intercepts the measure of goodness of the respective trees. It illustrates how the optimally pruned subtree becomes smaller when increasing $\alpha$. It remains to determine the optimal choice of $\alpha$. An estimate is usually obtained by cross-validation, see Section 2.3.1 and Section 5.1.
Remark 3.8. In the rpart package the following scaled version of the cost-complexity criterion (3.4) is used,
$$C_{cp}(T) = R(T) + cp\, R(T(\infty))\, |T|,$$
with
$$cp := \frac{\alpha}{R(T(\infty))}.$$
This has the advantage that for $cp = 1$, the optimally pruned tree is given by the root tree. To show this, let $T$ be a tree. Then the root tree $T(\infty)$ is preferred over $T$ for all $\alpha \ge 0$ with
$$C_\alpha(T(\infty)) \le C_\alpha(T),$$
which is equivalent to
$$\alpha \ge \frac{R(T(\infty)) - R(T)}{|T| - 1}.$$
Since this holds for any tree $T$ with $R(T) \ge 0$ and $|T| \ge 2$, we can take the maximum over all $T$ on both sides, which is attained at $R(T) = 0$ and $|T| = 2$. Hence we get that $\alpha \ge R(T(\infty))$, which is equivalent to $cp \ge 1$. Choosing $cp = 0$ is equivalent to $\alpha = 0$, hence the optimally pruned subtree coincides with the fully grown tree.
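In rpart, the choice of cp and the pruning step are typically carried out via the cross-validated cost-complexity table; the following is a sketch, assuming a tree object grown with a small cp as in the example above.

# Sketch: pick cp with minimal cross-validated error and prune the tree.
cp_tab  <- tree$cptable                                  # columns CP, nsplit, rel error, xerror, xstd
best_cp <- cp_tab[which.min(cp_tab[, "xerror"]), "CP"]
pruned  <- prune(tree, cp = best_cp)                     # optimally pruned subtree for this cp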
Remark 3.9. In practice it may take a large amount of computing time to fully grow a tree until no further split is possible. Therefore, let $T$ be the current tree. Denote by $T(\alpha)$ the largest optimally pruned subtree of $T$ with $|T| > |T(\alpha)|$, i.e. a strict subtree. Let $\bar\alpha > \alpha$ be such that $T(\bar\alpha) \subset T(\alpha)$ denotes the next (strictly) smaller optimally pruned subtree. We stop the growing procedure if the improvement of fit of the larger optimally pruned subtree $T(\alpha)$ over the next smaller optimally pruned subtree $T(\bar\alpha)$ is smaller than a user-specified constant $c \in [0, 1]$. According to the cost-complexity pruning procedure, $T(\bar\alpha)$ is preferred over $T(\alpha)$ for all $\alpha' \ge 0$ such that
$$C_{\alpha'}(T(\bar\alpha)) \le C_{\alpha'}(T(\alpha)).$$
This is equivalent to
$$\alpha' \ge \frac{R(T(\bar\alpha)) - R(T(\alpha))}{|T(\alpha)| - |T(\bar\alpha)|},$$
or, using the parametrization of the rpart package,
$$cp \ge \frac{1}{|T(\alpha)| - |T(\bar\alpha)|} \left( \frac{R(T(\bar\alpha))}{R(T(\infty))} - \frac{R(T(\alpha))}{R(T(\infty))} \right). \qquad (3.6)$$
The rpart implementation contains the option to bound the right-hand side of (3.6) by the constant $c$, therefore imposing a condition on the maximum tree size. In combination with Remark 3.8, we see that choosing $c = 0$ results in growing a tree as large as possible, whereas $c = 1$ only grows the root tree. Also note that when using the squared error loss (as in Algorithm 3.1) the right-hand side of equation (3.6) is equal to the improvement in the coefficient of determination $R^2$ scaled by the difference in the number of leaves (see Chapter 7.2.2 in Wuthrich (2015)).
Remark 3.10. We assumed that $C_\alpha(T) = R(T) + \alpha|T|$, where $R(T) = \sum_{m=1}^{|T|} \sum_{\{i : x_i \in R_m\}} (y_i - c_m)^2$. However, we never used the explicit form of $R(T)$. Therefore, the results on cost-complexity pruning are valid in more generality for any measure of goodness $R(T)$. J
3.2 Modifications
In this section, we discuss several extensions of Algorithm 3.1. The modifications are chosen in view
of the insurance data set we analyze in Part 2 of this thesis. Some of the proposed extensions were
originally used in cancer research to model survival data of cancer patients. The material of this
section is based on LeBlanc and Crowley (1991).
3.2.1 Poisson Regression Tree
Based on the idea of Section 2.4, we generalize Algorithm 3.1 to measures of goodness derived from
the deviance statistics. In this section, we describe the modification for a Poisson-distributed count
variable $N$. Given $v > 0$, we define the response, often called the frequency,
$$Y = \frac{N}{v} = f(X) + \varepsilon,$$
with $N \,|\, X = x \sim \text{Poi}(\lambda(x) v)$. For i.i.d. observations $(x_1, N_1, v_1), \dots, (x_n, N_n, v_n)$ with $v_i > 0$, the deviance statistics associated to the Poisson-distributed counts $N_i$ is given by
$$2 \sum_{i=1}^{n} \left( N_i \log\!\left( \frac{N_i}{\lambda(x_i) v_i} \right) - \big( N_i - \lambda(x_i) v_i \big) \right), \qquad (3.7)$$
where the function $N_i \mapsto N_i \log N_i$ can be continuously extended to $\mathbb{N}_0$ by setting
$$N_i \mapsto \begin{cases} N_i \log N_i & \text{if } N_i > 0, \\ 0 & \text{if } N_i = 0. \end{cases}$$
This extension will be used throughout this text. See Appendix A.3 for the proof of equation (3.7).
Assuming that the data originates from a Poisson-distributed count variable N , we can adapt the
minimization criterion in Algorithm 3.1 to find a splitting variable $j$ and a split-point $s$ such that
$$\min_{j,s} \Bigg[ \min_{\lambda_1} \sum_{\{i : x_i \in R_1(j,s)\}} \left( N_i \log\!\left( \frac{N_i}{\lambda_1 v_i} \right) - (N_i - \lambda_1 v_i) \right) + \min_{\lambda_2} \sum_{\{i : x_i \in R_2(j,s)\}} \left( N_i \log\!\left( \frac{N_i}{\lambda_2 v_i} \right) - (N_i - \lambda_2 v_i) \right) \Bigg],$$
where the inner minimizations are given by
$$\hat\lambda_1 = \frac{1}{\sum_{\{i : x_i \in R_1(j,s)\}} v_i} \sum_{\{i : x_i \in R_1(j,s)\}} N_i, \quad \text{resp.} \quad \hat\lambda_2 = \frac{1}{\sum_{\{i : x_i \in R_2(j,s)\}} v_i} \sum_{\{i : x_i \in R_2(j,s)\}} N_i. \qquad (3.8)$$
Equation (3.8) follows by Example 2.3, since minimizing the deviance statistics is equivalent to maximizing the likelihood. The procedure leads to a maximum likelihood model for each leaf of the tree. However, we need to be careful when using the minimizer (3.8), since it can happen that either $\hat\lambda_1 = 0$ or $\hat\lambda_2 = 0$, e.g. if all $N_i$ are zero within the chosen group. In this case, the Poisson model is not well defined (degenerate). A possible solution is the use of a Bayesian estimator for $\lambda$. The idea is to assume the data originates from a Poisson-Gamma model. The following results on Poisson-Gamma models are based on Chapter 8 of Wuthrich (2015).
Definition 3.11 (Poisson-Gamma model). Given fixed values $v_i > 0$ for $i = 1, \dots, m$, assume that:
1. Conditionally, given $\Lambda$, the components of $\mathbf{N} = (N_1, \dots, N_m)$ are independent with $N_i \sim \text{Poi}(\Lambda v_i)$.
2. $\Lambda \sim \Gamma(\gamma, c)$ with prior parameters $\gamma > 0$ and $c > 0$.
We denote by $\lambda_0 = E(\Lambda) = \gamma/c$ the a priori expected frequency. However, having observed $N_1, \dots, N_m$, the aim is to determine the posterior estimate for the expectation of $\Lambda$ given by
$$\lambda^{\mathrm{post}} := E(\Lambda \,|\, N_1, \dots, N_m).$$
Proposition 3.12. Assume $\mathbf{N} = (N_1, \dots, N_m)$ follows a Poisson-Gamma model of Definition 3.11. Then the posterior estimator $\lambda^{\mathrm{post}}$ has the following form
$$\lambda^{\mathrm{post}} = \alpha \hat\lambda + (1 - \alpha)\lambda_0,$$
with credibility weight $\alpha$ and observation based estimator $\hat\lambda$ given by
$$\alpha = \frac{\sum_{i=1}^{m} v_i}{c + \sum_{i=1}^{m} v_i} \quad \text{and} \quad \hat\lambda = \frac{1}{\sum_{i=1}^{m} v_i} \sum_{i=1}^{m} N_i.$$
A proof of Proposition 3.12 can be found in Chapter 8 of Wuthrich (2015). Assume that the data $\mathbf{N}_l = (N_1, \dots, N_{m_l})$ in each node $l$, where $m_l$ denotes the number of observations in node $l$, is
described by a Poisson-Gamma model. Setting $\sigma^2 := \operatorname{Var}(\Lambda) = \gamma/c^2$, the coefficient of variation is given by $\operatorname{Vco}(\Lambda) = \sigma/\lambda_0 = 1/\sqrt{\gamma} =: k$. We can write the prior distribution parameter $c$ as
$$c = \sqrt{\frac{\gamma}{\sigma^2}} = \frac{1}{k\sigma} = \frac{1}{k^2 \lambda_0}.$$
Denote by
$$\hat\lambda_0 = \frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i$$
the overall maximum likelihood estimator as an estimate for the prior mean $\lambda_0$. Using Proposition 3.12, we get the estimator
$$\hat\lambda_l^{\mathrm{post}} = \hat\alpha_l \hat\lambda_l + (1 - \hat\alpha_l)\, \hat\lambda_0, \qquad (3.9)$$
where the weights $\hat\alpha_l$ can be written as
$$\hat\alpha_l = \frac{\sum_{i=1}^{m_l} v_i}{\frac{1}{k^2 \hat\lambda_0} + \sum_{i=1}^{m_l} v_i}. \qquad (3.10)$$
Comparing the estimators (3.8) and (3.9), we see that the latter is a credibility weighted average of the overall average and the Poisson MLE estimator. Note that estimator (3.9) is non-optimal in a credibility context. One could instead use the homogeneous credibility estimator, where we would choose
$$\hat\lambda_0 = \frac{1}{\sum_{l=1}^{M} \hat\alpha_l} \sum_{l=1}^{M} \hat\alpha_l \hat\lambda_l,$$
see Buhlmann and Gisler (2005).
Remark 3.13. Note that the coefficient of variation $k$ of $\Lambda$ plays the role of a tuning parameter for the credibility weights, i.e. $\hat\alpha_l = \hat\alpha_l(k)$. However, there may be a different $\Lambda_l$ associated to each leaf $l$, hence $k$ should be estimated separately for each possible split. The implementation available within the rpart package does not take this into account and uses an a priori estimate $k$ for all leaves. Estimating $k$ from data is further discussed in Section 5.3.1. J
The interpretation of the credibility weights $\hat\alpha_l$ is as follows:
1. If $k \to 0$, then $\hat\alpha_l \to 0$, i.e. when the variation of the prior tends to zero, we only believe in the prior expectation $\hat\lambda_0$.
2. If $k \to \infty$, then $\hat\alpha_l \to 1$, i.e. for an uninformative prior, we only believe in the observation based estimator $\hat\lambda_l$.
3. If $\sum_{i=1}^{m_l} v_i \to \infty$, then $\hat\alpha_l \to 1$, i.e. we only believe in the observation based estimator.
Remark 3.14. Regarding an application to insurance data, the third property is rather interesting. Since $\sum_{i=1}^{m_l} v_i$ usually denotes the volume of a subportfolio, the property formalizes that for large portfolios, we believe more in the observation based estimator. J
Remark 3.15. Note that using estimator (3.9) instead of estimator (3.8), we cannot guarantee that a further split decreases the deviance statistics. Consider a splitting variable $j$ and a split-point $s$ with corresponding regions $R_1(j,s)$ and $R_2(j,s)$. Let $\mu(\lambda) = (\lambda v_1, \dots, \lambda v_n)$, $\mu_1(\lambda) = (\lambda v_i : i \in R_1(j,s))$ and $\mu_2(\lambda) = (\lambda v_i : i \in R_2(j,s))$. Then it follows that
$$D(\mathbf{y}, \mu(\hat\lambda)) = \min_\lambda D(\mathbf{y}, \mu(\lambda)) = D(\mathbf{y}_1, \mu_1(\hat\lambda)) + D(\mathbf{y}_2, \mu_2(\hat\lambda)) \ge \min_\lambda D(\mathbf{y}_1, \mu_1(\lambda)) + \min_\lambda D(\mathbf{y}_2, \mu_2(\lambda)) = D(\mathbf{y}_1, \mu_1(\hat\lambda_1)) + D(\mathbf{y}_2, \mu_2(\hat\lambda_2)),$$
where $D$ denotes the Poisson deviance statistics, $\mathbf{y}_1 = (N_i : i \in R_1(j,s))$, $\mathbf{y}_2 = (N_i : i \in R_2(j,s))$ and $\hat\lambda, \hat\lambda_1, \hat\lambda_2$ are derived from estimator (3.8). Using the Bayesian estimator (3.9) this does not hold true in general. Hence it may occur that at some point the best split does not lead to a decrease in the deviance statistics. Note that the rpart implementation only considers splits which decrease the deviance statistics. This may be problematic if there is a strong shrinkage to the prior mean. However, a close inspection of the rpart source code (see the poisson.c file, lines 199-200 of the rpart source code, Therneau et al. (2015)) reveals that whilst growing the tree, the MLE estimator (3.8) is used. Only the predictions in the leaves are obtained by the Bayesian estimator (3.9) (see the poisson.c file, lines 119-128), where the parameter $k$ ensures that $\hat\lambda > 0$ and that the Poisson model is well-defined. Allowing $\hat\lambda = 0$ during the growing procedure is not problematic, since it implies that the deviance statistics (3.7) becomes infinite. Such a split will not decrease the deviance statistics and is therefore not considered by the algorithm anyway. J
Due to Remark 3.10, the pruning process described in Section 3.1.3 adapts directly to the use of the
deviance statistics as the new measure of goodness.
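As a sketch of how a Poisson regression tree with the deviance statistics as measure of goodness can be requested from rpart, consider the following call. The column names (nc for the number of claims, yr for the years at risk, and the covariates) are hypothetical placeholders; the coefficient of variation k of Remark 3.13 enters through the shrink element of parms:

library(rpart)

k <- 1   # a priori coefficient of variation of the Gamma prior, cf. Remark 3.13

poisson_tree <- rpart(cbind(yr, nc) ~ age + price + km,   # exposure first, counts second
                      data = policies,
                      method = "poisson",
                      parms = list(shrink = k),
                      control = rpart.control(cp = 0.001, minbucket = 1000))

# the leaf predictions are frequencies obtained with the Bayesian estimator (3.9)
head(predict(poisson_tree, newdata = policies))

The control values cp = 0.001 and minbucket = 1000 are purely illustrative; a reasonable choice of minbucket for insurance claims data is discussed in Section 5.3.1.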
3.2.2 Weighted Least Squares
Alternatively to the likelihood based deviance statistics, the tree performance can also be assessed
by a weighted least squares measure. This method is very similar to Algorithm 3.1, except for the use of weights $w_i$ in the minimization criterion,
$$\min_{j,s} \left[ \min_{c_1} \sum_{\{i : x_i \in R_1(j,s)\}} w_i (y_i - c_1)^2 + \min_{c_2} \sum_{\{i : x_i \in R_2(j,s)\}} w_i (y_i - c_2)^2 \right]. \qquad (3.11)$$
The inner minimizations are given by
$$c_1 = \frac{1}{\sum_{\{i : x_i \in R_1(j,s)\}} w_i} \sum_{\{i : x_i \in R_1(j,s)\}} w_i y_i, \quad \text{resp.} \quad c_2 = \frac{1}{\sum_{\{i : x_i \in R_2(j,s)\}} w_i} \sum_{\{i : x_i \in R_2(j,s)\}} w_i y_i. \qquad (3.12)$$
See Appendix A.4 for a proof of equation (3.12).
Remark 3.16. If $w_i = 1$ for $i = 1, \dots, n$, we obtain the classical case described in Section 3.1.2. J
Example 3.17. Analogously to Section 3.2.1, assume that for given $v > 0$ the response originates from a count variable $N$ such that
$$Y = \frac{N}{v} = f(X) + \varepsilon \quad \text{with} \quad \operatorname{Var}(Y \,|\, X = x) = g(v), \qquad (3.13)$$
for a strictly decreasing function $g : [0,\infty) \to [0,\infty)$. This means that the variance decreases in $v$, which is called heteroscedasticity. Since $g$ is strictly decreasing, it is injective and hence invertible on its range; its inverse function $h$ is again strictly decreasing. Assume we are given the i.i.d. observations $(x_1, N_1, v_1), \dots, (x_n, N_n, v_n)$ with $y_i = N_i/v_i$. When choosing the weights $w_i = v_i$, the weighted sum of squares is given by
$$\sum_{i=1}^{n} w_i (y_i - c)^2 = \sum_{i=1}^{n} v_i (y_i - c)^2 = \sum_{i=1}^{n} h(\operatorname{Var}(Y \,|\, X = x_i))\, (y_i - c)^2,$$
i.e. observations with large variance receive less weight in the optimization criterion. By (3.13) and $w_i = v_i$ the minimizer (3.12) is given by
$$\hat\lambda = \frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i,$$
which coincides with the estimator used for Poisson regression trees, without making any distributional assumptions. However, in general the resulting trees do not coincide, since the optimization criteria differ. Moreover, there is no difficulty with $\hat\lambda = 0$, although this does not make sense in the context of modeling claims frequencies.

Alternatively, we could use $w_i = 1/\operatorname{Var}(Y \,|\, X = x_i) = 1/g(v_i)$; then the minimization criterion becomes
$$\sum_{i=1}^{n} w_i (y_i - c)^2 = \sum_{i=1}^{n} \left( \frac{y_i - c}{\sqrt{g(v_i)}} \right)^2.$$
This means that we first normalize the variance of the residuals to one and then consider the squared error. However, to obtain the function $g(\cdot)$ we need to make distributional assumptions such that we are able to derive $g(\cdot) = \operatorname{Var}(Y \,|\, X = \cdot)$. Also note that $g(v_i)$ may depend on $f(x_i)$, therefore in general this approach is not compatible with the rpart implementation. ?
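A weighted least squares tree in the sense of (3.11) with weights $w_i = v_i$ can be sketched with rpart's case weights (again with hypothetical column names; to our understanding the weights argument enters the anova splitting criterion exactly as the case weights $w_i$ above):

library(rpart)

policies$freq <- policies$nc / policies$yr   # observed frequency y_i = N_i / v_i

wls_tree <- rpart(freq ~ age + price + km,
                  data = policies,
                  weights = yr,               # w_i = v_i, the years at risk
                  method = "anova",
                  control = rpart.control(cp = 0.001))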
3.2.3 Classification Tree
So far we always considered continuous response variables. However, in many applications we may face the situation that the response $Y$ only takes finitely many discrete values $\{0, \dots, J-1\}$, e.g. the number of claims of an insurance policy. When dealing with predicting a discrete response, one usually refers to classification instead of regression. The model (2.1) needs to be slightly modified, requiring $f : \mathbb{R}^p \to \{0, 1, \dots, J-1\}$ and the error term to be discrete as well. We then call the function $f$ a classifier and the problem of determining $f$ a classification problem. Fortunately, the tree building algorithm described in Section 3.1 can easily be adapted to classification problems. The only changes we have to make are the choice of the splitting criterion and the corresponding estimator for the constant in each leaf. For a leaf $m$, recall that $R_m$ denotes its region in the covariate space and $n_m$ the number of observations in $R_m$. The idea is to find splits that partition the data into subsets which are more pure, measured by an impurity function, than before the splitting.
Definition 3.18 (Classification Rate). For each node $m$ and class $k$ the classification rate $p(m,k)$ is given by
$$p(m,k) = \frac{1}{n_m} \sum_{\{i : x_i \in R_m\}} 1_{\{y_i = k\}}.$$
It holds that
$$p(m,0) + \dots + p(m, J-1) = 1.$$
Definition 3.19 (Impurity Function). An impurity function $\varphi$ is defined on all $J$-tuples $(p_0, \dots, p_{J-1})$ with $p_j \ge 0$ and $\sum_{j=0}^{J-1} p_j = 1$, satisfying the properties:
1. $\varphi$ is maximal only at the point $(\tfrac{1}{J}, \dots, \tfrac{1}{J})$.
2. $\varphi$ is minimal at the points $(1, 0, \dots, 0), \dots, (0, \dots, 0, 1)$.
3. $\varphi$ is symmetric in $p_0, \dots, p_{J-1}$, i.e.
$$\varphi(p_0, \dots, p_{J-1}) = \varphi(p_{\pi(0)}, \dots, p_{\pi(J-1)}),$$
for any permutation $\pi : \{0, \dots, J-1\} \to \{0, \dots, J-1\}$.
Given an impurity function $\varphi$, we can determine the impurity of a leaf $m$ by
$$\varphi(p(m,0), \dots, p(m, J-1)).$$
By definition of the impurity function, it is maximal if in a leaf all classes occur equally often, resp. minimal if there is only one class in a leaf. The impurity function $\varphi$ most commonly used in the
context of classification trees is called the Gini index and is defined by
$$G(p(m,0), \dots, p(m, J-1)) := \sum_{k \ne l} p(m,k)\, p(m,l) = \sum_{k=0}^{J-1} p(m,k)\big(1 - p(m,k)\big).$$
The Gini index can be interpreted as follows: assume that each element in a leaf $m$ is classified to class $k$ with probability $p(m,k)$. Denote by $\hat Y$ a predictor for $Y$; then the training error, i.e. the probability of misclassification, is given by
$$P(\hat Y \ne Y) = E\big(1_{\{\hat Y \ne Y\}}\big) = E\big(E\big(1_{\{\hat Y \ne Y\}} \,\big|\, Y\big)\big) = \sum_{k=0}^{J-1} P(Y = k)\, E\big(1_{\{\hat Y \ne Y\}} \,\big|\, Y = k\big) = \sum_{k=0}^{J-1} P(Y = k) \sum_{l \in \{0,\dots,J-1\}\setminus\{k\}} p(m,l) = \sum_{k \ne l} P(Y = k)\, p(m,l).$$
Hence the Gini index denotes an estimate of the training error. Using the impurity function as a measure of goodness, in each step of the recursion the tree algorithm chooses the splitting variable $j$ and split-point $s$ such that the impurity is improved the most. Let $m$ be a leaf; for each pair $(j,s)$, we get two child leaves $m_L(j,s)$ and $m_R(j,s)$. We modify the algorithm such that we aim to find $(j,s)$ with
$$(j,s) = \arg\max_{j,s}\, \Big[ G(m) - \big( G(m_L(j,s)) + G(m_R(j,s)) \big) \Big],$$
where $G(m) = G(p(m,0), \dots, p(m, J-1))$. Having determined the splits, we use the majority vote
$$k(m) = \arg\max_{k}\, p(m,k)$$
within each leaf to determine the classifier, with a deterministic tie-breaking rule if two maximal classes attain the same vote.
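The Gini-based split search is easy to spell out directly. The following self-contained R sketch evaluates, for simulated binary data, the Gini improvement of every split-point of a single numeric covariate; note that we weight the child impurities by their observation shares, as is common in practice, whereas the display above states the unweighted difference:

gini <- function(y) {                 # Gini index of a vector of class labels
  p <- table(y) / length(y)
  sum(p * (1 - p))
}

gini_gain <- function(y, x, s) {      # impurity improvement of the split x <= s
  left  <- y[x <= s]
  right <- y[x >  s]
  gini(y) - length(left) / length(y) * gini(left) -
            length(right) / length(y) * gini(right)
}

set.seed(1)
x <- runif(200)
y <- rbinom(200, size = 1, prob = ifelse(x > 0.6, 0.7, 0.2))

splits <- sort(unique(x))
gains  <- sapply(splits, function(s) gini_gain(y, x, s))
splits[which.max(gains)]              # best split-point for this covariate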
3.3 Additional Features
3.3.1 Categorical Covariates
Often in practice, we encounter both numerical and categorical covariates. When splitting a categorical covariate having $q$ possibly unordered values, there are $2^{q-1} - 1$ possible partitions of the $q$ values into two disjoint groups. Unfortunately, the computational burden of an exhaustive search through all partitions becomes immense if $q$ is large. However, we can substitute the categorical covariate by a numerical covariate containing the mean of the response for the respective covariate levels and apply the tree algorithm to the new numerical covariate. According to Hastie et al. (2001), one can show that this leads to the optimal split under the squared error loss function.
Example 3.20. Assume that an insurance company stores, for each policyholder (pol), the number of claims (nc), the duration of the insurance contract (jr) and the sex of the policyholder (sex). The aim of the insurer is to predict the claims frequency (claims / duration) using the sex of the policyholder. In order to apply the binary splitting algorithm to the covariate sex, a temporary (numerical) covariate fsex is generated, storing the mean of the response for the respective covariate levels, as illustrated in Table 3.1. The tree algorithm is then applied to the covariate fsex.
pol 1 2 3 4 5 6 7 8 9 10
nc 1 0 0 1 0 0 0 1 1 0
jr 1 1 1 1 1 1 1 1 1 1
sex m m w m w w m w m m
fsex 0.5 0.5 0.25 0.5 0.25 0.25 0.5 0.25 0.5 0.5
Table 3.1 – An example of how categorical covariates (e.g. sex) get transformed to numerical covariates (e.g. fsex) when fitting a regression tree. The data contains six male drivers with three observed claims, hence the mean of the outcome is 3/6 = 0.5; analogously, the four female drivers have one observed claim, giving 0.25.
?
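In R, the temporary covariate fsex of Table 3.1 can for instance be generated with ave(); a minimal sketch reproducing the table (the data frame simply restates Table 3.1):

dat <- data.frame(
  pol = 1:10,
  nc  = c(1, 0, 0, 1, 0, 0, 0, 1, 1, 0),
  jr  = rep(1, 10),
  sex = c("m", "m", "w", "m", "w", "w", "m", "w", "m", "m")
)

# mean observed frequency per level of the categorical covariate sex
dat$fsex <- ave(dat$nc / dat$jr, dat$sex, FUN = mean)
dat$fsex   # 0.50 for male and 0.25 for female drivers, as in Table 3.1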
3.3.2 Missing Data and Surrogate Splits
In this section we are not concerned with how we choose the splits, but rather with how we deal with missing data. Many methods in statistics deal with missing data by refusing to deal with them, i.e. by deleting observations with missing covariates in advance. We usually have two objectives in mind when considering missing data: first, make maximum use of the given observations even if some covariates are missing; secondly, be able to make predictions for cases with some covariate values missing. Having the first objective in mind, it is not satisfactory to discard any non-complete observation during the training process. Possible solutions are:
1. Filling in the missing values, e.g. using the mean over the non-missing values of the respective predictor variable.
2. For categorical variables, we can introduce a new category "missing".
3. Make use of surrogate variables.
In the following, we explain surrogate splits in more detail. When considering a predictor variable for a split, we only use the observations for which that covariate is not missing. This leads to a primary splitting variable and split-point. Before continuing with the binary splitting algorithm, we figure out which other covariate and corresponding split would best mimic the primary split, see Algorithm 3.2. For example, assume that at the current stage the algorithm decided to split the variable age of driver at the splitting point (age of driver $\le 40$, age of driver $> 40$). After that we
Algorithm 3.2 Calculation of the surrogate variables
Precondition: Let t = (N, j, s) be a node according to Definition 3.1.

function surrogate(node = (N, j, s))
    # First check whether the primary split sends an observation to the left or to the right:
    y_k := 1_{x_{jk} <= s} for all k in N.
    # Calculate the surrogate splits:
    for i in {1, ..., p} \ {j} do
        surrogate_split[i] := class_tree(response = y, covariate = x_i, size = 2)
        yhat_k := predict(surrogate_split[i], x_{ik}), for all k in N
        misclass[i] := ( sum_{k in N} 1_{yhat_k != y_k} ) / |N|
    end for
end function
use the tree algorithm to predict the two categories {age of driver $\le 40$} and {age of driver $> 40$}. For that purpose we fit a univariate classification tree with two leaves, modeling the discrete response $Y = 1_{\{\text{age of driver} \le 40\}}$, for each covariate except the one used in the primary split. This leads to a sequence of surrogate variables, which can be ordered according to their misclassification errors. If we need to evaluate the tree predictor during the training or testing phase at a point where we have missing data, we can use the surrogate splits instead. The concept of surrogate splits was introduced by Breiman et al. (1984) in an attempt to reduce the effect of missing data by exploiting correlations between the predictor variables.
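In rpart, the behaviour of surrogate splits for observations with missing covariate values is steered through rpart.control; a brief sketch of the relevant options (the values shown are, to the best of our knowledge, the package defaults; see the rpart documentation for details):

library(rpart)

ctrl <- rpart.control(
  maxsurrogate   = 5,   # number of surrogate splits retained per primary split
  usesurrogate   = 2,   # use surrogates; if all are missing, follow the majority direction
  surrogatestyle = 0    # rank surrogates by the total number of correct classifications
)

fit <- rpart(y ~ ., data = train, control = ctrl)
summary(fit)            # lists primary and surrogate splits with their agreement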
3.3.3 Variable Importance
In this section, we address the question of which covariates are the most important in a tree predictor. In the literature, there are slightly different notions of variable importance related to regression and classification trees. The definition we use is based on Chapter 5.3.4 of Breiman et al. (1984).

A critical issue regarding variable importance is how to deal with variables which do not give the best split for a node but may be second or third best. For example, covariate $x_1$ may never occur as a split in the final tree. However, if a second covariate $x_2$ is removed and a new tree is grown, $x_1$ may occur very prominently in the splits, with a tree almost as accurate as the original. In such situations, we require that a measure of variable importance is capable of detecting the importance of $x_1$. The solution proposed by Breiman et al. (1984) is based on using both primary and surrogate splits, see Section 3.3.2. Let $T$ be the final optimal subtree, and let $R$ be the measure of goodness used to grow the tree. At each node $t \in T$, we define $\Delta R(x, t)$, the decrease in the loss due to splitting variable $x$, where $x$ may be a primary or surrogate splitting variable.
Definition 3.21 (Variable Importance). The measure of importance of variable $x$ is defined as
$$M(x) = \sum_{t \in T} \Delta R(x, t).$$
In Appendix C.1, we provide an implementation of the variable importance function according to Definition 3.21, which is compatible with the rpart package. The measure is often normalized to 100 for the most important covariate. Unfortunately, variable importance does not contain any information about the direction of influence of a particular covariate on the response.
Example 3.22. Analogously to Example 3.20, we assume that an insurance company wants to
predict the claims frequency based on the covariates age of driver (age), price of car (price) as well
as policynumber (pol). Figure 3.4 shows the variable importance for an example tree predictor. As
expected, we see that both age and price seem to occur prominently in the tree. The policynumber
is not used at all and hence results in a variable importance of zero.
Figure 3.4 – Variable importance plot for Example 3.22.
?
Remark 3.23. Note that the rpart package also provides a built-in function for variable importance. However, this implementation multiplies the improvement $\Delta R(x, t)$ by an agreement factor if $x$ was used as a surrogate split in node $t$. The agreement factor takes into account to what extent the surrogate split is capable of mimicking the primary split. For our purpose, we will work with the original definition given in Breiman et al. (1984). J
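For completeness, the built-in measure mentioned in Remark 3.23 can be read off a fitted rpart object directly; a small sketch with hypothetical covariates (the thesis itself works with the implementation of Definition 3.21 provided in Appendix C.1 instead):

library(rpart)

fit <- rpart(cbind(yr, nc) ~ age + price + pol, data = policies, method = "poisson")

# rpart's built-in measure: improvements weighted by the surrogate agreement factor
vi <- fit$variable.importance
vi <- 100 * vi / max(vi)     # normalize to 100 for the most important covariate
barplot(vi, ylab = "Variable Importance")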
3.4 Random Forest
In practice, classification and regression trees have been shown to be rather unstable, in the sense that small changes in the training data may lead to large changes in the predictions, see also Section 3.4.2. In this section we present the random forest algorithm introduced by Breiman (2001), which tries to reduce the variability of an individual tree predictor. Conceptually, random forest is based on combining many tree predictors. For that purpose, we first introduce bagging, a key building block of random forest. We call methods derived from many individual tree predictors tree ensembles.
3.4.1 Bagging
Bagging was introduced by Breiman (1996) and consists of the two elements bootstrapping and
aggregating. We first discuss the concept of aggregation.
Aggregating
Assume we have a training data set $\mathcal{L} = \{(x_i, y_i) : i = 1, \dots, n\}$, where $(x_1, y_1), \dots, (x_n, y_n)$ are i.i.d. drawn from a distribution $P$. Using this training data we obtain the predictor $f(x, \mathcal{L})$. Now suppose that we are given a sequence of training data sets $\{\mathcal{L}_k\}_{k \ge 1}$, each consisting of $n$ observations drawn from the same distribution. Instead of one single predictor $f(x, \mathcal{L}_k)$, the idea of aggregation is to use the average over the training sets $\{\mathcal{L}_k\}_{k \ge 1}$, i.e.
$$f_{\mathrm{agg}}(x) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} f(x, \mathcal{L}_k) = E(f(x, \mathcal{L})), \quad P\text{-a.s.},$$
as a new predictor. Note that the convergence follows by the law of large numbers. An important property of aggregating is that it does not change the bias, since
$$\mathrm{Bias}(f_{\mathrm{agg}}(x)) = f(x) - E(f_{\mathrm{agg}}(x)) = f(x) - f_{\mathrm{agg}}(x) = f(x) - E(f(x, \mathcal{L})) = \mathrm{Bias}(f(x, \mathcal{L})). \qquad (3.14)$$
However, aggregating can substantially increase the predictive capacity measured by the mean squared error. Fix a data point $(x, Y)$ independent of the learning set $\mathcal{L}$; then
$$E((Y - f(x, \mathcal{L}))^2 \,|\, Y) = Y^2 - 2 Y\, E(f(x, \mathcal{L}) \,|\, Y) + E(f(x, \mathcal{L})^2 \,|\, Y).$$
Due to the independence of $Y$ and $\mathcal{L}$, $E(f(x, \mathcal{L}) \,|\, Y) = E(f(x, \mathcal{L})) = f_{\mathrm{agg}}(x)$, and by Jensen's inequality it follows that $E(f(x, \mathcal{L})^2) \ge E(f(x, \mathcal{L}))^2$. Together this implies that
$$E((Y - f(x, \mathcal{L}))^2 \,|\, Y) \ge (Y - f_{\mathrm{agg}}(x))^2.$$
Taking the expectation with respect to $Y$ on both sides gives
$$E((Y - f(x, \mathcal{L}))^2) \ge E((Y - f_{\mathrm{agg}}(x))^2), \qquad (3.15)$$
i.e. aggregating reduces the conditional mean squared error. The magnitude of the decrease depends on how unequal the two sides of
$$E(f(x, \mathcal{L}))^2 \le E(f(x, \mathcal{L})^2)$$
are; equivalently, the bigger the variance $\operatorname{Var}(f(x, \mathcal{L}))$, the larger the reduction in the conditional mean squared error. Combining Proposition 2.2 and equation (3.15), it follows that
$$E((Y - f(x, \mathcal{L}))^2) = \mathrm{Bias}(f(x, \mathcal{L}))^2 + \operatorname{Var}(f(x, \mathcal{L})) + \operatorname{Var}(\varepsilon) \ge E((Y - f_{\mathrm{agg}}(x))^2) = \mathrm{Bias}(f_{\mathrm{agg}}(x))^2 + \operatorname{Var}(f_{\mathrm{agg}}(x)) + \operatorname{Var}(\varepsilon).$$
However, since $\mathrm{Bias}(f(x, \mathcal{L})) = \mathrm{Bias}(f_{\mathrm{agg}}(x))$, we conclude that the improvement in the conditional mean squared error purely originates from an improvement of the variance of the aggregated predictor. By equation (3.15), we also obtain a decrease in the mean squared error since
$$E((Y - f(X, \mathcal{L}))^2) = E\big(E((Y - f(X, \mathcal{L}))^2 \,|\, X)\big) \ge E\big(E((Y - f_{\mathrm{agg}}(X))^2 \,|\, X)\big) = E((Y - f_{\mathrm{agg}}(X))^2).$$
We can conclude that aggregating improves the predictive capacity of an estimation procedure in terms of the mean squared error by reducing its variance.
Bagging
Unfortunately, we usually do not have the luxury of arbitrarily many duplicates of $\mathcal{L}$. Instead, the idea of bagging is to take bootstrap samples $\{\mathcal{L}^{(b)}\}_{b=1,\dots,B}$ from $\mathcal{L}$ to determine a bootstrapped predictor for $f_{\mathrm{agg}}$, analogously to Section 2.3.2. The bootstrapped predictor is obtained by taking the average over the bootstrap samples, resulting in the predictor
$$f_{\mathrm{bagg}}(x) = \frac{1}{B} \sum_{b=1}^{B} f(x, \mathcal{L}^{(b)}). \qquad (3.16)$$
For $B \to \infty$, we have that
$$\lim_{B \to \infty} \frac{1}{B} \sum_{b=1}^{B} f(x, \mathcal{L}^{(b)}) = E_{P_n}\big(f(x, \mathcal{L}^{(1)})\big), \quad P_n\text{-a.s.}$$
This procedure is called bagging, an acronym for bootstrap aggregating. Whether bagging improves accuracy depends largely on the stability of the procedure on which $f(x, \mathcal{L})$ is based. If changes in the training data $\mathcal{L}$ produce small changes in $f(x, \mathcal{L})$, i.e. a small variance $\operatorname{Var}(f(x, \mathcal{L}))$, then $f_{\mathrm{agg}}(x)$ will be close to $f(x, \mathcal{L})$. However, a large improvement will occur if the base procedure is unstable, where small changes in $\mathcal{L}$ can result in large changes in $f(x, \mathcal{L})$. On the other hand, the impact of bootstrapping will also depend on the stability of the base procedure and the richness of the data. If we consider a stable method, the loss in accuracy due to the transition from $f(x, \mathcal{L})$ to the estimate based on the bootstrap sample $f(x, \mathcal{L}^{(1)})$ will be small, since the bootstrapped sample $\mathcal{L}^{(1)}$ is expected to be similar to $\mathcal{L}$. On the other hand, for unstable methods even small changes result in large changes of the estimate, and so we expect $f(x, \mathcal{L})$ to be better than $f(x, \mathcal{L}^{(1)})$. In Chapter 4 of Breiman (1996), an example for the cross-over point between instability and stability, beyond which the bagged predictor $f_{\mathrm{bagg}}(x)$ does worse than $f(x, \mathcal{L})$, is given.
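Bagging a tree predictor takes only a few lines; the following R sketch (squared error setting, hypothetical data frames train and test with response y) draws B bootstrap samples, fits a tree to each and averages the predictions as in (3.16):

library(rpart)

bagged_tree_predictor <- function(train, test, B = 100) {
  preds <- matrix(NA_real_, nrow = nrow(test), ncol = B)
  for (b in seq_len(B)) {
    idx  <- sample(nrow(train), replace = TRUE)                 # bootstrap sample L^(b)
    tree <- rpart(y ~ ., data = train[idx, ],
                  control = rpart.control(cp = 0))              # grow a large tree
    preds[, b] <- predict(tree, newdata = test)
  }
  rowMeans(preds)                                               # f_bagg(x) as in (3.16)
}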
Remark 3.24. As shown in the present section, aggregation in general reduces the mean squared error. Since the bias is left untouched, this results in a reduction of the variance. When considering different loss functions, e.g. the deviance statistics, there does not exist a similar decomposition into bias and variance. Hence we cannot draw the same conclusion of an improvement of the predictive capacity due to variance reduction. However, empirical results for simulated insurance claims data show that the variance reduction due to aggregating also positively affects the deviance statistics, see Section 6.2.3. J
3.4.2 Bagged Tree Predictor
So far, all considerations on bagging are fully general and applicable to any estimation procedure. However, as shown above, the improvements of aggregation are particularly notable for procedures with high variance $\operatorname{Var}(f(x, \mathcal{L}))$. In this section, we discuss two different sources of variability contained in the tree algorithm. Based on these considerations, we conclude that bagging has a large potential of improving the generalization performance of classification and regression trees. The literature of this section is based on Berk (2008) as well as Shalizi (2009).

The first source of uncertainty of a tree predictor lies in the variability of the estimates within a given partition. This issue can be solved by ensuring that the leaves have a minimal size. In the rpart function this is controlled by the parameter minbucket. In Section 5.3.1 we show how a reasonable choice for insurance claims data can be derived. The second source lies in how splitting variables are chosen. In each node the sample size may be significantly reduced, which very often leads to the situation where several splitting variables result in a similar reduction of the loss function. Then already small changes in the training data may lead to choosing a different splitting variable. This effect is amplified for highly correlated covariates. Since the algorithm proceeds in a stage-wise manner, a chosen split affects all succeeding splits. If changes occur early in the tree, the impacts are even bigger. Based on these arguments, we have conducted a simulation study in the second part of
the thesis (see Section 5.3.7) comparing the variability of the tree predictor to a generalized linear
regression model. The findings in Section 5.3.7 support the intuition of high variance related to tree
models.
3.4.3 Random Forest
For bagging we saw that the improvement comes purely from variance reduction. Therefore, we further elaborate on the variance of the tree predictor. Since the $\mathcal{L}^{(b)}$ are identically distributed, denote for all $b = 1, \dots, B$ the variance of a bootstrapped tree by
$$\sigma^2(x) = \operatorname{Var}(f(x, \mathcal{L}^{(b)})).$$
The next lemma gives more detailed insight into the variance of the bagged tree predictor.

Lemma 3.25. Assume that the sampling correlation between any pair of tree predictors is equal and given by
$$\rho(x) = \operatorname{corr}(f(x, \mathcal{L}^{(i)}), f(x, \mathcal{L}^{(j)})), \quad \text{for } i \ne j.$$
Then the variance of the bagged tree predictor can be written as
$$\operatorname{Var}(f_{\mathrm{bagg}}(x)) = \rho(x)\sigma^2(x) + \frac{1 - \rho(x)}{B}\sigma^2(x). \qquad (3.17)$$
Proof. The proof follows by a direct calculation,
$$\operatorname{Var}(f_{\mathrm{bagg}}(x)) = \operatorname{Var}\left(\frac{1}{B}\sum_{b=1}^{B} f(x, \mathcal{L}^{(b)})\right) = \frac{1}{B^2}\sum_{i,j=1}^{B} \operatorname{Cov}(f(x, \mathcal{L}^{(i)}), f(x, \mathcal{L}^{(j)}))$$
$$= \frac{1}{B^2}\left(\sum_{i,j=1,\, i \ne j}^{B} \operatorname{Cov}(f(x, \mathcal{L}^{(i)}), f(x, \mathcal{L}^{(j)})) + \sum_{i=1}^{B} \operatorname{Var}(f(x, \mathcal{L}^{(i)}))\right)$$
$$= \frac{1}{B^2}\left((B^2 - B)\,\sigma^2(x)\rho(x) + B\,\sigma^2(x)\right) = \rho(x)\sigma^2(x) + \frac{1 - \rho(x)}{B}\sigma^2(x),$$
where in the third equality we used that $\operatorname{Cov}(f(x, \mathcal{L}^{(i)}), f(x, \mathcal{L}^{(j)})) = \sigma^2(x)\rho(x)$ for $i \ne j$, as well as the fact that the first sum has $B^2 - B$ summands.
As $B \to \infty$, the second term tends to zero, but the first term remains. Hence the size of the correlation between a pair of tree predictors limits the benefits of averaging.

The idea of Breiman (2001) is to decouple the single trees used for aggregation in order to decrease the correlation and hence further improve the predictor. This is achieved by random covariate selection: when growing a tree, in each step of the recursion we randomly choose $m$ of the $p$
covariates. We then only search for the optimal split within these $m$ covariates. We denote by $f(x, \Theta(\mathcal{L}))$ a single random forest tree, where $\Theta(\mathcal{L})$ characterizes the random forest tree in terms of splitting variables, split-points at each node and leaf values. Analogously to bagging, the random forest predictor is obtained by aggregation, see Algorithm 3.3. For an individual random forest tree
Algorithm 3.3 Random Forest for Regression
Denote by $\mathcal{L}$ a learning sample.
for $b = 1, \dots, B$ do
    1. Generate a bootstrap sample $\mathcal{L}^*$ by drawing with replacement from the data $\mathcal{L}$.
    2. Grow a tree on the bootstrapped data by recursively repeating the following steps for each leaf of the tree, until the maximum number of leaves (maxnodes) is reached or no further split is possible:
        (a) Randomly select $m$ of the $p$ variables.
        (b) Split the leaf according to the best variable/split-point among the $m$ covariates.
    This results in a random forest tree $f(x, \Theta_b(\mathcal{L}))$, where $\Theta_b(\mathcal{L})$ characterizes the random forest tree in terms of splitting variables, split-points at each node and leaf values.
end for
Based on the ensemble of trees $\{f(x, \Theta_b(\mathcal{L}))\}_{b=1,\dots,B}$, output the random forest predictor
$$f_{rf}^{B}(x) = \frac{1}{B} \sum_{b=1}^{B} f(x, \Theta_b(\mathcal{L})). \qquad (3.18)$$
the sampling correlation $\rho_m(x)$ and sampling variance $\sigma_m^2(x)$ are given by
$$\rho_m(x) = \operatorname{corr}(f(x, \Theta_i(\mathcal{L})), f(x, \Theta_j(\mathcal{L}))), \quad \text{for } i \ne j,$$
resp.
$$\sigma_m^2(x) = \operatorname{Var}(f(x, \Theta(\mathcal{L}))),$$
with $\rho_m(x) = \rho(x)$ resp. $\sigma_m^2(x) = \sigma^2(x)$ for $m = p$. By equation (3.18), in the limiting case $B \to \infty$, the random forest predictor is given by
$$f_{rf}(x) = E(f(x, \Theta(\mathcal{L})) \,|\, \mathcal{L}), \quad P_n\text{-a.s.}$$
Analogously to equation (3.17), the variance of the random forest predictor is given by
$$\operatorname{Var}(f_{rf}(x)) = \rho_m(x)\,\sigma_m^2(x).$$
When decreasing $m$ we expect a decrease in the sample correlation $\rho_m(x)$, since the randomized variable selection decouples the individual random forest trees (see Figure 3.5). For the variance
of a single random forest tree $\sigma_m^2(x)$, the effects are less obvious. Applying the tower property for conditional variance, we get for the variance of a single tree predictor
$$\sigma_m^2(x) = \operatorname{Var}(E(f(x, \Theta(\mathcal{L})) \,|\, \mathcal{L})) + E(\operatorname{Var}(f(x, \Theta(\mathcal{L})) \,|\, \mathcal{L})) = \operatorname{Var}(f_{rf}(x)) + E(\operatorname{Var}(f(x, \Theta(\mathcal{L})) \,|\, \mathcal{L})),$$
where the first term is the variance of the random forest predictor, which we hope is decreasing when decreasing $m$. The second term denotes the variance due to randomization, which is going to increase when decreasing $m$, since there is more randomization. Overall, the parameter $m$ should therefore be seen as a tuning parameter, which controls the possible improvements due to randomization.
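A sketch of how the tuning parameter m (called mtry in the randomForest package) and the tree size limit maxnodes can be explored on a hypothetical training and test set; the grid and the value maxnodes = 50 are illustrative only:

library(randomForest)

m_grid <- c(2, 5, 10, floor(ncol(train_x) / 3))

test_mse <- sapply(m_grid, function(m) {
  rf <- randomForest(x = train_x, y = train_y,
                     ntree = 500, mtry = m, maxnodes = 50)
  mean((test_y - predict(rf, newdata = test_x))^2)
})

cbind(mtry = m_grid, test_mse)   # compare predictive performance over m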
In order to illustrate the effects of the tuning parameter $m$ on the random forest predictor, we have simulated from
$$Y = \frac{1}{\sqrt{50}} \sum_{i=1}^{50} X_i + \varepsilon,$$
where all $X_i$ and $\varepsilon$ are i.i.d. normally distributed with mean zero and variance one. We have used 500 training sets of size 100 and one test set with 600 observations. Figure 3.5 shows the sample correlation $\rho_m(x)$, the tree variances $\sigma_m^2(x)$ as well as their product $\operatorname{Var}(f_{rf}(x)) = \rho_m(x)\sigma_m^2(x)$ as a function of the number of randomly selected splitting variables $m$. The boxplots represent the respective values at the 600 testing data points. This shows that we can reduce the variance
[Figure 3.5: boxplots over the 600 test points, as a function of m, of the sample correlations between trees, the variances of a single random forest tree, and the variance of the random forest predictor.]
Figure 3.5 – Simulation study for reducing the variance due to a reduction of the sample correlation between individual random forest trees.
of the random forest predictor drastically when reducing $m$. In order to eventually improve the predictive capacity, we also need to keep the bias in mind. The randomization and the reduction of the sample space when growing a random forest tree typically increase the bias. In Figure 3.6, we show the average bias, average variance and mean squared error (see equation (2.4)) as a function of $m$. We observe the mentioned increase in bias when decreasing $m$, i.e. reducing the sample space and increasing the randomization. Therefore we face a classical bias-variance trade-off in the choice of $m$. In the literature it is often suggested to use $m = \lfloor p/3 \rfloor$, where $p$ denotes the
[Figure 3.6: mean squared error, average squared bias and average variance of the random forest predictor as functions of m.]
Figure 3.6 – Bias-variance-trade-off of the random forest predictor as a function of the tuning parameter m.
number of covariates. However, $m$ should be seen as a tuning parameter, and different values should be tested in practice. So far we have not discussed the parameter maxnodes, which limits the size of the individual random forest trees. Based on the findings in Part 2 on simulated and real data, we suggest choosing maxnodes slightly larger than the size of an individual tree predictor chosen according to the minimum cross-validation error rule, see Section 5.3.3 below.
3.4.4 Model Visualization
Since trees allow for very complex interactions between the predictor variables, it is difficult to understand the individual influence of each covariate. For single tree predictors, we can use the tree representation to visualize the predictor as well as the interactions. For tree ensembles, there does not exist an extension of the tree representation. In the literature, such models are often referred to as Black Box Models. In this section, we discuss three tools for black box visualization: Variable Importance, Partial Dependence Plots and Individual Conditional Expectation Plots, the latter introduced in Goldstein et al. (2014).
Variable Importance
For black box models a measure of variable importance is even more desired. Fortunately, we can generalize the idea of variable importance for single tree predictors introduced in Section 3.3.3 to tree ensembles. We proceed by calculating the variable importance for each covariate of every tree predictor used in the ensemble and output the average. Formally, let $T_1, \dots, T_m$ be the single tree predictors contained in the ensemble. For each tree, we can calculate the variable importance $M(x, T)$ according to Definition 3.21, where $x \in \{x_1, \dots, x_p\}$ and $T \in \{T_1, \dots, T_m\}$. The variable importance for the tree ensemble is given by
$$M^{\mathrm{ens}}(x) = \frac{1}{m} \sum_{i=1}^{m} M(x, T_i).$$
Partial Dependence and Individual Conditional Expectation Plots
The partial dependence plot shows the influence of a covariate $X_i$, $i \in \{1, \dots, p\}$, when all other covariates are integrated out. Formally, fix a covariate index $i \in \{1, \dots, p\}$ and define the set of remaining covariates $C := \{1, \dots, p\} \setminus \{i\}$. Denote by $X_C := (X_j : j \in C) \in \mathbb{R}^{p-1}$ the vector of the corresponding covariates. Then the partial dependence function $f_i : \mathbb{R} \to \mathbb{R}$ is given by
$$f_i(x) = E(f(X_i, X_C) \,|\, X_i = x) = \int f(x, X_C)\, dP(X_C), \qquad (3.19)$$
where $f$ denotes the model function (2.1) and $P(X_C)$ denotes the distribution of $X_C$. As neither $f$ nor $P(X_C)$ is known, we estimate equation (3.19) by
$$\hat f_i(x) = \frac{1}{n} \sum_{j=1}^{n} \hat f_n(x, x_{j,C}), \qquad (3.20)$$
where $x_{j,C}$, for $j = 1, \dots, n$, denote the observed values of the covariates $X_C$. Plotting $x$ against $\hat f_i(x)$ displays the average value of $\hat f_n(\cdot)$ as a function of the $i$-th covariate. However, the results must be interpreted with caution.
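Equation (3.20) translates directly into a few lines of R. The following sketch computes the estimated partial dependence function of one covariate for any fitted model with a predict method; keeping the individual curves instead of their average yields the ICE curves discussed further below:

partial_dependence <- function(fit, data, var, grid = sort(unique(data[[var]]))) {
  sapply(grid, function(x0) {
    newdata <- data
    newdata[[var]] <- x0          # set the i-th covariate to x0 for all observations
    mean(predict(fit, newdata))   # average over the observed x_{j,C}, cf. (3.20)
  })
}

ice_curves <- function(fit, data, var, grid = sort(unique(data[[var]]))) {
  sapply(grid, function(x0) {
    newdata <- data
    newdata[[var]] <- x0
    predict(fit, newdata)         # one value per observation j
  })                              # matrix: rows = observations, columns = grid points
}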
Example 3.26. This example is taken from Goldstein et al. (2014). Consider the data generating
process
$$Y = 0.2 X_1 - 5 X_2 + 10 X_2\, 1_{\{X_3 \ge 0\}} + \varepsilon, \qquad (3.21)$$
where $\varepsilon \sim \mathcal{N}(0,1)$ and $X_1, X_2, X_3$ are i.i.d. $\sim \mathcal{U}[-1,1]$. We have simulated 1'000 observations from model (3.21) and fit a random forest predictor using the R package randomForest. Figure 3.7 shows the true values of $f(x_1, x_2, x_3) = 0.2 x_1 - 5 x_2 + 10 x_2 1_{\{x_3 \ge 0\}}$ versus $x_2$ for the observed data. The right
[Figure 3.7: (a) scatter plot of the true values f(x1, x2, x3) versus x2; (b) estimated partial dependence plot for X2.]
Figure 3.7 – Example of a misleading partial dependence plot for dependent covariates.
panel of Figure 3.7 shows the corresponding partial dependence plot for covariate $X_2$. It suggests that on average $X_2$ is not associated with $Y$. However, from model equation (3.21) we see that $X_2$ has either a positive effect on $Y$ if $X_3 \ge 0$ or a negative effect if $X_3 < 0$. In the partial dependence function these effects cancel each other out due to symmetry. Therefore the partial dependence plot may be misleading if the model contains interactions. ?

Apart from dependencies between the covariates, also their marginal distribution plays a crucial role in the quality and interpretability of the partial dependence plot.
Example 3.27. Note that the estimation in (3.20) is twofold: first of all, it is an estimate of the expected value in equation (3.19), but it is also an estimate of the unknown model function $f(\cdot)$. This should be kept in mind when using the partial dependence plot. Consider the following conceptual model shown in Figure 3.8a, which attains the values 0 or 1 according to the covariates $X_1$ and $X_2$. The blue lines denote the marginal distributions of the covariates $X_1$ and $X_2$. In the top left corner, we have particularly few observations. On the other hand, in the bottom right corner we have the majority of the observed data points. When fitting a model, this may largely distort the predictor. In Figure 3.8b a conceptual example of a fitted model is shown, which is distorted by the marginal distribution of the covariates. Therefore the quality of the estimated partial dependence function also depends on the quality of the fitted predictor.

[Figure 3.8: (a) conceptual model attaining the values 0 and 1, with the marginal distributions of X1 and X2; (b) distorted model fit.]
Figure 3.8 – Example of a distorted fit due to skewed marginal distributions.

However, note that the marginal distributions are also of relevance for the interpretation of the partial dependence plot even if the fitted model successfully recovers the characteristics shown in Figure 3.8a. In Figure 3.9 the partial dependence function for the covariate $X_1$ corresponding to the model in Figure 3.8a is shown. We see that the resulting function is heavily distorted by the skewed marginal distribution of the covariate $X_2$. For small values of $X_1$, the partial dependence function is nearly zero, since the contribution of covariate values for which the model attains 1 is very small due to the skewness of $X_2$; analogously for large values of $X_1$.
[Figure 3.9: estimated partial dependence function for X1 under the conceptual model of Figure 3.8a, ranging between 0 and 1.]
Figure 3.9 – Distorted partial dependence function due to skewed marginal distributions.
?
The individual conditional expectation (ICE) plot disaggregates the output of the partial dependence plot. Rather than plotting the average, we may also plot all $n$ estimated curves $\hat f_n(x, x_{j,C})$ for $j = 1, \dots, n$. The terminology individual conditional expectation plot comes from the fact that $\hat f_n(x)$ is an estimate for the conditional expectation $f(x) = E(Y \,|\, X = x)$, see Section 2.1. Each curve $\hat f_n(x, x_{j,C})$ shows the predicted response as a function of the $i$-th covariate with the remaining covariates $x_{j,C}$ held fixed. Figure 3.10 shows the ICE plot for Example 3.26. The black line denotes the estimated partial dependence function. Parallel curves indicate that the effect of the $i$-th covariate on the response $Y$ does not depend on other covariates. Figure 3.10, on the other hand, shows that the effect of $X_2$ clearly depends on further covariates.
[Figure 3.10: ICE curves of the random forest predictor as a function of x2, with the estimated partial dependence function in black.]
Figure 3.10 – ICE plot corresponding to Example 3.26. The intersecting lines suggest the presence of an interaction between X2 and further covariates.
Part II
Data Analysis
Chapter 4
Motor Insurance Collision Data
In this chapter, we introduce the general objectives in pricing of non-life insurance contracts based
on Wuthrich (2015). We then describe the Motor Insurance Collision Data Set of AXA Winterthur
as well as a data generating process derived from the characteristics observed in the collision data
set.
4.1 Non-Life Insurance Pricing
"Non-life insurance pricing is the art of setting the price of an insurance policy, taking into consideration various properties of the insured object and the policyholder. The main source on which to base the decision is the insurance company's own historical data on policies and claims. In a tariff analysis, an actuary uses this data to find a model which describes how the claim cost of an insurance policy depends on a number of explanatory variables." (see preface, page vii, of Ohlsson and Johansson (2010)).
A non-life insurance policy may cover damages on a car, house or other property, or losses caused to another person (third party liability). For a company, the insurance may cover property damages, cost of business interruption and more. In effect, any insurance that is not a life insurance product is classified as non-life insurance, sometimes also called property and casualty (P&C) insurance. Since not every policy has the same expected loss, there is a need for statistical methods to capture this feature. Historically, the pricing of insurance contracts was strongly regulated and did not involve much statistical analysis. In fact, Switzerland was the last European country to get rid of the cartel rate regulations in 1996. Since then, non-life insurance pricing has changed towards charging a fair premium in the sense that each policyholder pays a premium that corresponds to the expected losses transferred to the insurance company. Before we start with the statistics, we define a mathematical framework suitable for modeling insurance claims associated to a policyholder.
Definition 4.1 (Total Claim Amount). The total claim amount $S_m$ of policy $m$ is given by the compound distribution
$$S_m = Y_1^{(m)} + \dots + Y_{N_m}^{(m)} = \sum_{i=1}^{N_m} Y_i^{(m)},$$
where
1. the claims count $N_m$ is a discrete random variable supported on $A \subset \mathbb{N}_0$,
2. the claim sizes $(Y_i^{(m)})_{i \ge 1}$ are i.i.d. $\sim G_m$ with $G_m(0) = 0$,
3. $(Y_i^{(m)})_{i \ge 1}$ and $N_m$ are independent.
Since $N_m$ plays the role of the number of claims of policyholder $m$, it is natural to assume that it may only take integer values. In the context of insurance pricing, the number of claims should always be understood in relation to an underlying volume $v_m > 0$. For our purpose, we use the amount of time that the insurance company is liable for the risk of the policy, usually measured in years. We will use the terminology years at risk for the underlying volume $v_m$. The ratio $N_m/v_m$ is called claims frequency, with expected claims frequency denoted by
$$E\!\left(\frac{N_m}{v_m}\right) = \lambda_m.$$
The second assumption in Definition 4.1 specifies the individual claim sizes. The condition $G_m(0) = 0$ implies that the claims $Y_i^{(m)}$ are strictly positive $P$-a.s. Within this framework we can give more detailed insight into the expected total claim amount $E(S_m)$.
Proposition 4.2. Let $S_m$ be the total claim amount of policyholder $m$. Then the expected total claim amount is given by
$$E(S_m) = E(N_m)\, E\big(Y_1^{(m)}\big) = \lambda_m v_m\, E\big(Y_1^{(m)}\big).$$
Proof. The proof is an application of the tower property for conditional expectation,
$$E(S_m) = E\left(\sum_{i=1}^{N_m} Y_i^{(m)}\right) = E\left(E\left(\sum_{i=1}^{N_m} Y_i^{(m)} \,\Bigg|\, N_m\right)\right) = E\left(\sum_{i=1}^{N_m} E\big(Y_i^{(m)} \,\big|\, N_m\big)\right) = E\left(\sum_{i=1}^{N_m} E\big(Y_i^{(m)}\big)\right) = E(N_m)\, E\big(Y_1^{(m)}\big),$$
where we used the tower property in the second, linearity of the expectation in the third, independence of $N_m$ and $(Y_i^{(m)})_{i \ge 1}$ in the fourth and the identically distributed property in the last equality.
As mentioned above, a fair premium (without loadings) is seen to be the expected loss $E(S_m)$ transferred to the insurance company with respect to the amount of time the insurer was liable for policy $m$. We define the pure premium as the expected total claim amount per duration, or using Proposition 4.2,
$$\text{Pure Premium} = E\!\left(\frac{S_m}{v_m}\right) = E\!\left(\frac{N_m}{v_m}\right) E\big(Y_1^{(m)}\big) = \lambda_m\, E\big(Y_1^{(m)}\big) = \text{Expected Claims Frequency} \cdot \text{Expected Claim Severity}. \qquad (4.1)$$
Motivated by (4.1), insurance companies set up two separate models: claims frequency and claim severity. For the remainder, we focus on claims frequency models.
4.2 Motor Insurance Collision Data
The Motor Insurance Collision Data Set contains all information collected by AXA Winterthur on
policies with a fully comprehensive (Vollkasko) car insurance with collision cover included from 2010
until 2014. For the data analysis we focus on collision claims only. In total the data consists of
approximately 450’000 policies. A list of all covariates we consider in this thesis can be found in the
Appendix in Table D.1. We begin with a few comments on the data structure (see Table 4.1 for an
illustration of the structure):
rec pol beg end yr nc ac x1 . . . xp
1 1 01.01.09 12.04.09 0.279 1.0043 3487 a1 · · · b1
2 1 13.04.09 31.12.09 0.721 0 0 a1 · · · b1
3 2 01.01.09 31.12.09 1 0 0 a2 · · · b2
4 3 01.01.09 17.07.09 0.543 0 0 a3 · · · b3
5 3 18.07.09 15.12.09 0.416 0 0 a30 · · · b3
6 3 16.12.09 31.12.09 0.041 1.0043 2676 a30 · · · b3
... ... ... ... ... ... ... ... ... ...
Table 4.1 – Record structure of the data provided by AXA Winterthur.
AXA generates a new record for every policy at the end of each accounting year or if there has
been a change in the data of the policy. The latter might occur when the client has reported a
claim but also simply when the client has moved to a neighbouring village, i.e. a change in the
zip code (plz). Using this system, we can capture the development of all covariates for each
policy. For example from record 4 to record 5 the covariate x1 changed from a3 to a30.
In order to account for incurred but not reported (IBNR) claims, the missing number of claims
is estimated and uniformly added to the number of claims of observations where a claim has
been reported. This explains why the number of claims (nc) is not integer-valued. When fitting
a GLM model with a Poisson assumption (see Ohlsson and Johansson (2010) or Chapter 7.3
in Wuthrich (2015)) for the number of claims, we need to be aware of this fact and adjust the
data accordingly.
If a policyholder encounters two claims on the same day, the dataset has an additional obser-
vation containing the second claim with its duration set to zero. Since a zero duration would make the observed claims frequency undefined, the duration of such records is manually set to 0.001.
4.2.1 Data Management
Based on the particular structure of the Motor Insurance Collision Data Set, we need to prepare the
data, before we can perform a statistical analysis. As shown in Table 4.1, every time a covariate has
changed, AXA generates a new record in the database. This might artificially reduce the durations
of the policies and therefore inflate their observed frequencies. Hence we pre-processed the data, resulting in two types of data sets:
Variations of the Data Set:
Data Set A:
1. Identify all relevant covariates considered in the statistical analysis.
2. Merge all records of the same policy where no changes in any relevant covariates have
occurred. For example, in Table 4.1 if the covariate x1 is irrelevant for our purpose, then
record (rec) 4 and 5 can be merged. Analogously, records 1 and 2 can be merged, since no covariates have changed.
When merging records, the claim amounts (ac), the numbers of claims (nc) and the durations (yr) are added up. The changes are only due to data management and no information is lost. However, there still remain policies with durations smaller than one. The problem becomes more severe if we consider many covariates as relevant in step 1.
Data Set B:
1. Identify all relevant covariates considered in the statistical analysis.
2. Merge all records of the same policy.
The resulting data set will mostly contain observations with duration one. Only policies that were cancelled or signed during the year, or policies which were not active during the whole year (e.g. vintage cars), do not have a full year at risk. A small R sketch of the merging step is given below.
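The following is a minimal sketch (not the thesis implementation) of the merging step using base R; the column names pol, yr, nc and ac follow Table 4.1, while the additional grouping variables for Data Set A are assumptions.

# Sketch: merge all records of the same policy (Data Set B). For Data Set A one
# would additionally group by the relevant covariates, e.g. c("pol", "age_gr").
merge_records <- function(data, by = "pol") {
  aggregate(data[, c("yr", "nc", "ac")],   # add up durations, claims and amounts
            by = data[, by, drop = FALSE],
            FUN = sum)
}

# Data Set B: merge_records(policies, by = "pol")
# Data Set A: merge_records(policies, by = c("pol", "age_gr", "nat_gr"))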
Example 4.3. Figure 4.1 shows the proportions of policies with duration one contained in Data
Sets A and B, when considering the covariates age of driver, nationality, price of car and kilometres
yearly driven as relevant. The blue area denotes the total number of observations in each data set (normalized to 100 for Data Set A) and the grey area shows the number of policies with a full year at risk. For this analysis we used the data of one accounting year.
Figure 4.1 – Proportions of observations with a full duration according to the variation of data set (left panel: Data Set A, right panel: Data Set B).
4.2.2 Descriptive Statistics
To get an understanding of the claims count distribution introduced in Definition 4.1, we first examine a few descriptive statistics. In this section, we restrict the data to one accounting year. In Figure
4.2, we see that the number of claims in Data Set A lies between zero and five.
Figure 4.2 – Histograms of the number of claims (nc) for Data Set A (left panel: number of observations, right panel: log(number of observations)).
From Figure 4.3, we see that the observed frequencies go up to 170 claims per year. Comparing with
Figure 4.2, we conclude that these outliers are due to small durations.
Figure 4.3 – Histograms of the claims frequency (left) and the duration (right) for Data Set A.
The same analysis can be conducted for Data Set B. Due to the exhaustive merging, we expect fewer observations with small volumes. Histograms of the observed claims counts and frequencies are shown in Figure 4.4.
Figure 4.4 – Histograms of the number of claims (nc) and the claims frequency for Data Set B.
Univariate Analysis
In order to better understand the data, we describe a few univariate statistics to get a first impression.
Since we cannot visualize a high-dimensional covariate space, we mainly focus on variables that are
likely to have an impact on the expected claims frequency. Some of them are also included in the
pricing model used by AXA Winterthur. Explicitly, we consider the tariff criteria age of driver, nationality, price of car as well as kilometres yearly driven for the univariate analysis.
In order to make sound statistical statements, we always want to use some version of the law of large numbers based on multiple observations of the same underlying stochastic nature. Unless we have observed several years of data for each policy, it is difficult to treat each policyholder separately.
Denote by N_i^{(t)} for t ∈ {2010, ..., 2014} the claims count of policy i in year t with expected claims frequency λ(x_i), where x_i ∈ R^p are the covariates of policyholder i. Assuming that the claims counts of policy i are independent with N_i^{(t)} ∼ Poi(λ(x_i) v_i^{(t)}), we may use the average frequency over the observed 5 year time period as a predictor for the expected claims frequency

\hat\lambda(x_i) = \frac{1}{\sum_{t=2010}^{2014} v_i^{(t)}} \sum_{t=2010}^{2014} N_i^{(t)}.

The one standard error confidence bounds for λ̂(x_i) are given by

E\big(\hat\lambda(x_i)\big) \pm Vco\big(\hat\lambda(x_i)\big)\, E\big(\hat\lambda(x_i)\big) = \lambda(x_i) \pm Vco\big(\hat\lambda(x_i)\big)\, \lambda(x_i). \qquad (4.2)

Using the Poisson assumption, the coefficient of variation is given by

Vco\big(\hat\lambda(x_i)\big) = \frac{1}{\sqrt{\lambda(x_i) \sum_{t=2010}^{2014} v_i^{(t)}}}.

Considering the Motor Insurance Collision Data Set, the expected claims frequencies lie in λ(x_i) ∈ [5%, 25%]. Assuming that the policyholder is a fairly good driver with an expected claims frequency of 10% and that he was insured over the whole period, i.e. v_i^{(t)} = 1 for all t ∈ {2010, ..., 2014}, we get a fairly large one standard error confidence interval of λ̂(x_i) ∈ [-4%, 24%].
Therefore, instead of treating every policyholder separately, actuaries form subportfolios with similar risk characteristics. That is, we partition the covariate space into different groups such that we still capture the characteristics of the covariates whilst ensuring enough observations per group to be able to make statistical inference. Let N_1^i, ..., N_{m_i}^i denote the claims counts of policies within risk class i. Assume that the N_j^i ∼ Poi(λ_i v_j) are independent for j = 1, ..., m_i and for all i. Then the total number of claims of risk class i is given by

N^i := \sum_{j=1}^{m_i} N_j^i \sim \text{Poi}\Big(\lambda_i \sum_{j=1}^{m_i} v_j\Big), \qquad (4.3)

and we denote by v^i := \sum_{j=1}^{m_i} v_j the volume of risk class i. Then the coefficient of variation of the sample claims frequency N^i / v^i of risk class i is given by

Vco\left(\frac{N^i}{v^i}\right) = \frac{1}{\sqrt{\lambda_i \sum_{j=1}^{m_i} v_j}}.

Again consider a class of fairly good drivers with expected claims frequency of λ_i = 10%. In order to achieve a confidence interval for our estimate λ̂_i = N^i / v^i of [9.5%, 10.5%], we need to solve

\big| Vco(\hat\lambda_i)\, \lambda_i \big| < 0.5\%.

This reveals that v^i > 4000, i.e. we need a volume of 4000 years at risk in risk class i. This shows how the uncertainty of a predictor can be reduced by forming homogeneous subportfolios, but also that in an insurance context we need rather large portfolios. Table 4.2 shows a summary of the data and subportfolios that we consider for the univariate analysis.
Variable Name   Type         Values   Description
ac              integer      ≥ 0      Total claim amount
nc              integer      N_0      Number of claims
yr              numerical    [0, 1]   Duration (years at risk)
age gr          categorical  1 - 8    Age of driver encoded in 8 groups
nat gr          categorical  1 - 4    Nationality of driver encoded in 4 groups
price gr        categorical  1 - 8    Price of the car encoded in 8 groups
km gr           categorical  1 - 5    Kilometres yearly driven encoded in 5 groups

Table 4.2 – Response variables and grouped covariates used for univariate analysis.
Note that in general the estimator λ̂_i = N^i / v^i does not coincide with the sample mean of the observed claims frequencies, i.e.

\frac{N^i}{v^i} = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i = \frac{1}{m_i} \sum_{j=1}^{m_i} \frac{m_i}{\sum_{j=1}^{m_i} v_j}\, N_j^i \;\ne\; \frac{1}{m_i} \sum_{j=1}^{m_i} \frac{N_j^i}{v_j}.

Due to outliers in the observed claims frequencies N_j^i / v_j caused by small durations (see Figure 4.3), the use of the sample mean for the expected claims frequency is questionable.
Lemma 4.4 (Estimator for expected claim frequency). For j = 1, ..., m_i, let N_j^i denote the number of claims of policy j in risk class i with expected claim frequency E(N_j^i / v_j) = λ_i. Then

\hat\lambda_i = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i \qquad (4.4)

is an unbiased estimator for λ_i.
The proof follows by a direct calculation. Note that (4.4) balances small durations v_j by first summing them up. This results in an estimator which is robust against outliers caused by small volumes v_j. Alternatively, if we assume that the policies within each risk group have independent Poisson-distributed claims counts N_j^i with the same expected claims frequency, then the estimator (4.4) is also the maximum likelihood estimator.
Lemma 4.5. Let N_1^i, ..., N_{m_i}^i be independent and Poi(λ_i v_j)-distributed for λ_i > 0. Then the maximum likelihood estimator for λ_i is given by

\hat\lambda_i = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i.
The proof for Lemma 4.5 follows from Example 2.3. Having introduced the concept of risk classes, we have not yet explained how these groups are formed. In practice, actuaries determine these risk groups exogenously, keeping in mind the trade-off between capturing the covariates' characteristics and the total volume of the group. Alternatively, we can use the tree construction algorithm to fit a univariate tree, which does not require an a priori specification of the risk groups. The resulting leaves can be interpreted as risk classes. For the univariate analysis below, we consider both the risk groups used by AXA and risk groups obtained by the use of the tree algorithm. More details on determining risk groups from regression trees can be found in Section 5.4. To get a better understanding of the data, we show for each tariff criterion and risk group the observed average claims frequency using the estimator (4.4). Additionally, we provide error whiskers of plus/minus one standard error.
The standard errors were derived assuming that the claims counts N_j^i within risk group i are independent Poi(λ_i v_j)-distributed. It follows that

Var(\hat\lambda_i) = \frac{1}{\big(\sum_j v_j\big)^2} \sum_j Var(N_j^i) = \frac{1}{\big(\sum_j v_j\big)^2} \sum_j \lambda_i v_j = \frac{\lambda_i}{\sum_j v_j},

where we used the independence in the first equality and the Poisson assumption in the second. Therefore the one standard error whiskers are given by

\hat\lambda_i \pm \sqrt{\frac{\hat\lambda_i}{\sum_j v_j}}.
Remark 4.6. Using the central limit theorem (or directly by (4.3)), we can quantify the probability that the predictor λ̂_i lies in the one standard error confidence bounds (4.2),

P\Big(\hat\lambda_i \in \big[\lambda_i - Vco(\hat\lambda_i)\lambda_i,\; \lambda_i + Vco(\hat\lambda_i)\lambda_i\big]\Big) \approx 70\%.

The proof can be found in Appendix A.5.
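As a small illustration (not code from the thesis), the estimator (4.4) and the one standard error whiskers can be computed per risk group in R as follows; the column names nc (number of claims), yr (duration) and the grouping variable are assumptions in the spirit of Table 4.2.

# Sketch: empirical claims frequency per risk group with one standard error
# whiskers, based on estimator (4.4) and the Poisson assumption above.
estimate_frequencies <- function(data, group) {
  vol  <- tapply(data$yr, data[[group]], sum)   # volume per risk class
  clm  <- tapply(data$nc, data[[group]], sum)   # total claims per risk class
  freq <- clm / vol                             # estimator (4.4)
  se   <- sqrt(freq / vol)                      # one standard error whisker
  data.frame(group = names(vol), volume = vol, frequency = freq,
             lower = freq - se, upper = freq + se, row.names = NULL)
}

# e.g. estimate_frequencies(policies, "age_gr")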
Tariff Criterion 1: Age of Driver
First we consider the tariff criterion age of driver. The risk groups chosen by AXA are shown in the left panel of Figure 4.5. The blue bars denote the average claims frequency within each risk group, whereas the grey bars denote the total volume of each group. Figure 4.5 shows that there are large differences in the claims frequency with respect to age of driver. This indicates that the covariate plays an important role in a statistical analysis of the expected claims frequency. This result is not very surprising, since we expect younger drivers to be less experienced and therefore more likely to be involved in a car accident. We might expect a similar behavior for the older drivers, due to decreasing reaction speed or deteriorating eyesight. The reason why we do not see this increase in Figure 4.5a is that the last risk group covers a rather large range of different ages. This large age group contains roughly 15% of the whole portfolio. As a consequence, we might not be capturing the true dependency of the response on the covariate age of driver. We have repeated the above analysis with a refinement of the last age group. Figure 4.5b shows the expected U-shape for the modified groups. The third plot shows the results obtained when applying the tree algorithm for choosing the optimal risk groups. Although fewer groups are chosen, the tree captures the same characteristics. Overall the frequencies of the subportfolios vary between roughly 9% and 19% (based on the risk groups obtained by the tree predictor), supporting the relevance of age of driver when modeling the expected claims frequency of a policyholder.
Figure 4.5 – Bar plots showing the average claims frequency per age group (in blue) obtained from the Motor Insurance Collision Data Set. The left plot shows the age groups chosen by AXA, the middle plot the refined risk groups and the right plot the risk groups chosen by the tree predictor. The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.
Tariff Criterion 2: Nationality
The second tariff criterion we consider is nationality. In total there are around 160 different nationalities within the Motor Insurance Collision Data Set. However, among them are many exotic nationalities, e.g. Macao, that contain only very few observations, resulting in a total volume of less than one year at risk. Hence the need to create reasonable risk groups is even more pressing for these kinds of covariates. AXA has created 4 risk groups, shown in the left panel of Figure 4.6.
Figure 4.6 – Bar plots showing the average claims frequency per nationality group (in blue) obtained from the Motor Insurance Collision Data Set. The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.
In the right panel of Figure 4.6 we see the risk groups chosen by the tree predictor. In both cases, a large group containing the low risk nationalities is chosen. Then there are two groups for medium risk nationalities and one group for the high risks. Note that for the nationality analysis, we have used a larger data set in order to have more observations for exotic countries. This is also the reason why the overall portfolio average in Figure 4.6 does not exactly coincide with the averages shown for the other tariff criteria. Overall the observed frequencies of the subportfolios vary between 11% and 20%, indicating that nationality plays an important role in predicting the claims frequency.
Tariff Criterion 3: Price of Car
The third tariff criterion is price of car. The risk groups chosen by AXA are shown in the left panel of Figure 4.7. We observe an approximately linear increase of the frequency in the covariate price of car. A similar structure is also captured by the corresponding univariate tree predictor. However, the tree adds an additional split for very expensive cars above 175'000 CHF. This suggests that drivers
with a car whose price exceeds a certain threshold start to drive more carefully. Another explanation might be that these are cars, e.g. vintage cars, which are only driven at weekends, i.e. there is a correlation between price and another covariate such as kilometres yearly driven. Comparing the risk groups chosen by AXA and by the tree algorithm, AXA's risk groups are more equally sized. However, there is only a small difference in the average frequency between the groups 20-25 and 25-30. Looking at the tree predictor, we see that there exists only one risk group from 21-32, which is basically a combination of the two groups mentioned before. This suggests that we can reduce the number of risk groups whilst still capturing the underlying structure. Overall the observed claims frequencies of the different subportfolios vary between 8% and 17%, which is again a clear indication for an important covariate with respect to predicting the claims frequency.
Figure 4.7 – Bar plots showing the average claims frequency per price group (in blue) obtained from the Motor Insurance Collision Data Set. The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.
Tariff Criterion 4: Kilometres Yearly Driven
The last tariff criterion we consider in the univariate analysis is kilometres yearly driven. The risk groups chosen by AXA are shown in the left panel of Figure 4.8. Very similar to the price covariate, we observe an increase of the average claims frequency in the amount of kilometres yearly driven. This is in line with the intuition that driving larger distances per year increases the exposure to possible accident events and therefore the likelihood of a claim. The risk groups chosen by AXA are supported by the tree predictor. The only difference is that the tree does not split the risk group 6.5 to 14 into two separate groups, as is done by AXA. Overall the observed frequencies vary between roughly 8% and 13%. Although there seems to be a difference between the subportfolios,
the magnitude is less distinct than for the criteria previously considered. This may indicate that the criterion is not as important for the claims frequency analysis.
Figure 4.8 – Bar plots showing the average claims frequency per KM group (in blue) obtained from the Motor Insurance Collision Data Set. The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.
4.2.3 Multivariate Analysis
Ultimately, our goal is to have a multivariate model, taking dependencies of the covariates into account, e.g. between kilometres yearly driven and expensive cars. In Figure 4.9, the lower diagonal part shows a scatter plot of the continuous covariates age of driver, price of car and kilometres yearly driven. We observe a decrease in the kilometres yearly driven for cars with a price above 40'000 CHF. The upper diagonal shows the correlation coefficients between the covariates. The correlations do not show any sign of (linear) dependence between the covariates. On the diagonal, we see a histogram for each covariate. For age of driver, we observe an approximately normal shape. The covariates price of car and kilometres yearly driven are plotted on a log-scale, hence reveal a heavily skewed shape with a few very expensive cars resp. drivers with a lot of driving kilometres per year.
Figure 4.9 – Scatter and correlation plot for the covariates age of driver, price of car and kilometres yearly driven. The figure is based on a random subsample of 10% of the whole data set.
4.3 Simulated Insurance Claims Count Data
Besides the Motor Insurance Collision Data Set we will also use simulated data sets. This has the advantage that we can artificially manipulate certain characteristics in order to identify the key issues that are important when fitting a tree predictor to an insurance data set. The simulated data tries to recover the characteristics discovered in the previous section and depends on two covariates: age of driver and price of car. More precisely, we have chosen the following model for the frequencies:

\lambda(\text{age}, \text{price}) = \frac{8\lambda_0}{100}\left(5 + \frac{(\text{age}-50)^2}{96} + \frac{5\,\text{price}}{36}\right), \qquad (4.5)

where λ_0 denotes a scaling factor which controls the overall portfolio average. Figure 4.10 shows the model equation (4.5) for λ_0 = 0.1 and age ∈ [18, 90] resp. price ∈ [15, 85]. The modeled frequencies lie in λ(age, price) ∈ [5.6%, 26.7%].
Figure 4.10 – Illustration of claims frequencies obtained by model equation (4.5) with λ_0 = 0.1 (left panel: bivariate view, right panel: univariate views).
4.3.1 Negative-Binomial Distribution
In order to simulate from model equation (4.5), we need to specify the claims count distribution.
The Poisson distribution has the special characteristic that its expected value equals the variance, i.e. E(N) = Var(N). Unfortunately, this is usually not supported by real data such as the Motor Insurance Collision Data Set. Insurance claims data rather suggests that E(N) < Var(N). In order to account for this issue, we introduce the negative-binomial distribution.
Definition 4.7 (Negative-Binomial Distribution). N has a negative-binomial distribution, write N ∼ NegBin(λv, γ), with volume v > 0, expected claims frequency λ > 0 and dispersion parameter γ > 0, if

1. Θ ∼ Γ(γ, γ),
2. conditionally, given Θ, N ∼ Poi(Θλv).

The idea of Definition 4.7 is to attach a source of uncertainty to the claims frequency λ. This additional volatility leads to the desired property,

Var(N) = E(Var(N \mid \Theta)) + Var(E(N \mid \Theta)) = v\lambda + (v\lambda)^2/\gamma > \lambda v = E(N).
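As a quick sanity check of this property (a sketch with illustrative parameter values, not taken from the thesis), one can simulate from Definition 4.7 and compare the empirical mean and variance:

# Simulate N ~ NegBin(lambda * v, gamma) via the Gamma mixture of Definition 4.7.
set.seed(1)
v <- 1; lambda <- 0.1; gamma <- 1.5
theta <- rgamma(1e6, shape = gamma, rate = gamma)  # E(theta) = 1, Var(theta) = 1/gamma
N <- rpois(1e6, theta * lambda * v)
mean(N)  # close to lambda * v = 0.1
var(N)   # close to lambda*v + (lambda*v)^2/gamma = 0.1067 > E(N)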
![Page 66: Data Science in Non-Life Insurance Pricing · Predicting Claims Frequencies using Tree-Based Models ... priori knowledge of the structure and can also be used when a large number](https://reader033.vdocument.in/reader033/viewer/2022060509/5f2498da36ace52d0a42ccac/html5/thumbnails/66.jpg)
4.3. SIMULATED INSURANCE CLAIMS COUNT DATA 58
For the simulation studies in the following chapters, we will use equation (4.5) to model negative-binomially distributed claims data, where we choose the dispersion parameter as well as the overall portfolio average λ_0 from the Motor Insurance Collision Data Set. The choice of parameters is further discussed in Section 4.3.2.
Remark 4.8. Note there is an important difference between the Poisson-Gamma Model (Definition 3.11) and the Negative-Binomial Model. Assume that N_1, ..., N_m are independent negative-binomial with N_i ∼ NegBin(λ(x_i) v_i, γ), which means that each N_i has its own independent latent factor Θ_i ∼ Γ(γ, γ). For the Poisson-Gamma Model, on the other hand, the latent factor Λ is identical for each N_i and the N_i are only conditionally independent.
In order to simulate claims counts from model equation (4.5), we have implemented a data generator,
which proceeds in the following steps:
Algorithm 4.1 Generating Data for Simulation

Inputs: Sample size n, dispersion parameter γ, average overall portfolio frequency λ_0.

1. Generate an independent covariate matrix (age, price) ∈ R^{n×2}.
2. Calculate the expected claims frequencies λ(age, price) according to model equation (4.5).

for i = 1, ..., n do
    Generate Θ ∼ Γ(γ, γ).
    Generate N_i ∼ Poi(λ((age, price)_i) Θ), where (age, price)_i is the i-th row of (age, price).
end for
For simplicity, the durations v_i are set to one for all i. Note that the covariates are also simulated.
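The following is a minimal sketch of Algorithm 4.1 (not the thesis implementation); for simplicity it samples the covariates uniformly over the ranges used in Figure 4.10 instead of from the empirical distribution, and the default parameter values are only illustrative.

# Sketch of Algorithm 4.1: simulate negative-binomial claims counts from
# model equation (4.5) via the Gamma mixture of Definition 4.7.
simulate_claims <- function(n, gamma = 1.5, lambda0 = 0.1) {
  age    <- runif(n, 18, 90)
  price  <- runif(n, 15, 85)                               # in 1'000 CHF
  lambda <- 8 * lambda0 / 100 *
            (5 + (age - 50)^2 / 96 + 5 * price / 36)       # equation (4.5)
  theta  <- rgamma(n, shape = gamma, rate = gamma)         # latent factor
  nc     <- rpois(n, lambda * theta)                       # claims counts, v_i = 1
  data.frame(age = age, price = price, yr = 1, nc = nc)
}

# e.g. sim <- simulate_claims(n = 100000)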
For the simulation studies we will use the following setting:
Assumption 4.9 (Simulation Study Data Set).

Number of observations n = 2'500'000;

Overall claims frequency λ_0 = 0.1;

Covariate matrix (age, price) simulated with (age, price)_i ∼ P, where P denotes the empirical distribution of the covariates age and price in the Motor Insurance Collision Data Set;

Claims count distribution N_i ∼ NegBin(λ((age, price)_i), γ), where λ(·) is given by model equation (4.5) and γ = 1.5, see Section 4.3.2.
4.3.2 Choice of Parameters
The simulated data set contains the two parameters λ_0 and γ. Therefore we briefly discuss how we choose them for the simulated data according to Assumption 4.9.
Dispersion Parameter for Negative-Binomial Model:

In order to simulate negative-binomially distributed claims counts, we need to estimate the dispersion parameter γ. Assume that N_1, ..., N_n are independent with N_i ∼ NegBin(λ(x_i) v_i, γ). We describe a maximum likelihood based approach to determine γ, see Section 2.4. The derivative of the log-likelihood function for the claims counts N_1, ..., N_n is given by

\frac{\partial}{\partial \gamma} \log L(\gamma) = \frac{\partial}{\partial \gamma} \sum_{i=1}^{n} \left[ \log \binom{N_i + \gamma - 1}{N_i} + \gamma \log(1 - p_i) + N_i \log p_i \right] = 0, \qquad (4.6)

where p_i = λ(x_i) v_i / (γ + λ(x_i) v_i). The proof of equation (4.6) is based on the Polya parametrization of
the negative-binomial distribution and can be found in Chapter 2 of Wuthrich (2015). Unfortunately,
to solve this equation, we need to know the true λ(x_i) for i = 1, ..., n. However, in practice we do not know the true underlying model equation; in fact, predicting it is our ultimate goal. In order to derive an estimate for γ from the Motor Insurance Collision Data Set, we may choose λ(x_i) ≡ \sum_{i=1}^{n} N_i / \sum_{i=1}^{n} v_i, the overall portfolio average. Then we can solve equation (4.6) numerically, resulting in an estimate of γ ≈ 1.08. However, this estimate is likely to be too small, since it does not reflect the structural differences. Therefore, we estimate γ again on the subset of policies with age ∈ [40, 80] and price ∈ [40'000, 60'000] CHF. This is the largest risk class chosen by the bivariate regression tree on the Motor Insurance Collision Data Set, see Section 5.5. This procedure takes the underlying structure into account, such that the additional variability in the data only comes from the overdispersion parameter. This results in an estimate of γ ≈ 1.5. Using that

Vco\left(\frac{N_i}{v_i}\right) = \sqrt{\frac{1}{v_i\, \lambda(x_i)} + \frac{1}{\gamma}},

as well as λ(x_i) ∈ [5.6%, 26.7%], we get a coefficient of variation between 2.1 and 4.3 for policies with a full duration.
Overall Portfolio Average λ_0:

The scaling of model equation (4.5) is chosen such that, when simulating from Assumption 4.9, the overall portfolio average

\frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i

is roughly equal to λ_0. For the Motor Insurance Collision Data Set we observe an overall portfolio frequency close to 10%, hence we choose λ_0 = 0.1.
Chapter 5
Data Analysis: Tree Model
In the present chapter, we fit the tree model introduced in Chapter 3 to insurance claims data. We
focus on the issue of predicting the expected claims frequency of a policyholder. First we use a
simulated data set and in a second step the Motor Insurance Collision Data Set. The data analysis
was conducted using the programming language R.
5.1 Fitting a Tree Model in R
The package rpart provides an efficient implementation of the recursive binary partitioning introduced in Section 3.1.2. For growing a tree, we use the function rpart. For illustration, we first present a detailed example of the functionality of the rpart package:
Step 1: Growing a tree
Given a response y = N/v and covariates x1, . . . , xp, we can fit a tree using the rpart function as
follows:
tree_anova   <- rpart(y ~ x1 + ... + xp, data = data, method = "anova",
                      weights = w,
                      control = rpart.control(minbucket = minbucket,
                                              usesurrogate = usesurrogate, cp = cp))

tree_poisson <- rpart(cbind(v, N) ~ x1 + ... + xp, data = data, method = "poisson",
                      parms = list(shrink = k),
                      control = rpart.control(minbucket = minbucket,
                                              usesurrogate = usesurrogate, cp = cp))
Comments:

The input method specifies the type of tree. Options are: anova for classical sum of squares regression (see Section 3.1.2), poisson for Poisson regression trees (see Section 3.2.1) and class for Gini-index based classification trees (see Section 3.2.3).

The optional input weights can be used in combination with the method anova to obtain a weighted anova tree (see Section 3.2.2).

minbucket sets the minimum number of observations to be contained in a leaf.

cp denotes a factor which specifies the minimum improvement of the overall fit such that a split will be attempted. See Remark 3.9 and Step 2 below for more details. Choosing cp = 0 grows a tree as large as possible, until no further split is possible. However, choosing a reasonable cp-value may save a large amount of computing time.

The input usesurrogate controls the use of surrogate variables (see Section 3.3.2).

For the method poisson, we can hand over a shrinkage parameter k, which denotes the coefficient of variation of the prior distribution in the Poisson-Gamma model, see Section 3.2.1 as well as Section 5.3.1.
For more details we refer to Therneau et al. (2015).
Step 2: Extracting the optimal pruning parameter
Before we can proceed with the pruning process, we need to choose an optimal tuning parameter α (see Section 3.1.3). This is usually done by K-fold CV (see Section 2.3.1). Recall from Section 3.1.3 that there exists an increasing sequence α_k of optimal pruning parameters, each corresponding to an optimally pruned subtree T(α_k). Figure 5.1 shows the cross-validation errors of the optimally pruned subtrees T(α) as a function of the cost-complexity parameter cp, see Remark 3.8. Figure 5.1a shows the cross-validation error (2.7), whereas Figure 5.1b shows the cross-validation error normalized by R(T(1)), i.e. the error of the root tree, denoted by X-val Relative Error. This corresponds to the output of the built-in function plotcp. As we show in the next sections, choosing the optimal α resp. cp is a delicate issue. The most common selection criteria are the following (a short R sketch for extracting the first two choices follows the list):
Minimum Cross-Validation Error: Select the subtree with minimal cross-validation error, see the black vertical line in Figure 5.1a.

1-Standard Error (1se) Rule: In practice, one often encounters a U-shaped cross-validation error plot with a flat valley around the minimum. Breiman et al. (1984) suggested to choose the smallest subtree within one standard error of the subtree with minimal cross-validation error. This results in a tree with only a slightly larger error, but a possibly significant reduction in the complexity of the model, see the red vertical line in Figure 5.1a.

Elbow Criterion: Negassa et al. (2000) suggested to choose the subtree corresponding to the point beyond which the cross-validation error stops decreasing rapidly. In Figure 5.1a the elbow criterion coincides with the 1se rule.
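A minimal sketch (not the thesis code) of how the first two criteria can be extracted from the cptable of a fitted rpart object is given below; the elbow criterion is a visual choice and is not automated here.

# Sketch: cp according to the minimum CV error and the 1se rule.
select_cp <- function(tree) {
  cptab  <- tree$cptable
  i_min  <- which.min(cptab[, "xerror"])                   # minimum CV error
  thresh <- cptab[i_min, "xerror"] + cptab[i_min, "xstd"]  # 1se threshold
  i_1se  <- min(which(cptab[, "xerror"] <= thresh))        # smallest subtree within 1se
  c(min_cv = cptab[i_min, "CP"], one_se = cptab[i_1se, "CP"])
}

# e.g. cp_opt <- select_cp(tree_poisson)["one_se"]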
Figure 5.1 – The figure shows the 10-fold cross-validation error (2.7) as a function of the cost-complexity parameter cp. The vertical lines denote possible choices of the cp parameter according to different selection criteria. The vertical error bars show the standard errors (2.8). On the left (customized CV output), we see the actual cross-validation error, whereas on the right (output of plotcp()) we see the error normalized by the error of the root tree. The plot on the right-hand side can be obtained by the function plotcp provided within the rpart package.
Step 3: Pruning
Having chosen the optimal cost-complexity parameter cp resp. α, we need to prune to the optimal subtree T(α), as described in Section 3.1.3. The rpart package provides the function prune for this
task. The following command returns the desired optimally pruned subtree.
tree_pruned <- prune(tree, cp = cp_opt)
5.2 Regression Tree for Claims Frequency
In order to analyze the performance of the tree method within the context of insurance claims data, we proceed in the following steps:

1. Simulated Data Set: We start our analysis on a generated data set in Section 5.3. This has the advantage that we know the true underlying model, i.e. when manipulating the parameters, we can compare the fitted to the true values.

2. Univariate Trees: When applying the method to the Motor Insurance Collision Data Set, we start with univariate regression trees in Section 5.4. This allows us to understand the splits in more detail, and we can use the results to back-test the risk classes introduced in Section 4.2.2.

3. Bivariate Trees: In Section 5.5, we slightly increase the complexity by using two covariates. For two covariates, we can still understand the interactions and are able to visualize the results.

4. Multivariate Trees: Finally, in Section 5.6 we consider a fully multivariate model, taking into account all covariates recorded.
Given the volume v > 0, we model the response (2.1) by

Y = \frac{N}{v} = f(X) + \varepsilon,

where N denotes the claims count. Given the volume v_i > 0, we define the frequency of the i-th policy by

\frac{N_i}{v_i} = \frac{N}{v_i}\Big|_{X = x_i} = f(x_i) + \varepsilon_i.

Let N_1, ..., N_n denote the claims counts for each policy, v_1, ..., v_n their respective durations and the expected claims frequencies for i = 1, ..., n given by

\lambda(x_i) = E\left(\frac{N_i}{v_i}\right) = E\left(\frac{N}{v_i}\,\Big|\, X = x_i\right) = f(x_i).

In the context of regression trees, our predictor is of the form

\hat\lambda(x_i) = \hat f_n(x_i) = \sum_{m=1}^{M} \hat\lambda_m\, 1_{\{x_i \in R_m\}},

where R_m, m ≥ 1, denote the leaves of the resulting tree and λ̂_m the constant in each leaf. In order to benchmark the results, we compare them to a standard GLM model with a Poisson assumption (see Ohlsson and Johansson (2010) and Chapter 7, Wuthrich (2015)).
5.3 Regression Tree for Simulated Claims Frequency
In a first step, we analyze the tree method on a simulated data set. This has the advantage that we can artificially manipulate certain characteristics in order to identify the key issues that are important when fitting a tree predictor to an insurance data set. The simulated data introduced in Section 4.3 tries to recover the characteristics of the Motor Insurance Collision Data Set and involves the two covariates age of driver and price of car. When fitting a regression tree based on the weighted sum of squares option, we use the weights w_i = v_i. This results in the model introduced in Example 3.17. When simulating from Assumption 4.9, all observations have a volume of one year at risk. Hence the MLE estimators of the methods Anova, Weighted Anova and Poisson coincide. For the Weighted Anova and Anova also the minimization criteria coincide, therefore they result in the same tree. Simulating from Assumption 4.9, the claims counts are conditionally Poisson-distributed, therefore we choose the Poisson method to demonstrate the results.
5.3.1 Choice of Parameter for Poisson Tree Predictor
For the Poisson tree predictor we have two additional parameters, which need to be estimated from the Motor Insurance Collision Data Set.

Minimum Leaf Size:

In Section 4.2.2, we have seen that for Poisson-distributed claims counts, we need a total of 4000 years at risk within a risk class to obtain sufficiently small confidence bounds, i.e. standard errors smaller than ±0.5%. Choosing the minimum leaf size sufficiently large may also reduce the variability of the predictor, see Section 3.4.2. Since each leaf of a tree predictor denotes a risk class, we set the minimum leaf size parameter minbucket = 4000. Therefore the minimum volume in a leaf is equal to or slightly smaller than 4000 years at risk, if there are observations without a full duration.
Shrinkage Parameter for Poisson-Gamma Model:

The Poisson method contains the shrinkage parameter k = Vco(Λ), which denotes the coefficient of variation in the Poisson-Gamma Model assumed in each leaf, see Section 3.2.1. Again, choosing k from a real world data set is a delicate issue. If we assume that in each leaf l the claims counts N_1, ..., N_{m_l} are described by a Poisson-Gamma model, then we would choose a different k for each leaf. However, we do not know the leaves in advance; instead we use the tree to determine them. Therefore the algorithm is designed to choose the same k for all leaves. In order to obtain an estimate for k, we consider the largest risk group chosen by the bivariate regression tree (see Section 5.5) with age ∈ [40, 80] and price ∈ [40'000, 60'000] CHF. Let N_1, ..., N_m be the claims counts of policies in this risk class. Only considering one leaf removes the structural differences between the leaves. Therefore it is reasonable to assume that all observations have the same expected frequency. Recall from Definition 3.11 that, conditionally given Λ, the N_1, ..., N_m are independent with N_i ∼ Poi(Λv_i) and prior distribution Λ ∼ Γ(γ, c). The parameter c is calibrated using γ/c = E(Λ) = λ_0, where λ_0 is estimated by the average frequency within the risk class, i.e.

\hat\lambda_0 = \frac{1}{\sum_{i=1}^{m} v_i} \sum_{i=1}^{m} N_i.

To estimate the parameter γ, note that

E(N_i) = E(E(N_i \mid \Lambda)) = E(\Lambda v_i) = \lambda_0 v_i,
and since Var(Λ) = γ/c² = λ_0²/γ we have that

Var(N_i) = Var(E(N_i \mid \Lambda)) + E(Var(N_i \mid \Lambda)) = v_i^2 \frac{\lambda_0^2}{\gamma} + v_i \lambda_0.

Therefore the coefficient of variation of N_i is given by

Vco(N_i) = \sqrt{\frac{1}{v_i \lambda_0} + \frac{1}{\gamma}}.

In order to estimate γ, we restrict the data to observations with a full duration, i.e. v_i = 1, and use the method of moments estimator for Vco(N_i) given by

\widehat{Vco}(N_i) = \frac{\widehat{sd}(N_1, \ldots, N_m)}{\hat\mu(N_1, \ldots, N_m)} = \sqrt{\frac{1}{\hat\lambda_0} + \frac{1}{\gamma}},

where μ̂ denotes the sample mean and sd̂ the sample standard deviation. Solving for γ results in an estimate γ ≈ 1.37 and therefore a shrinkage parameter of k = 1/√γ ≈ 0.87.
Note that we have set the minimum leaf size to 4000 observations. Assume for simplicity that v_i = 1 for all observations in a leaf with average frequency λ_0 = 10%. By equation (3.10), the credibility weight α is then given by

\alpha(k) = \frac{4000}{\frac{1}{0.1\, k^2} + 4000}.
Figure 5.2 shows the credibility weights as a function of the coefficient of variation. We observe that α(k) converges to 1 very quickly, meaning that we mainly believe in the observation based estimator. We conclude that the parameter is only of minor importance regarding the outcome of the predictor. It also justifies the use of the same k for every leaf.
Figure 5.2 – Credibility weights α(k) as a function of the coefficient of variation.
5.3.2 Stratified Cross-Validation
Classically the optimal tree size is obtained by cross-validation, see Step 2 in Section 5.1. In this section, we illustrate several issues associated with the use of cross-validation for insurance claims data and propose stratified cross-validation as a possible solution.
Breiman et al. (1984) suggest to use the one standard error (1se) rule instead of the minimal cross-validation error subtree for the choice of the tree size. However, Negassa et al. (2000) have explicitly studied tree size selection criteria for Poisson regression trees and observed that the standard errors of the cross-validation results are often very large, leading to a pessimistic interpretation of the one standard error rule. They proposed the elbow criterion as a possible solution. In order to analyze the situation for insurance claims data, we have generated a data set using Assumption 4.9. Figure 5.3 shows the output of the standard plotcp function. Note that we observe only a small improvement of slightly more than 1% from the root tree to the minimum cross-validation error tree in terms of cross-validation error. Additionally, we observe that the standard errors for the cross-validation folds are fairly large. Similar to the findings of Negassa et al. (2000), this causes the 1se rule to choose a small tree. In our case the 1se rule results in a tree with only 4 leaves, in contrast to 56 leaves for the minimum cross-validation error (Min CV) tree, indicated by the red resp. black vertical lines in Figure 5.3. The elbow criterion on the other hand chooses a tree with 9 leaves.
Figure 5.3 – Cross-validation results obtained by the plotcp function.
Summing up, there are two main issues associated with the cross-validation results in Figure 5.3:
1. Why are the relative improvements to the mean model (i.e. root tree) only of the magnitude
of roughly 1%?
2. Why are the standard errors of the cross-validation errors so large?
The answer to both questions is related to the unbalancedness of the data. Figure 5.4 shows a histogram of the observed frequencies of the generated data set. We see that the data is unbalanced in the sense that the vast majority of observations are zero, due to the low expected frequency. Considering the mean squared error as a measure of goodness, by Proposition 2.2 we have the following decomposition

E\big((Y - \hat Y)^2\big) = E\Big(\text{Bias}\big(\hat f(X, \mathcal{L})\big)^2\Big) + E\Big(\text{Var}\big(\hat f(X, \mathcal{L})\big)\Big) + \text{Var}(\varepsilon),

which is estimated by the cross-validation error (see Section 2.3.1). Choosing the optimal model, we can only improve the first two terms. However, for unbalanced data the variability of the residuals ε = Y − f(X) is dominant. For the data shown in Figure 5.4, we have calculated the residuals and used the sample variance to estimate the third term. We observe that the sample variance of the residuals is roughly 99% of the cross-validation error. Therefore it dominates the other two terms, which implies that we can only expect to see small differences relative to the root tree.
Figure 5.4 – Histogram of unbalanced data (observed claims frequencies).
The second question could be ignored when using the elbow criterion. However, the subjectivity associated with the elbow approach is not very satisfactory. For unbalanced data, Breiman et al. (1984) suggested the use of stratified cross-validation for estimating the generalization error. Since, due to a small λ_0, the majority of policies have zero claims, when dividing the data randomly into K equally sized subsets it may occur that one of the folds contains a disproportionate amount of large frequencies. This sample will inflate the sum of squares resp. the deviance statistics, which eventually shows up as a large standard deviation. The idea of stratified sampling for the cross-validation folds is to first order the data according to the observed frequencies. Next, form bins containing the K smallest frequencies, then the K next smallest and so on. In the end, we draw from these bins without replacement, resulting in K partitions of the data which all have a similar distribution of the observed frequencies, see Algorithm 5.1 and the sketch below.
Algorithm 5.1 Stratified sampling

Input: Learning data set \mathcal{L} = \{(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)\}.
Output: Stratified cross-validation folds (\mathcal{B}_k)_{k=1,\ldots,K}.

1. Order the data according to a stratification variable, for us the claims frequency y_i = N_i / v_i, resulting in a permutation \pi(i) for i = 1, \ldots, n.

2. Define the index sets

B_k := \{ i \in \{1, \ldots, n\} : \pi(i) - k \ \mathrm{mod}\ K = 0 \},

where mod denotes division with remainder.

3. Set

\mathcal{B}_k := \{ (x_i, N_i, v_i) \in \mathcal{L} : i \in B_k \}.
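A minimal sketch of Algorithm 5.1 in R could look as follows; the column names nc and yr are assumptions, and the folds are assigned deterministically along the ordered frequencies, which corresponds to drawing exactly one observation per bin into each fold.

# Sketch: stratified cross-validation folds according to Algorithm 5.1.
stratified_folds <- function(data, K = 10) {
  ord <- order(data$nc / data$yr)            # order by observed claims frequency
  folds <- integer(nrow(data))
  folds[ord] <- rep_len(1:K, nrow(data))     # spread ordered observations over K folds
  folds                                      # fold label per observation
}

# e.g. data$fold <- stratified_folds(data, K = 10)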
Remark 5.1. A close inspection of the source code of the rpart package reveals that stratified sampling is actually already implemented. The code, translated into R, looks as follows:

data <- data[order(data$y), ]              # order data according to response y
xgroups <- rep(1:K, length = nrow(data))   # calculate the K different folds
                                           # for cross-validation

Unfortunately, when using the Poisson method, the response variable is an (n × 2)-matrix, where the first column contains the volumes v_i and the second the observed claims counts N_i. The implementation orders the data according to the first column, which does not lead to our understanding of stratified sampling.
Due to Remark 5.1, we have implemented the cross-validation procedure separately. The R code can be found in Appendix C.2. Figure 5.5 shows the results obtained for both classical and
stratified cross-validation. We see that stratification significantly reduces the variance, whereas the
cross-validation error itself remains fairly stable. The reduction in variance implies that the chosen
tree size becomes larger when using the 1se rule.
5.3.3 Optimal Tree Size
Since we use a simulated data set, we know the true underlying model equation (4.5) and can therefore
compute the exact estimation error of the obtained tree predictors. This enables us to compute the
optimal tree size and compare this to different selection criteria.
Figure 5.5 – Comparison of stratified (left panel) and classical (right panel) 10-fold cross-validation.
Measure of Performance
A priori it is not clear which measures of performance resp. measures of prediction error should be used. We will discuss different possibilities and highlight their advantages and drawbacks. Assume we are given a learning sample \mathcal{L} := \{(x_1^l, N_1^l, v_1^l), \ldots, (x_m^l, N_m^l, v_m^l)\}, where the claims counts N_i^l are independent with expected claims frequency λ(x_i^l), as well as an independent test data set \mathcal{T} := \{(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)\}, also with independent claims counts N_i and expected claims frequency λ(x_i). Based on the training set \mathcal{L}, we fit the predictor λ̂(x_i) and then evaluate the performance on the test data set \mathcal{T}, using the following criteria:
Absolute Frequency Deviation: The absolute frequency deviation measures the sum of absolute deviations of the predicted frequency from the true frequency,

\sum_{i=1}^{n} \big| \hat\lambda(x_i) - \lambda(x_i) \big|. \qquad (5.1)

The absolute error already penalizes small prediction errors. It is therefore the favorite choice when aiming to find the true underlying structure. Note that typically |λ̂(x_i) − λ(x_i)| < 1, hence a quadratic loss function would shrink the error towards zero, see Figure 5.6. However, due to the large volume of our portfolio, already a small deviation from the true λ has a large impact. This explains the choice of the absolute rather than the squared error. Note there is a catch regarding the absolute frequency deviation: in practice, even if we have an independent test data set, we do not know the function λ(x). Therefore its use is limited to simulated data.
Figure 5.6 – Comparison of the absolute versus squared loss for small errors.

Absolute Claims Deviation: The absolute claims deviation measures the sum of absolute deviations of the predicted claims counts from the observed claims counts,
\sum_{i=1}^{n} \big| \hat N_i - N_i \big|, \qquad (5.2)

where \hat N_i = \hat\lambda(x_i) v_i. The absolute claims deviation can also be used when the underlying model structure is unknown. Note that for the simulated data we set v_i = 1 for all i, therefore the absolute claims deviation is equal to

\sum_{i=1}^{n} \left| \hat\lambda(x_i) - \frac{N_i}{v_i} \right|,

i.e. the absolute observed frequency deviation.
Out of Sample Deviance: For the out of sample deviance, we evaluate the Poisson deviance statistics on the test set \mathcal{T},

\sum_{i=1}^{n} N_i \log\!\left(\frac{N_i}{\hat\lambda(x_i) v_i}\right) - \big(N_i - \hat\lambda(x_i) v_i\big).

The idea of using the deviance statistics is based on the maximum likelihood approach discussed in Section 2.4.
Out of Sample Prediction Error: The out of sample prediction error compares the total number of predicted claims \hat N = \sum_{i=1}^{n} \hat\lambda(x_i) v_i with the total number of observed claims in the test data set N := \sum_{i=1}^{n} N_i,

|\hat N - N| = \left| \sum_{i=1}^{n} \big( \hat\lambda(x_i) v_i - N_i \big) \right|. \qquad (5.3)
A small error means that we are able to predict the overall number of claims fairly well. Instead of considering the whole portfolio, we may also consider each risk class separately and use a weighted sum. Let $R_1, \ldots, R_M$ denote the regions obtained by the tree algorithm, then consider the prediction error
$$\sum_{m=1}^{M} w_m \left|\sum_{\{i:\, x_i \in R_m\}} \big(\hat\lambda(x_i) v_i - N_i\big)\right|,$$
where $w_m$, $m \geq 1$, denote a priori specified weights. However, for the remainder we only consider the out of sample prediction error according to equation (5.3).
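As a concrete, hedged illustration (not code from the thesis), the following R sketch evaluates the four performance measures on a test set; the data frame columns lambda_true, lambda_hat, N and v are illustrative names for the true frequency, the fitted frequency, the observed claims count and the volume.

# Hedged sketch: out-of-sample performance measures for a frequency predictor.
perf_measures <- function(test) {
  with(test, c(
    abs_freq_dev   = sum(abs(lambda_hat - lambda_true)),   # (5.1), simulated data only
    abs_claims_dev = sum(abs(lambda_hat * v - N)),         # (5.2)
    oos_deviance   = sum(ifelse(N > 0, N * log(N / (lambda_hat * v)), 0)
                         - (N - lambda_hat * v)),          # out of sample deviance
    oos_pred_error = abs(sum(lambda_hat * v - N))          # (5.3)
  ))
}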
The choice of the measure of performance may depend on the aim of the analysis. For our purpose of reconstructing the true underlying model based on previously observed data, the absolute frequency deviation would be the most natural choice. Since it is not available in practice, our goal is to see whether any of the other options provides similar results. In order to analyze the optimal tree size, we have generated a learning sample $\mathcal{L}$ based on Assumption 4.9. In addition, we have simulated an independent test data set with n = 100000000 observations. Figure 5.7 shows the above mentioned measures of performance for each subtree. We observe that the out of sample deviance as well as the absolute frequency deviation have a similar behavior and attain their minimum at the same tree size. This is of great importance to us, since on real data we cannot compute the absolute frequency deviation, but we can calculate the out of sample deviance. The absolute claims deviation is decreasing in the tree size, hence it does not protect against overfitting. For the out of sample prediction error, we do not observe a clear structure related to the tree size. We conclude that the optimal tree size lies around 30 leaves. Comparing this result to Figure 5.5 reveals that even when using stratified cross-validation the 1se rule seems to be too pessimistic.

Figure 5.7 – Comparison of the optimal tree size according to different measures of performance (out of sample deviance, absolute frequency deviation, absolute claims deviation, out of sample prediction error). The abbreviation 'OoS' stands for 'Out of Sample'.

Figure 5.8 shows the predicted
values of the 1se tree versus the optimally sized tree (according to the minimum absolute frequency deviation) as well as the true underlying structure corresponding to Figure 5.7. Figure 5.8a shows the predictor when the price is fixed at 30'000 CHF, whereas for Figure 5.8b we fixed the age of driver at 50 years. We see that both fits have their difficulties capturing the continuous shape of the true underlying model. This is, however, not very surprising, due to the definition of the tree predictor as a piecewise constant function. Apart from that, both predictors capture the underlying structure very well.

Figure 5.8 – Fitted values for the tree predictors according to the 1se rule and the optimal tree size compared to the true underlying structure. (a) Price fixed at 30'000 CHF; (b) Age fixed at 50 years.

In order to further justify these findings, we performed a simulation study, where
we simulated 100 independent training data sets $\mathcal{L}_k$ for $k = 1, \ldots, 100$ using Assumption 4.9. For each learning sample, we fit a sequence of tree predictors. Using stratified cross-validation, we then compute the tree size according to the 1se rule, the minimum cross-validation error tree size, as well as a third criterion, called the Mid-Point rule, where we choose the subtree which is the mid-point between the one standard error rule and the minimal cross-validation error. The Mid-Point tree can be obtained by:

# cv_err, cv_se: cross-validation errors and their standard errors per subtree
# (e.g. the xerror and xstd columns of an rpart cptable)
ind_1se <- which(min(cv_err) + cv_se[which.min(cv_err)] > cv_err)[1]   # 1se tree index
ind_mp  <- ceiling((ind_1se + which.min(cv_err)) / 2)                  # Mid-Point tree index
In addition to the learning samples, we have simulated a test sample $\mathcal{T}$ with n = 100000000 observations. Each subtree is validated on the test data set $\mathcal{T}$ by computing the out of sample deviance statistics as well as the absolute frequency deviation. From that we calculate the tree size according to the minimum out of sample deviance resp. the minimum absolute frequency deviation. The results of the simulation study are shown in Figure 5.9. We see that the chosen tree size for both the out
of sample deviance and the absolute frequency deviation lies around 30 leaves. We also see that the one standard error rule is highly over-pessimistic. Figure 5.9a suggests the minimal cross-validation error rule as an optimal selection criterion in order to obtain a tree size which is roughly equal to the tree with minimal absolute frequency deviation. However, in the right panel we see that there is a large variability associated with the minimum cross-validation error rule. The proposed Mid-Point rule gives an alternative to the 1se rule, which provides less pessimistic tree sizes but is still conservative enough to avoid overfitting. As an alternative to the cross-validation-based criteria, the out of sample deviance selects a tree size similar to the absolute frequency deviation.
Figure 5.9 – Simulation study results for the optimal tree size. (a) Chosen tree size for the different measures; (b) difference to the optimal tree size.
5.3.4 Tree Size Selection via Statistical Significance Pruning
In the last section, we have seen that cross-validation in combination with unbalanced data raises some issues. In this section, we present an alternative approach to tree pruning based on statistical significance tests. The idea is to fit a large tree first. Then we perform statistical tests on the leaves to decide whether or not the separation into two groups is significant. If not, we prune the leaf away. We continue this process until no further leaves are pruned off, see Algorithm 5.2. This approach circumvents the problems imposed by cross-validation. The R code for the pruning procedure can be found in Appendix C.3. Note that this approach does not guarantee the final tree to be an optimally pruned subtree according to Definition 3.6.
In order to demonstrate the approach, we have simulated a learning data set $\mathcal{L}$ using Assumption 4.9 and grew a large tree until no further split was possible. For the statistical significance pruning, we used the Wilcoxon test at a significance level of $\alpha = 5\%$. We stopped the pruning process when no further leaf was pruned off for 10 successive bootstrap samples of $\mathcal{L}$.

Algorithm 5.2 Statistical Significance Pruning
Grow a large tree $T_{max}$ based on a learning set $\mathcal{L}$.
Define $N_{prune}$ to be the number of leaves which are pruned off in each iteration. Set $N_{prune} = 1$.
while $N_{prune} > 0$ do
    Draw a bootstrap sample $\mathcal{L}^*$ from $\mathcal{L}$.
    Use $T_{max}$ to predict the values of the test set $\mathcal{L}^*$.
    for all nodes in $T_{max}$ with two terminal leaves $\mathrm{leaf}_L$ and $\mathrm{leaf}_R$ do
        Perform a statistical test based on the null hypothesis
        $H_0: \mathrm{leaf}_L = \mathrm{leaf}_R$ versus $H_A: \mathrm{leaf}_L \neq \mathrm{leaf}_R$.
        If $H_0$ cannot be rejected at significance level $\alpha$, merge the two leaves into one leaf.
    end for
    Update $N_{prune}$.
    Set $T_{max}$ to be the new tree obtained after the pruning.
end while

Figure 5.10 shows the
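The full R code for the pruning procedure is given in Appendix C.3. As a hedged, minimal sketch (not the implementation of the thesis), a single pruning pass for an rpart tree could look as follows; it assumes that the learning sample learn contains the claims count N and the volume v, that its rows correspond to the observations used in the fit, and that the two sibling leaves are compared on a bootstrap sample with a Wilcoxon test.

library(rpart)

# One pass of a significance-based pruning step (illustrative sketch only).
significance_prune_once <- function(fit, learn, alpha = 0.05) {
  frame   <- fit$frame
  nodes   <- as.integer(rownames(frame))        # node numbers of the rpart frame
  is_leaf <- frame$var == "<leaf>"
  # bootstrap sample L* of the learning data; the leaf of a bootstrap row equals
  # the leaf of the corresponding original row in 'fit'
  idx  <- sample(nrow(learn), replace = TRUE)
  leaf <- nodes[fit$where][idx]
  freq <- learn$N[idx] / learn$v[idx]           # observed frequencies on L*
  # internal nodes whose two children are both terminal leaves
  parents <- nodes[!is_leaf]
  parents <- parents[(2 * parents) %in% nodes[is_leaf] &
                     (2 * parents + 1) %in% nodes[is_leaf]]
  to_merge <- integer(0)
  for (p in parents) {
    xL <- freq[leaf == 2 * p]
    xR <- freq[leaf == 2 * p + 1]
    if (length(xL) > 0 && length(xR) > 0 &&
        wilcox.test(xL, xR)$p.value > alpha) {
      to_merge <- c(to_merge, p)                # H0 not rejected: merge the leaves
    }
  }
  list(tree = if (length(to_merge) > 0) snip.rpart(fit, toss = to_merge) else fit,
       n_pruned = length(to_merge))
}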
predicted values for the subtrees obtained by statistical significance pruning (Stat), the one standard error rule (1se) as well as the unpruned tree (Full) for fixed price resp. fixed age. Overall we see that Algorithm 5.2 removed all signs of overfitting. For the simulated learning sample $\mathcal{L}$, the tree obtained by significance pruning contains 18 leaves (similar to the Mid-Point rule, see Figure 5.9a). On the other hand, the 1se rule tree has 10 leaves.
Figure 5.10 – Predicted values of the subtree obtained by statistical significance pruning (Stat) in comparison to the other methods as well as the true underlying structure. (a) Price fixed at 30'000 CHF; (b) Age fixed at 50 years.
The proposed procedure provides an alternative to the cross-validation-based cost-complexity pruning. Note that the significance level $\alpha$ was arbitrarily set to 5%; it should rather be seen as a tuning parameter. Overall, the method provided slightly larger trees compared to the 1se rule; however, further research on this method is needed.
5.3.5 Variability of Data
Due to the definition of the tree predictor as a piecewise constant function, we would like to understand how closely we can approximate the true underlying function, which does not need to be a step function itself. In this section, we argue that the coefficient of variation of the data is an important factor when analyzing the extent to which the tree predictor is able to recover the underlying structure. Therefore we discuss the variability of the data, measured by the coefficient of variation, and its relation to portfolio characteristics such as the claims frequency as well as the volume. In a second step, we then analyze the effects upon the tree predictor.
Poisson Claims Counts:
In a first step, we investigate the situation for Poisson-distributed claims counts. For each policyholder $i = 1, \ldots, n$, denote by $\lambda(x_i)$ its expected claims frequency depending on the covariates $x_i \in \mathbb{R}^p$. Assume that $N_i \sim \mathrm{Poi}(\lambda(x_i) v_i)$ for volumes $v_i > 0$ and $i = 1, \ldots, n$. Then the coefficient of variation of the observed frequencies is given by
$$\mathrm{Vco}\!\left(\frac{N_i}{v_i}\right) = \frac{1}{\sqrt{\lambda(x_i) v_i}}. \qquad (5.4)$$
Therefore, if $\lambda(x_i) v_i \to 0$, the coefficient of variation of the claims frequency becomes infinitely large. This means that the one standard error confidence bounds, which are given by
$$CI = \left( \mathbb{E}\!\left(\frac{N_i}{v_i}\right) - \mathrm{Vco}\!\left(\frac{N_i}{v_i}\right)\mathbb{E}\!\left(\frac{N_i}{v_i}\right),\; \mathbb{E}\!\left(\frac{N_i}{v_i}\right) + \mathrm{Vco}\!\left(\frac{N_i}{v_i}\right)\mathbb{E}\!\left(\frac{N_i}{v_i}\right) \right),$$
become infinitely large as well. As discussed in Section 4.2.2, actuaries solve this issue by introducing risk classes of homogeneous subportfolios. Let $N^i_1, \ldots, N^i_{m_i}$ denote the claims counts of the policies within risk class $i$. Assume that $N^i_j \sim \mathrm{Poi}(\lambda_i v_j)$ are independent for $j = 1, \ldots, m_i$ and for all $i$. Then the total number of claims of risk class $i$ is given by
$$N^i := \sum_{j=1}^{m_i} N^i_j \sim \mathrm{Poi}\!\left(\lambda_i \sum_{j=1}^{m_i} v_j\right), \qquad (5.5)$$
and denote by $v^i := \sum_{j=1}^{m_i} v_j$ the volume of risk class $i$. The coefficient of variation of the expected frequency of risk class $i$ is given by
$$\mathrm{Vco}\!\left(\frac{N^i}{v^i}\right) = \frac{1}{\sqrt{\lambda_i \sum_{j=1}^{m_i} v_j}}.$$
In this light, the increased volume of the subportfolio results in a reduction of the coefficient of variation. We have the asymptotic relation
$$\sum_{j=1}^{m_i} v_j \to \infty \;\Rightarrow\; \mathrm{Vco}\!\left(\frac{N^i}{v^i}\right) \to 0, \qquad (5.6)$$
and hence also the one standard error confidence bounds tend to zero.
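As a small numerical illustration of (5.4) and (5.6) (an assumption-laden sketch, not part of the thesis analysis), the following R code pools independent Poisson claims counts with frequency $\lambda = 0.1$ and compares the empirical coefficient of variation of the observed frequency with the theoretical value $1/\sqrt{\lambda m}$:

# Pooling Poisson claims counts reduces the coefficient of variation (illustrative values).
set.seed(1)
lambda <- 0.1
for (m in c(1, 100, 10000)) {
  N_total <- rpois(1000, lambda * m)   # total claims of a risk class with m policies, v_j = 1
  freq    <- N_total / m               # observed frequency of the risk class
  cat(sprintf("m = %5d: empirical Vco = %.3f, theoretical 1/sqrt(lambda*m) = %.3f\n",
              m, sd(freq) / mean(freq), 1 / sqrt(lambda * m)))
}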
Remark 5.2. Note there exists a duality between increasing the portfolio size and the claims frequency. For example, if we consider risks on a 10-year basis, then the portfolio size decreases by the factor 1/10. Therefore the product $\lambda(x_i) v_i$ in equation (5.4) remains constant when varying the time horizon on which claims frequencies are measured, i.e. increasing $\lambda(x)$ is equivalent to considering a larger portfolio. J
Negative-Binomial Claims Counts:
As introduced in Section 4.3.1, we use negative-binomially distributed claims counts to account for overdispersion in the data. Assume that $N_i \sim \mathrm{NegBin}(\lambda(x_i) v_i, \gamma)$ for $i = 1, \ldots, n$. We can again compute the coefficient of variation, which leads to
$$\mathrm{Vco}\!\left(\frac{N_i}{v_i}\right) = \sqrt{\frac{1}{\lambda(x_i) v_i} + \frac{1}{\gamma}} > \frac{1}{\sqrt{\gamma}}.$$
Hence we see that there is a lower bound for the coefficient of variation of an individual policy imposed by the overdispersion parameter. Similar to the Poisson case, we can form risk classes. Let $N^i_1, \ldots, N^i_{m_i}$ denote the claims counts of the policies within risk class $i$. Assume that $N^i_j \sim \mathrm{NegBin}(\lambda_i v_j, \gamma)$ are independent for $j = 1, \ldots, m_i$ and for all $i$. For negative-binomial claims counts, the analogue to relation (5.5) does not hold true, i.e.
$$N^i := \sum_{j=1}^{m_i} N^i_j \not\sim \mathrm{NegBin}(\lambda_i v^i, \gamma),$$
where $v^i := \sum_{j=1}^{m_i} v_j$ denotes the volume of risk class $i$. However, Lemma 5.3 elaborates the coefficient of variation including an asymptotic relation similar to (5.6).
Lemma 5.3. Let $N^i_1, \ldots, N^i_{m_i}$ denote the claims counts of policies in risk class $i$, where $N^i_j \sim \mathrm{NegBin}(\lambda_i v_j, \gamma)$ and independent for $j = 1, \ldots, m_i$. Assume that the volumes $v_j$ are bounded from below and above, i.e. there exist $\epsilon > 0$ and $M < \infty$ such that $0 < \epsilon \leq v_j \leq M < \infty$. Denote by $N^i = \sum_{j=1}^{m_i} N^i_j$ the total claims count of risk class $i$ with total volume $v^i = \sum_{j=1}^{m_i} v_j$. Then we have the relation
$$\mathrm{Vco}\!\left(\frac{N^i}{v^i}\right) = \sqrt{\frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\gamma}\,\frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2}} \;\to\; 0 \qquad \text{if } m_i \to \infty.$$
The proof of Lemma 5.3 can be found in Appendix A.6. Although we showed that the variability tends to zero, the speed of convergence might be significantly slower than for the Poisson case. However, the next example shows that for our purpose the differences to the Poisson model are only marginal.
Example 5.4. To illustrate the impact of the dispersion parameter $\gamma$, consider the following example. Assuming all policies in the portfolio have a full year at risk, $v_j = 1$, the coefficient of variation is then given by
$$\mathrm{Vco}\!\left(\frac{N^i}{v^i}\right) = \sqrt{\frac{1}{\lambda_i m_i} + \frac{1}{\gamma}\,\frac{1}{m_i}}.$$
Let $\gamma_1 = 1.5$ as estimated for the Motor Insurance Collision Data Set and $\gamma_2 = 0.01$ representing a data set with stronger overdispersion. Assume that the claims frequency in risk class $i$ is $\lambda_i = 0.1$. Figure 5.11 shows the standard errors
$$SE_{\mathrm{NegBin}} = \mathbb{E}(N^i/v^i)\,\mathrm{Vco}(N^i/v^i) = \sqrt{\frac{\lambda_i}{m_i}\left(1 + \frac{\lambda_i}{\gamma}\right)},$$
$$SE_{\mathrm{Poi}} = \mathbb{E}(N^i/v^i)\,\mathrm{Vco}(N^i/v^i) = \sqrt{\frac{\lambda_i}{m_i}},$$
for the negative-binomial and the Poisson case as a function of $m_i$, the number of policies in risk class $i$. This example illustrates that even though overdispersion introduces extra volatility in the data, by forming subportfolios we can significantly reduce the variability. Furthermore, the differences between the Poisson and the negative-binomial model with a dispersion parameter similar to the Motor Insurance Collision Data Set are only marginal, suggesting that we could also directly work with Poisson-distributed claims counts.
Figure 5.11 – Convergence of the standard errors for Poisson and negative-binomial claims counts ($\gamma = 1.5$ and $\gamma = 0.01$) as a function of the number of policies in risk class $i$.
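The standard errors of Example 5.4 can be reproduced with a few lines of R; this is a hedged sketch using the illustrative values $\lambda_i = 0.1$, $\gamma \in \{1.5, 0.01\}$ and all volumes $v_j = 1$:

# Standard errors of Example 5.4 as a function of the number of policies m_i.
lambda <- 0.1
m      <- seq(10, 1000, by = 10)
se_poi <- sqrt(lambda / m)
se_nb  <- function(gamma) sqrt((lambda / m) * (1 + lambda / gamma))
plot(m, se_nb(0.01), type = "l", lty = 2,
     xlab = "Number of Policies in Risk Class i", ylab = "Standard Error")
lines(m, se_nb(1.5), lty = 3)
lines(m, se_poi, lty = 1)
legend("topright", lty = c(2, 3, 1),
       legend = c("Negative-Binomial (gamma = 0.01)",
                  "Negative-Binomial (gamma = 1.5)", "Poisson"))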
Implications of the Data Variability on Tree Predictors
We have seen that the variability is controlled by the volume of the portfolio: increasing the volume leads to a decrease in variability. Next we are going to illustrate the effect of the data variability upon the tree algorithm. To understand why the variability of the data is important to the tree algorithm, consider the situation with only one covariate ($p = 1$), e.g. price of car. Assume that we have recorded data for 4 different covariate values $x_1, x_2, x_3$ and $x_4$. For each value, we have 100 observations. Due to the variability within the data, the measurements will scatter around the true values $\lambda(x_i)$. An illustrative example is shown in Figure 5.12a. The colored points on the y-axis in Figure 5.12b denote the observed values for the different covariate values. The orange lines show the chosen split-points, the red step function denotes an example of a tree predictor and the black line the true underlying model. The aim of the tree algorithm is to group the observations in order to recover the underlying structure. We see that, depending on the variability of the observations, the tree algorithm may be able to distinguish two groups (e.g. blue and red). On the other hand, the larger the variability, the less the tree is able to recover small differences (e.g. red and green).
Figure 5.12 – Illustration of the impact of the data variability on the tree predictor. (a) Variability of observations; (b) recovered structures. The black line indicates the true structure, the brown line shows the tree predictor. The colored lines on the y-axis denote the observed frequencies.
5.3.6 Variable Selection
In this section we demonstrate the ability of the tree algorithm to perform variable selection whilst growing the tree. For that purpose, we have simulated a learning data set using Assumption 4.9. Recall that the covariates are sampled from the Motor Insurance Collision Data Set. In addition to age and price, we have also sampled the policynumber. We then modified the policynumber such that we only keep the last 2 digits; otherwise the policynumber would be correlated with the age covariate, because policynumbers are assigned in ascending order. Since the covariate policynumber is not included in the model equation (4.5), it should not be used as a splitting variable. In order to verify this, we have calculated the variable importance for both the unpruned as well as the minimum CV tree. Figure 5.13 shows that the minimum CV tree does not use the covariate policynumber in any split, neither primary nor surrogate. This finding supports the fact that the tree algorithm is capable of performing variable selection. However, note that this must also be seen in light of Figure 5.12: it may be that a covariate indeed has a systematic impact on the response, but the variability of the data is too large for the tree algorithm to recover this structure.
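A hedged sketch of this check in R is given below; the simulated frequency model is a toy stand-in (not the model (4.5) of Assumption 4.9) and the column names are illustrative. The variable importance of an rpart fit is available in the component variable.importance.

library(rpart)

# Variable-selection check: fit a Poisson tree including a non-informative covariate 'pol'.
set.seed(1)
n   <- 50000
dat <- data.frame(age   = sample(18:90, n, replace = TRUE),
                  price = runif(n, 15, 85),                 # price in 1'000 CHF
                  pol   = sample(0:99, n, replace = TRUE),  # last two digits of the policynumber
                  v     = 1)
dat$N <- rpois(n, 0.1 * (1 + 0.5 * (dat$age < 25)) * dat$v) # toy frequency model

fit    <- rpart(cbind(v, N) ~ age + price + pol, data = dat,
                method = "poisson", control = rpart.control(cp = 1e-4))
pruned <- prune(fit, cp = fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"])
round(100 * pruned$variable.importance / max(pruned$variable.importance))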
Figure 5.13 – Variable importance plots for the tree predictors including the non-informative covariate policynumber (pol). (a) Minimum cross-validation error tree (21 leaves); (b) unpruned tree (104 leaves).

5.3.7 Stability

A major issue when using tree predictors is their variability when changing the training data slightly. In order to analyze this problem, we have simulated 10 learning data sets $\mathcal{L}_k$, $k = 1, \ldots, 10$, from the frequency model (4.5) based on Assumption 4.9. For each data set, we have calculated the tree predictor using the Mid-Point rule, denoted by $\hat\lambda^{MP}(x_1, x_2, \mathcal{L}_k)$, as well as a GLM model with Poisson assumption, denoted by $\hat\lambda^{GLM}(x_1, x_2, \mathcal{L}_k)$. For each method, we estimate the coefficient of variation at $(x_1, x_2)$ by
$$\widehat{\mathrm{Vco}}\big(\hat\lambda^{MP}(x_1, x_2)\big) = \frac{\widehat{\mathrm{sd}}\big(\hat\lambda^{MP}(x_1, x_2, \mathcal{L}_1), \ldots, \hat\lambda^{MP}(x_1, x_2, \mathcal{L}_{10})\big)}{\hat\mu\big(\hat\lambda^{MP}(x_1, x_2, \mathcal{L}_1), \ldots, \hat\lambda^{MP}(x_1, x_2, \mathcal{L}_{10})\big)},$$
where $\widehat{\mathrm{sd}}$ denotes the sample standard deviation and $\hat\mu$ the sample mean. Analogously we proceed for $\hat\lambda^{GLM}$. In Figure 5.14 the results are visualized using a heat map. We observe that the GLM is less affected by changes in the training data and therefore provides a much more stable model. These findings are in line with the considerations made in Section 3.4.2.

Figure 5.14 – Stability comparison in terms of the coefficient of variation of the tree predictor and a GLM. (a) Mid-Point rule tree; (b) GLM.
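A hedged sketch of this stability experiment follows; simulate_portfolio() and fit_midpoint_tree() are placeholders for the simulation according to Assumption 4.9 and the tree fitting with Mid-Point pruning described above, so the code only illustrates the estimation of the coefficient of variation on a fixed grid.

# Estimate the coefficient of variation of the fitted frequency over repeated fits.
grid <- expand.grid(age = 18:90, price = seq(15, 85, by = 1))   # price in 1'000 CHF

preds <- sapply(1:10, function(k) {
  learn <- simulate_portfolio()          # learning sample L_k (placeholder)
  tree  <- fit_midpoint_tree(learn)      # rpart fit pruned by the Mid-Point rule (placeholder)
  predict(tree, newdata = grid)          # fitted frequencies on the grid
})

grid$vco <- apply(preds, 1, sd) / rowMeans(preds)   # sample sd divided by sample mean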
5.4 Univariate Regression Tree for Real Data Claims Frequency
In a first step, we fit univariate regression trees to the Motor Insurance Collision Data Set. There
are two main reasons for starting with only one covariate. First of all, the restriction to one covariate
enables us to understand the splits in detail. Secondly, the results can be used to verify the risk
groups shown in Section 4.2.2. The data used for the analysis was preprocessed according to Data
Set A, see Section 4.2.1. We present the detailed analysis for the covariate age of driver. For the
other covariates, we use these findings and present the results directly. For the Poisson method we
use a shrinkage parameter k = 0.87 as derived in Section 5.3.1. Since we can plot the predicted
values, we base the decision of optimal tree size on visual inspection by understanding the relevance
of the chosen splits as well as the underlying volumes. The analysis shows that there is not one
general rule that selects the optimal tree size. In practice, several choices should be tested.
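A hedged sketch of the fitting call for the Poisson method is shown below; dataA is an assumed name for the preprocessed Data Set A with claims counts N, years at risk v and the covariate age, and the shrinkage parameter is the value k = 0.87 derived in Section 5.3.1.

library(rpart)

# Univariate Poisson regression tree for the covariate age of driver (illustrative call).
fit_age <- rpart(cbind(v, N) ~ age, data = dataA,
                 method  = "poisson",
                 parms   = list(shrink = 0.87),
                 control = rpart.control(cp = 1e-6, xval = 10))

plotcp(fit_age)    # relative cross-validation error, cf. Figure 5.15
printcp(fit_age)   # cp table underlying the 1se / elbow / Mid-Point / minimum CV choices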
Tariff Criterion 1: Age of Driver
We start by fitting a separate tree for each method: Anova, Weighted Anova and Poisson. Figure 5.15 shows the cross-validation results obtained from the plotcp function. The weights for weighted anova were chosen to be $w_i = v_i$, resulting in the model introduced in Example 3.17.
Figure 5.15 – Relative cross-validation error results including one standard error whiskers for the methods (a) Anova, (b) Weighted Anova and (c) Poisson.
There are two important observations from Figure 5.15:
1. The results for the methods anova and weighted anova are heavily distorted by the large
standard errors. Figure 5.16 shows the same relative cross-validation plot with the standard
error whiskers being removed. The absolute improvements for the methods weighted anova
and Poisson are rather small relative to the root tree. This is similar to the observations in
Section 5.3.2 and can be explained by the unbalanced nature of the data. Overall the weighted
anova and the Poisson method share a similar shape, compare Figures 5.15c and 5.16b. For
the anova method, we observe an increasing cross-validation error curve, meaning that the tree
is not able to recover any structure from the data. This can be explained by the fact that the
sum of squares measure does not take the underlying volume into account. Based on these
findings, we will not further elaborate on the anova method.
Figure 5.16 – Relative cross-validation error results without one standard error whiskers for the methods (a) Anova and (b) Weighted Anova.
2. We have seen in Section 5.3.3 that for unbalanced data the variance of the cross-validation results can be reduced using stratified cross-validation. Due to Remark 5.1, we know that the cross-validation results for the methods anova and weighted anova are already obtained using stratification. However, for the Poisson method, the rpart package does not apply the stratification scheme correctly. In Figure 5.17 the stratified cross-validation errors for the Poisson method including the one standard error whiskers are shown; a sketch of such a stratified computation is given after Figure 5.17. The colored vertical lines represent different choices of tree size.
Figure 5.17 – Stratified cross-validation error results for the univariate Poisson regression tree predicting the expected claims frequency using the covariate age of driver. The colored vertical lines indicate the 1se, Elbow, Mid-Point and minimum CV choices of tree size.
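The following is a hedged sketch (not the code of the thesis) of how stratified cross-validation errors can be obtained for the Poisson method: explicit fold labels are built so that policies with and without claims are spread evenly over the folds, and xpred.rpart() is called with these groups. fit_age and dataA are the objects from the fitting sketch above.

# Stratified cross-validation for the Poisson tree via explicit fold assignments.
set.seed(1)
k     <- 10
folds <- integer(nrow(dataA))
folds[dataA$N == 0] <- sample(rep(1:k, length.out = sum(dataA$N == 0)))
folds[dataA$N >  0] <- sample(rep(1:k, length.out = sum(dataA$N >  0)))

xpred <- xpred.rpart(fit_age, xval = folds)   # cross-validated frequencies, one column per cp value

cv_err <- apply(xpred, 2, function(lam) {     # Poisson deviance statistics per subtree
  mu <- lam * dataA$v
  sum(ifelse(dataA$N > 0, dataA$N * log(dataA$N / mu), 0) - (dataA$N - mu))
})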
We conclude that the cross-validation results for the methods weighted anova and Poisson have a similar structure. However, using the weighted squared error as a loss function results in large standard errors (even when using stratified sampling), making the use of the one standard error rule questionable. Nevertheless, we could use the elbow or minimum cross-validation error rule for the weighted anova method. Figure 5.18 shows the fitted values for different subtrees according to the tree size selection criteria discussed in Section 5.3.3. We see that the trees are able to capture the characteristics observed in the univariate analysis in Section 4.2.2. Also note that the fitted trees for the two methods are very similar. In Figure 5.19 the elbow criterion trees for both methods are shown. We see that the only difference is the split at age < 78, for which the weighted anova method chooses to split at age < 80 instead. Based on these observations, we conclude that the methods weighted anova and Poisson result in very similar predictions. Since the deviance measure provides better cross-validation results in terms of variability, we will focus on the Poisson method for the remainder.

Figure 5.18 – Fitted values for different choices of tree size. (a) Method: Poisson; (b) Method: Weighted Anova.
Remark 5.5. Using the Poisson deviance statistics during cross-validation for the weighted anova method does not provide an alternative. In general, it can happen that the fitted frequency $\hat\lambda = 0$ within some leaves. In that case, the deviance statistics is not well-defined. J
The predictions obtained by the minimum cross-validation error tree show clear signs of overfitting the data for ages between 40 and 55. The one standard error tree does not seem to capture the whole underlying structure, which is in line with the results obtained on simulated data in Section 5.3.3. Visual inspection suggests that the trees derived from the Mid-Point rule or the elbow criterion provide the best fits. Finally, Figure 5.20 compares the fitted values of the Poisson trees based on the elbow criterion and the Mid-Point rule with respect to the underlying volumes.
Figure 5.19 – Trees based on the elbow criterion for the methods weighted anova and Poisson. (a) Method: Poisson; (b) Method: Weighted Anova.
Figure 5.20 – Risk groups with underlying volumes chosen by the tree algorithm. (a) Tree size: elbow criterion; (b) tree size: Mid-Point rule.
Tariff Criterion 2: Nationality

The second tariff criterion we analyze using univariate regression trees is the covariate nationality. Due to the reasons explained in the analysis of the covariate age of driver, we always use the Poisson
method in the sequel. In total there are 160 different nationalities within the Motor Insurance Collision Data Set. However, there are also many exotic nationalities that contain only very few observations. Therefore, instead of one accounting year, we use the accumulated data from 2010 up to 2014 for this analysis. Nevertheless, there still remain nations, such as Macao, where we have a total volume of less than one year at risk. We merge these nationalities with their neighboring country, in this case China. In areas where we do not have any country with a large enough risk exposure, we form geographic groups; these are Northern Africa, Sub-Saharan Africa, Central America and the Middle East. We proceed with the merging until each nationality has a total risk exposure of at least 100 years at risk over the period of 5 years (a sketch of this merging step is given at the end of this subsection). We then use the tree algorithm to determine the risk groups. Figure 5.21 shows the cross-validation results with possible choices of the tree size.

Figure 5.21 – Stratified cross-validation results for the univariate regression tree predicting the expected claims frequency using the covariate nationality.

Figure 5.22 shows
the fitted values of the tree predictor based on the elbow criterion as well as the Mid-Point rule. We see that the tree with four leaves has sufficient volume in each leaf, hence there is no need to further prune the tree towards the 1se tree. For the Mid-Point rule tree, nationality groups 5 to 7 have a rather small volume, suggesting that an even larger tree would only be reasonable if the first risk group is further split. However, the large volume of the first risk group is mainly explained by the fact that it contains the Swiss drivers. Comparing the results to the risk groups chosen by AXA shows that the tree predictor is able to group categorical variables with many levels and small underlying volumes into risk classes with larger volumes. This provides a large potential for applications of the tree algorithm. Insurance companies are regularly faced with the problem of forming risk groups of categorical variables without an a priori ordering between the levels. Instead of tedious manual work, the tree algorithm is an alternative to obtain a risk-based grouping.
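As referenced above, a hedged sketch of the exposure-based merging of rare nationality levels could look as follows; dat is assumed to contain the nationality nat (as character) and the years at risk v, and region_of() is a placeholder lookup mapping a nationality to its neighbouring country or geographic group (e.g. Macao to China, or to "Middle East").

# Merge nationality levels until every level has at least 100 years at risk.
merge_rare_levels <- function(dat, min_exposure = 100) {
  repeat {
    expo <- tapply(dat$v, dat$nat, sum)        # total years at risk per level
    rare <- names(expo)[expo < min_exposure]
    if (length(rare) == 0) break
    idx  <- dat$nat %in% rare
    dat$nat[idx] <- region_of(dat$nat[idx])    # merge into the larger group (placeholder lookup)
  }
  dat$nat <- factor(dat$nat)
  dat
}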
Figure 5.22 – Risk groups with underlying volumes chosen by the tree algorithm. (a) Tree size: elbow criterion; (b) tree size: Mid-Point rule.
Tariff Criterion 3: Price of Car

The next tariff criterion we analyze is the covariate price of car. Figure 5.23 provides the cross-validation results as well as the fitted values corresponding to the pruned subtrees. The different tree size selection criteria are indicated by the colored vertical lines.
Figure 5.23 – Stratified cross-validation results as well as fitted values for the univariate regression tree predicting the expected claims frequency using the covariate price of car. (a) Cross-validation error; (b) fitted values for different tree sizes.
The minimum CV tree shows signs of overfitting at 22'000 CHF as well as at 63'000 CHF. An interesting feature of both the Mid-Point and the minimum CV tree is the additional split for very expensive cars above 175'000 CHF. This may suggest that drivers whose car price exceeds a certain threshold start to drive more carefully. Another explanation might be that these are cars, e.g. vintage cars, which are only driven at the weekend, i.e. there may be a correlation between price and another covariate such as kilometres yearly driven for these expensive cars. The elbow criterion tree is a subtree of the Mid-Point tree having one leaf less. However, since the Mid-Point tree does not show any sign of overfitting, we did not add the corresponding fitted values in Figure 5.23b. Finally, Figure 5.24 illustrates the chosen risk classes with underlying volumes for the 1se and Mid-Point rule trees. Both trees have reasonably large risk classes, suggesting that the Mid-Point rule is an appropriate choice for the tree size in this case.
Figure 5.24 – Risk groups with underlying volumes chosen by the tree algorithm. (a) Tree size: 1se rule; (b) tree size: Mid-Point rule (price groups in 1'000 CHF).
Tariff Criterion 4: Kilometres Yearly Driven

Finally, we consider the criterion kilometres yearly driven. Figure 5.25 shows both the cross-validation results as well as the fitted tree predictors for different choices of tree size. Note that the elbow criterion coincides with the Mid-Point rule. The minimum CV tree shows clear signs of overfitting. On the other hand, the 1se tree only uses one split. This can be interpreted in the sense that, instead of using the numerical covariate kilometres yearly driven, we could use a categorical covariate frequent driver. According to the 1se rule, a frequent driver would drive more than 18'000 km per year. On the other hand, the elbow criterion suggests a step-wise increase of the claims frequency in the amount of kilometres yearly driven. Based on the findings in Section 5.3.3 of the 1se rule being consistently
over-pessimistic, we suggest the elbow criterion as the optimal choice in this case. Figure 5.26 shows
the fitted values with respect to the underlying volumes.
Figure 5.25 – Stratified cross-validation results as well as fitted values for the univariate regression tree predicting the expected claims frequency using the covariate kilometres yearly driven. (a) Cross-validation errors; (b) fitted values for different tree sizes.
Figure 5.26 – Risk groups with underlying volumes chosen by the tree algorithm. (a) Tree size: 1se rule; (b) tree size: elbow criterion (KM groups in 1'000 km).
5.5 Bivariate Regression Tree for Real Data Claims Frequency
In this section, we analyze the fits of the tree algorithm obtained from the Motor Insurance Collision Data Set using the covariates age of driver and price of car. Both have been shown to be important in the univariate analysis in Section 4.2.2. The restriction to two covariates enables us to understand the splits and interactions in detail. For the analysis we have prepared the data of accounting year 2010 according to Data Set A described in Section 4.2.1. Having obtained a reasonable fit, we then compare the predictions with the GLM model used by AXA on the out of sample data of 2011 until 2014.
5.5.1 Tree Size Selection
Figure 5.27 shows the stratified cross-validation results of the fitted Poisson regression trees. Possible tree sizes vary between 12 leaves when using the 1se rule and 28 for the minimum CV error rule. Applying the elbow criterion seems to be rather delicate, since there is no clear point beyond which the cross-validation error stops decreasing. Probably one would choose a tree size between the 1se and the Mid-Point rule.
Figure 5.27 – Cross-validation results for the bivariate regression tree.
Figure 5.28 shows the fits obtained for different choices of tree size. In addition to the cross-validation-based methods for choosing the tree size, we also show the tree predictor resulting from statistical significance pruning, see Section 5.3.4. The Wilcoxon test with significance level $\alpha = 5\%$ was used. The resulting tree has a total of 18 leaves (again similar to the Mid-Point rule). The captured structure of all methods is close to what we expect from the simulated data in Section 4.3.1. However, instead of visual inspection, we will use out of sample testing to validate the goodness of fit in Section 5.5.3.
Figure 5.28 – Fitted values for age $\in [18, 90]$ and price $\in [150000, 850000]$ of the bivariate tree predictors for different choices of tree size. (a) Tree size: 1se; (b) tree size: Mid-Point rule; (c) tree size: minimum CV; (d) tree size: significance pruning.
From Figure 5.28 we observe that the trees include multiple interactions. For example, when considering the 1se tree, we see that for cheap cars below 20'000 CHF, the age covariate is only used for young drivers below 36 years.
5.5.2 Comparison: Tree and GLM
The standard statistical models used by the majority of insurance companies for the pricing of non-life insurance products are generalized linear models (GLM). Therefore, we are interested in comparing the tree predictors to a GLM model. Usually the GLM models are applied in a categorical way where the risk classes need to be determined a priori. The performance of the GLM, however, strongly depends on the choice of reasonable groups. Very small groups may heavily distort the predictions. Furthermore, variable selection needs to be done in advance or in a step-wise manner. Inclusion of irrelevant covariates also negatively affects the performance of a GLM. However, since both age and price are highly significant for predicting the claims frequency, this is not an issue in this section. The fitted GLM used for comparison with the tree predictors is based on the risk classes used by AXA. In total, these are 8 risk classes for the covariate age and also 8 risk classes for the covariate price. Graphically, a GLM predictor can also be seen as a piecewise constant function, fitting a constant in each region specified by the risk classes.
In order to understand the differences between the tree predictor and the GLM model, we fit a bivariate GLM model using AXA's risk groups and calculate the difference between the fitted values of the GLM and the tree predictors. Denote by $\hat\lambda^{tree}$ the tree predictor and by $\hat\lambda^{GLM}$ the GLM fit. Figure 5.29 shows the difference
$$\hat\lambda^{tree}(\mathrm{age}, \mathrm{price}) - \hat\lambda^{GLM}(\mathrm{age}, \mathrm{price}), \qquad (5.7)$$
as well as the relative difference
$$\frac{\hat\lambda^{tree}(\mathrm{age}, \mathrm{price}) - \hat\lambda^{GLM}(\mathrm{age}, \mathrm{price})}{\hat\lambda^{GLM}(\mathrm{age}, \mathrm{price})}, \qquad (5.8)$$
for age $\in [18, 90]$ and price $\in [150000, 850000]$. Within the blue areas the tree predictor fits larger values than the GLM, and in the red areas vice versa. It reveals that for large parts the tree and the GLM model provide very similar fits. The tree is more concerned about older drivers above 80 years. On the other hand, for young drivers with expensive cars we observe larger frequencies within the GLM model. This is because the GLM does not include interaction terms: it models factors for the price classes and the age classes separately. In the univariate analysis we have seen that both young age and an expensive car indicate a high frequency. Therefore, in the GLM these drivers are considered extremely high risk.
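A hedged sketch of this comparison grid is given below; fit_tree (an rpart Poisson tree) and fit_glm (a Poisson GLM with log-link) are assumed names for the two bivariate models, and it is assumed that the grouping of age and price into AXA's risk classes is applied inside the GLM formula or to the grid beforehand (not shown).

# Differences (5.7) and (5.8) between the tree and the GLM on an (age, price) grid.
grid   <- expand.grid(age = 18:90, price = seq(20000, 80000, by = 1000))
grid$v <- 1                                        # frequencies per one year at risk

lambda_tree <- predict(fit_tree, newdata = grid)
lambda_glm  <- predict(fit_glm,  newdata = grid, type = "response")

grid$diff     <- lambda_tree - lambda_glm          # difference (5.7)
grid$rel_diff <- grid$diff / lambda_glm            # relative difference (5.8)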
Figure 5.29 – Comparison of predicted claims frequencies: tree versus GLM. (a) Difference (5.7); (b) relative difference (5.8).
5.5.3 Out of Sample Performance
When comparing the methods with respect to the cross-validation error, we need to be careful, since we used cross-validation for model selection in the pruning process. In order to avoid this issue, we compare the different methods out of sample. All methods were fit to the data of 2010, which was preprocessed according to Data Set A. In addition to the tree models, we fit a GLM model with a Poisson assumption and the risk groups used by AXA. On top, we also compare the methods to the Null model, which predicts the overall portfolio average for all observations. Based on the findings of Section 5.3.3, we calculate the mean out of sample deviance for each method in order to evaluate the performance. The mean sample deviance is given by
$$\frac{2}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} \left( N_i \log\!\left(\frac{N_i}{\hat\lambda(x_i) v_i}\right) - \big(N_i - \hat\lambda(x_i) v_i\big) \right), \qquad (5.9)$$
where $N_1, \ldots, N_n$ denote the claims counts in the out of sample test sets, for the models:

1se: Tree predictor using the 1se rule.
Mid-Point: Tree predictor using the Mid-Point rule.
Min CV: Tree predictor using the minimum CV error rule.
Stat: Tree predictor based on statistical significance pruning.
GLM: Generalized linear model with Poisson assumption.
Null: Model predicting the overall portfolio average for all policies.
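A hedged sketch of this out-of-sample comparison is the following; test is an assumed data frame with the 2011-2014 policies (columns year, N, v and the covariates), and models is an assumed named list of functions that return the fitted frequency for each policy of a data set.

# Mean out of sample deviance (5.9) per accounting year and model.
mean_oos_deviance <- function(lambda_hat, N, v) {
  mu <- lambda_hat * v
  2 * sum(ifelse(N > 0, N * log(N / mu), 0) - (N - mu)) / sum(v)
}

results <- sapply(models, function(predict_fun)
  sapply(split(test, test$year), function(d)
    mean_oos_deviance(predict_fun(d), d$N, d$v)))
results   # one row per accounting year, one column per model, cf. Table 5.1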
For the out of sample testing we only consider observations with a duration equal to one. The results are shown in Table 5.1. Similar to the observations in Section 5.3.3, the relative differences
Mean Out of Sample Deviance

Year        2011       2012       2013       2014
1se         0.7231008  0.6912631  0.7767323  0.7815325
Mid-Point   0.7227214  0.6911108  0.7764441  0.7812819
Min CV      0.7226967  0.6910413  0.7764235  0.7811854
Stat        0.7228365  0.6911267  0.7765412  0.7813254
GLM         0.7227862  0.6911124  0.7763417  0.7812630
Null        0.7271414  0.6956903  0.7811100  0.7860495

Table 5.1 – Mean out of sample deviance (5.9) for accounting years 2011 to 2014. All methods were fit based on the data of accounting year 2010. The best method for each year is highlighted in bold face.
between the models are relatively small. However, all models outperform the Null model. In order to visualize the differences between the models, Figure 5.30 shows the improvements in mean sample deviance compared to the Null model.

Figure 5.30 – Improvements of the mean out of sample deviance versus the Null model for the bivariate claims frequency analysis.

We see that the 1se tree consistently performs worst. This
is in line with our observations of the simulation study in Section 5.3.3. Except for accounting year 2013, the minimum CV tree has a smaller mean sample deviance than the GLM model. However, based on the variability between different years, we conclude that both methods (minimum CV tree and GLM) have a similar performance.
5.6 Multivariate Regression Tree for Real Data Claims Frequency
In a next step, we add further covariates to the model. We first consider all variables that are
also used by AXA in their GLM model, i.e. a total of 11 covariates (see Table 5.2). The data is
preprocessed according to both Data Set A and B described in Section 4.2.1. Note that due to the
larger number of covariates, Data Set A may still contain a significant amount of outliers due to
small durations. Hence we fit two tree predictors:
1. Tree based on Data Set A, denoted by ’Tree (A)’,
2. Tree based on Data Set B, denoted by ’Tree (B)’.
Comparing the cross-validation results, shown in Figure 5.31, reveals that using Data Set B results in larger trees. Possible tree sizes range from 40 leaves (Tree (A), 1se) up to 75 leaves (Tree (B), Min CV).

Figure 5.31 – Stratified cross-validation error results for the multivariate tree predictors based on Data Set A and B. (a) Data Set A; (b) Data Set B.

Figure 5.32 shows the variable importance plots for the minimum CV trees indicated by the
black vertical lines in Figure 5.31. It shows a very similar structure for both trees. The interpretation of the variable importance should be done with caution. As discussed in Section 3.4.2, trees may look rather different when changing the learning sample slightly. For a more robust measure of variable importance, we should consider the average of many trees, as done in bagging, see Section 6.3.2. Nevertheless, the results are in line with the significance levels that the covariates show in the GLM model analysis.
Figure 5.32 – Variable importance for the minimum CV tree predictors on Data Set A and B. (a) Data Set A; (b) Data Set B.
In addition to the tree models above, we fit a further tree predictor, denoted by 'Tree Plus', containing the additional variables Office, Horse Power, Policy Age and Sex. Table 5.2 summarizes the covariates included in the different models.

Variables
Model:   GLM          Tree         Tree Plus
         Age*         Age          Age
         Price*       Price        Price
         Nat*         Nat*         Nat*
         Km*          Km           Km
         Weight*      Weight       Weight
         Bonus        Bonus        Bonus
         Leasing      Leasing      Leasing
         Car Class    Car Class    Car Class
         Deductible   Deductible   Deductible
         Region       Region       Region
         Car Age*     Car Age      Car Age
                                   Office
                                   Horse Power
                                   Policy Age
                                   Sex

Table 5.2 – The star (*) indicates that the variable was grouped in advance. The grouping was chosen to optimize the performance of the GLM, except for Nationality where we used the grouping of AXA for both GLM and Tree predictors.

In Figure 5.33 we see that the cross-validation results for the Tree
Plus predictor are very similar to the previous models. However, there are several changes with respect to the variable importance, shown in Figure 5.34. As before, the variable importance measure must be interpreted with caution, due to the tree instability but also due to correlations among the covariates.

Figure 5.33 – Stratified cross-validation error results for the multivariate tree predictor containing additional covariates.

Figure 5.34 – Variable importance and cross-validation results for the Tree (A) Plus predictor. (a) Tree (A) Plus; (b) Tree (A) Plus without Horse Power.

Some of the additional covariates, such as Horse Power and Price or Policy Age and Age, are highly
correlated (see Appendix, Figure B.1), meaning they measure a similar effect. For example, the importance of the covariate Horse Power is now shared with the covariate Price. Although the variable importance measure of Definition 3.21 takes surrogate splits, i.e. correlations, into account, removing either covariate may increase the importance of the remaining one. This is supported by Figure 5.34b, which shows the variable importance of the predictor Tree (A) Plus where the covariate Horse Power was excluded. Overall this results in an increase of the variable importance of the remaining covariate Price. Therefore further research is needed to analyse to what extent the variable importance according to Definition 3.21 is influenced by correlated covariates. However, since the importance of Policy Age is significantly larger than that of Age, Figure 5.34a suggests using Policy Age rather than Age in the GLM model. We conclude that the variable importance measure can also be used for variable selection. In particular, if the number of covariates becomes much larger, the tree algorithm and associated variable importance may help to determine which variables
to include in a GLM model.
5.6.1 Out of Sample Performance
Analogously to the bivariate case in Section 5.5.3, we evaluate the performance of the above predictors
out of sample. The validation data set contains all observations of accounting years 2011 to 2014
according to Data Set A with a full duration of one year at risk. The mean out of sample deviance
(5.9) for each accounting year and method is shown in Table 5.3. Figure 5.35 shows the improvements
Mean Out of Sample Deviance
Year              2011       2012       2013       2014
Min CV (A)        0.7196545  0.6863301  0.7702547  0.7737576
1se (A)           0.7199627  0.6870283  0.7708778  0.7742580
Mid-Point (A)     0.7198493  0.6866617  0.7706358  0.7740121
Min CV Plus (A)   0.7200256  0.6868270  0.7706587  0.7740368
Min CV (B)        0.7196465  0.6864650  0.7701494  0.7738239
1se (B)           0.7205028  0.6873175  0.7707433  0.7744600
Mid-Point (B)     0.7199539  0.6867904  0.7703110  0.7738522
GLM (A)           0.7172632  0.6843318  0.7688065  0.7715159
Null (A)          0.7290833  0.6957688  0.7808736  0.7842622

Table 5.3 – Mean out of sample deviance for numerous multivariate models. The capital letters (A) and (B) indicate the data set which was used for fitting the method. The best method for each year is highlighted in bold face.
in mean out of sample deviance compared to the Null model. The GLM model performs consistently best for all years. For the tree predictors we again observe that the 1se rule is not able to fit a tree large enough to capture all relevant features and therefore performs consistently worst. The tree predictor Min CV Plus (A), derived from the model with additional covariates, performs slightly worse than the corresponding fit with fewer variables. Unfortunately, in contrast to the bivariate fit, the tree predictors are not able to compete with the GLM model. A possible explanation may be a large increase in variance of the tree predictor when including more explanatory variables. This is supported by the observation that the predictor Min CV Plus (A) performs worse than Min CV (A) and Min CV (B).
5.7 Summary
In the present chapter we applied the tree algorithm both to simulated data and to the Motor Insurance Collision Data Set of AXA Winterthur. In Section 5.3.2, we observed that, due to the unbalanced nature of our data, the improvements of the optimal tree predictor over the Null model are relatively small. However, using stratified cross-validation, we managed to significantly decrease the uncertainty estimates of the cross-validation error. This is crucial since the most common selection rule for the tree size, the 1se rule, takes both the cross-validation error and its uncertainty estimate into account.
[Figure: improvement versus the Null model per accounting year (2011 to 2014) for Min CV (A), 1SE (A), Mid-Point (A), Min CV (B), 1SE (B), Mid-Point (B), Min CV (Plus) and GLM]
Figure 5.35 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis.
Using simulated data, we were able to compute the optimal tree size with respect to the absolute frequency deviation (5.1). The analysis showed, even when using stratified sampling, that the one standard error rule is consistently over-pessimistic regarding the tree size. The simulation suggested that on average the minimum cross-validation error tree provides the best fit. However, note that a large variability in the chosen tree size is associated with the Min CV rule, see Figure 5.9. Our simulation also showed that the out of sample deviance provides very similar performance results to the absolute frequency deviation. This is of great importance since the absolute frequency deviation is not available if the true underlying model is not known. Analyzing the Motor Insurance Collision Data Set, we observed that the Anova method is not able to correctly handle observations without a full year at risk. On the other hand, the methods Poisson and Weighted Anova meaningfully include the underlying volumes and also provide very similar fits. Since the Poisson method can be used in combination with the Poisson deviance, we conclude that for our data the Poisson method is the preferable choice. Using the tree algorithm to fit univariate regression trees, we were able to form subportfolios with similar risk characteristics, see Section 5.4. This has many potential applications in practice, in particular when faced with unordered covariates for which no structure is known a priori. In Section 5.6, we fit a multivariate regression tree for the claims frequency. However, using the out of sample deviance as a measure of performance, we conclude that a multivariate tree predictor is not able to compete with a classical GLM model in terms of predictive capacity.
Chapter 6
Data Analysis: Bagging
In this chapter, we apply the tree ensemble method bagging to insurance claims data. We use the
same structure as in the previous chapter: In the beginning, we analyze the algorithm using simulated
data and in a second step we apply the method to the Motor Insurance Collision Data Set.
6.1 Fitting a Bagged Tree Predictor in R
A bagged tree predictor can be obtained straightforwardly from the rpart package. For that purpose, we have written a wrapper function, which fits B regression trees, each to a bootstrap sample of the data, and returns a list storing all B predictors. The function can be used analogously to the rpart function with two additional parameters: maxnodes, for the individual tree size, and ntree, for the number of bagged tree predictors to be fit. The corresponding R code can be found in Appendix C.4. We can fit a bagged tree predictor using the function bagging as follows:
bagged <- bagging(cbind(v, N)~ x1+...+xp, data = data,
method="poisson", parms = list(shrink = 3.5),
control=rpart.control(minbucket=4000, cp=0.0002),
maxnodes = 8, ntree = 50)
Remark 6.1. Note that the cp parameter must be chosen small enough, such that the unpruned tree has at least maxnodes leaves.
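The wrapper itself is listed in Appendix C.4 and is not reproduced here. A minimal sketch of the underlying idea could look as follows; the function name bagging and the class used for prediction are chosen to match the call above, while the pruning step to maxnodes leaves is only one possible implementation.

library(rpart)

# Sketch of a bagging wrapper: each of the ntree trees is grown on a
# bootstrap sample of the learning data and cut back to at most
# 'maxnodes' leaves.
bagging <- function(formula, data, maxnodes = 8, ntree = 50, ...) {
  trees <- vector("list", ntree)
  for (b in 1:ntree) {
    boot <- data[sample(nrow(data), replace = TRUE), ]   # bootstrap sample
    fit  <- rpart(formula, data = boot, ...)
    # prune to the largest subtree with at most 'maxnodes' leaves
    nleaves <- fit$cptable[, "nsplit"] + 1
    cp.opt  <- fit$cptable[max(which(nleaves <= maxnodes)), "CP"]
    trees[[b]] <- prune(fit, cp = cp.opt)
  }
  structure(list(trees = trees), class = "bagging")
}

# The bagged prediction is the average over the individual trees,
# approximating the bootstrap expectation of equation (3.16).
predict.bagging <- function(object, newdata, ...) {
  rowMeans(sapply(object$trees, predict, newdata = newdata))
}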
In order to fit a bagged tree predictor, we need to determine two additional parameters: maxnodes
and ntree. Regarding the number of bootstrapped tree predictors (ntree), larger values provide a
better approximation of the bootstrap expectation of equation (3.16). Therefore the choice of ntree
is only limited by computing time. The size of the individual trees on the other hand is more directly
linked to the quality of the fit. From Section 3.4.1, we know that the benefit of bagging results purely
from a variance reduction. Since the bagged tree predictor has the same bias as an individual tree,
we can use our understanding of the regression tree methodology (see Chapter 3) for an optimal
choice of maxnodes. The choice of both parameters in the context of insurance claims count data is
discussed in greater detail in the next section.
6.2 Bagging for Simulated Claims Frequency
Analogously to Section 5.3, we start the analysis on a simulated data set. We use the model introduced in Section 4.3 with the parameters given in Assumption 4.9.
6.2.1 Tree Size of Individual Trees
First, we investigate the impact of the tree size on the bagged tree predictor. For that purpose, we have generated a learning sample L using Assumption 4.9. The first predictor was fit using maxnodes = 5, the second using maxnodes = 50 and the third using maxnodes = 100. For all cases we have used ntree = 100. Figure 6.1 summarizes the results and indicates that maxnodes = 5 is too small: the bagged predictor is not capable of capturing the whole shape of the underlying structure. For maxnodes = 50, the resulting fit is much closer to the true function. When increasing maxnodes to 100, we see that the predictor becomes more wiggly, which we interpret as a sign of overfitting. The reason for this behavior is that improvements through bagging are based purely on a variance reduction. As shown in Section 3.4.1, the bias of the bagged predictor equals the bias of a single tree. Considering the mean squared error as a measure of goodness, we should choose maxnodes such that the bias of an individual tree is minimized in order to obtain an optimal bagged tree predictor. Recall from Proposition 2.2 that we have the following decomposition,
\[
E\big[(Y - \hat{Y})^2\big] = E\Big[\mathrm{Bias}\big(\hat{f}(X,\mathcal{L}) \mid X\big)^2\Big] + E\Big[\mathrm{Var}\big(\hat{f}(X,\mathcal{L}) \mid X\big)\Big] + \mathrm{Var}(\varepsilon). \qquad (6.1)
\]
In Chapter 3, we have seen that a tree predictor with roughly 30 leaves minimizes the generalization error. Due to the bias-variance trade-off and the fact that the generalization error of the individual tree itself splits into bias and variance according to equation (6.1), we expect the optimal maxnodes parameter to be slightly larger than 30. The reason for a choice slightly larger than the minimum cross-validation error tree is the bias-variance trade-off in the number of leaves, where we expect the bias to decrease and the variance to increase when growing a larger tree.
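The comparison of this section can be sketched as follows, reusing the bagging() wrapper sketched in Section 6.1; the learning sample L with columns v, N and age is assumed to be simulated according to Assumption 4.9, and the rpart control values are illustrative only.

# Bagged tree predictors for maxnodes = 5, 50 and 100 (ntree = 100 each).
fits <- lapply(c(5, 50, 100), function(m)
  bagging(cbind(v, N) ~ age, data = L, method = "poisson",
          control = rpart.control(minbucket = 100, cp = 1e-5),
          maxnodes = m, ntree = 100))

# Predicted claims frequency over the age range for each choice of maxnodes.
grid <- data.frame(age = 18:90)
lambda.hat <- sapply(fits, predict, newdata = grid)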
[Figure: fitted claims frequency versus age of driver for (a) maxnodes = 5, ntree = 100, (b) maxnodes = 50, ntree = 100, (c) maxnodes = 100, ntree = 100]
Figure 6.1 – Simulation Results: Bagged tree predictors for different values of maxnodes.
6.2.2 Number of Trees
Theoretically, we can choose ntree as large as we want. However, limitations are imposed by computing time. In order to understand the dependency on the number of bagged trees, we repeated the analysis of the previous section for ntree ∈ {10, 100, 200}. The fitted predictors are shown in Figure 6.2. It shows that for our data set ntree = 100 seems to be reasonably large.
[Figure: fitted claims frequency versus age of driver for ntree ∈ {10, 100, 200}; left panel maxnodes = 5, right panel maxnodes = 50]
Figure 6.2 – Impact of parameter ntree on the bagged tree predictor.
6.2.3 Out of Sample Performance
In this section, we compare the predictive power of the bagged tree predictor to the single tree predictor using the out of sample deviance statistics introduced in Section 5.3.3. We have generated 10 independent training data sets L_k for k = 1, . . . , 10 using Assumption 4.9 and an independent validation data set with n = 100000000 observations. Based on the training data sets we have fit four predictors each:
1. Bagged Tree: A bagged tree predictor using maxnodes = 50 and ntree = 100.
2. Min CV Tree: A pruned tree predictor according to the minimum CV error rule using stratified
cross-validation.
3. 1se Tree: A pruned tree predictor according to the 1se rule using stratified cross-validation.
4. GLM Model: A GLM model with Poisson assumption based on AXA's risk classes.
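The out of sample comparison relies on the Poisson deviance statistics; a minimal sketch of its computation is given below, with hypothetical column names N (claims count) and v (duration) for the validation data, and predict() applied to any of the four predictors above.

# Mean out of sample Poisson deviance for predicted frequencies lambda.hat
# on a validation set with claims counts N and durations v; the convention
# 0 * log(0) = 0 is handled explicitly. The total deviance is the sum.
pois.deviance <- function(N, v, lambda.hat) {
  mu <- lambda.hat * v
  mean(2 * (ifelse(N > 0, N * log(N / mu), 0) - (N - mu)))
}

# e.g. pois.deviance(valid$N, valid$v, predict(bagged, newdata = valid))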
The resulting deviances are shown in Figure 6.3. The simulation study shows that bagging not only improves the accuracy of the tree predictor, but also reduces the variability of the out of sample deviance statistics. Note that bagging results in a deviance reduction even though in general we could
only prove this for the mean squared error, see Section 3.4.1. Additionally, we observe that whereas the GLM outperforms the classical regression trees, it is itself outperformed by the bagged tree predictor.
[Figure: out of sample deviance boxplots for Bagging, Min CV, 1se and GLM]
Figure 6.3 – Performance of the bagged tree predictor in comparison to other models using the out of sample deviance.
6.3 Bagging for Real Data Claims Frequency
In this section, we fit the bagged tree predictor to the Motor Insurance Collision Data Set. We start
with a bivariate predictor and then generalize to higher dimensions.
6.3.1 Bivariate Bagged Tree Predictor
To begin with, we fit a bagged tree predictor using the covariates age of driver and price of car, analogously to Section 5.5. Based on the findings of Section 5.5.3, we choose maxnodes = 50 and based on Section 6.2.2 we set ntree = 100. Figure 6.4b shows the resulting bagged tree predictor and Figure 6.4a an individual tree used for aggregation. We see that bagging reduces the discrete nature of a regression tree. Table 6.1 compares the mean out of sample deviance of the bagged tree predictor with the optimal tree predictor (see Section 5.5.3) as well as the GLM and the Null model introduced in Section 5.5. We conclude that bagging has significantly improved the tree predictor, making it consistently better than the GLM model.
[Figure: fitted claims frequency over the age and price plane; (a) individual tree predictor, (b) bagged tree predictor]
Figure 6.4 – Visualization of the bivariate bagged tree predictor.
Mean Out of Sample Deviance
Year 2011 2012 2013 2014
Bagging 0.7241693 0.6907991 0.7757639 0.7786711
Tree 0.7246993 0.6911062 0.7763071 0.7792221
GLM 0.7247417 0.6911917 0.7760849 0.7791561
Null 0.7290833 0.6957688 0.7808736 0.7842622
Table 6.1 – Out of sample performance of the bivariate bagged tree predictor versus other methods. The best method for each year is highlighted in bold face.
6.3.2 Multivariate Bagged Tree Predictor
Next we generalize the predictor using additional covariates. First we fit a multivariate bagged tree predictor using the same covariates as included in AXA's GLM model. According to Section 5.6, we expect the optimal tree size of an individual tree to be roughly between 60 and 80 leaves. Based on the findings in Section 6.2, we set the parameter ntree = 100 and compare different choices of the tuning parameter maxnodes ∈ {60, 70, 80, 90, 100}. The performance was evaluated out of sample using the mean out of sample deviance (5.9). The results are summarized in Table 6.2. The different predictors are named according to the chosen maxnodes parameter. Analogously to Section 5.6, we also show the results when adding further covariates to the predictor (see Table 5.2). The corresponding out of sample performances for various choices of parameters are also shown in Table 6.2 and denoted by the suffix (l). In contrast to the single tree predictor, see Section 5.6,
adding additional covariates results in an improvement of the predictive capacity. The predictors
Mean Out of Sample Deviance
Year       2011       2012       2013       2014
m60        0.7177994  0.6849366  0.7683892  0.7720549
m70        0.7175936  0.6847973  0.7682821  0.7719311
m80        0.7174927  0.6847499  0.7681714  0.7718043
m90        0.7173910  0.6846422  0.7680176  0.7716766
m100       0.7143560  0.6823380  0.7720946  0.7759560
m80 (l)    0.7174013  0.6843935  0.7678068  0.7715305
m90 (l)    0.7212701  0.6776055  0.7749489  0.7787537
m100 (l)   0.7213253  0.6770380  0.7750003  0.7787436

Table 6.2 – Performance of the bagged tree predictor for different choices of tuning parameters. The suffix (l) marks the models with additional covariates according to the Tree Plus model shown in Table 5.2. The best method for each year is highlighted in bold face.
with the best out of sample performance are highlighted in bold face. We see that for both models maxnodes ∈ [80, 90] seems to be a reasonable choice of the tuning parameter. This is in line with our expectation based on the analysis of the multivariate tree predictor in Section 5.6 and the discussion in Section 6.2.1.
6.3.3 Out of Sample Performance
So far we have fit both bivariate and multivariate bagged tree predictors and discussed the choice
of the tuning parameter maxnodes. Next, we compare their performance out of sample to the
multivariate tree predictor (see Section 5.6), the GLM model as well as the Null model. Again, we
use the mean out of sample deviance as a measure of performance. Table 6.3 shows the corresponding
values for each method and accounting year. In Figure 6.5 we show the improvements with respect
Mean Out of Sample Deviance
Year                2011       2012       2013       2014
Bagged (m90)        0.7173910  0.6846422  0.7680176  0.7716766
Bagged (m80, l)     0.7174013  0.6843935  0.7678068  0.7715305
Bagged (bivariate)  0.7241693  0.6907991  0.7757639  0.7786711
Tree                0.7196545  0.6863301  0.7702547  0.7737576
GLM                 0.7172632  0.6843318  0.7688065  0.7715159
Null                0.7290833  0.6957688  0.7808736  0.7842622

Table 6.3 – Out of sample performance of the multivariate frequency models. The best method for each year is highlighted in bold face.
to the mean out of sample deviance of the Null model. We conclude that the multivariate bagged
tree predictor is able to compete in terms of predictive capacity with the GLM model. In contrast to
[Figure: improvement versus the Null model per accounting year (2011 to 2014) for Bagging (maxnodes = 90), Bagging (maxnodes = 80, large), Tree (Min CV) and GLM]
Figure 6.5 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis.
Section 5.6, the inclusion of additional covariates increased the predictive capacity. This shows the ability of the bagged tree predictor to cope with many (possibly correlated) covariates. In situations where the structure of the data is not known a priori, this is a large benefit over the GLM model. Note that we have used out of sample testing to calibrate the parameter maxnodes. Hence the comparison against the GLM is slightly biased in favour of the bagged tree predictor.
6.3.4 Visualization
Bagging is a black box model, i.e. there is no direct way to fully describe the fitted predictor in a closed form. However, for an insurance company it is essential not only to achieve high predictive capacity but also to understand the influence of individual covariates on the predicted values. In order to understand the rationale behind the predictions, we proceed with visualizing the model by applying the methods introduced in Section 3.4.4.
Variable Importance:
First we compare the variable importance of the bagged tree predictors Bagging (m90) and Bagging
(m80, l). Figure 6.6 shows the importance plot for both predictors. Comparing the results to the
variable importance plots of the multivariate tree predictors shown in Figures 5.32 and 5.34 we
observe a similar picture. We again note that some of the additionally included covariates appear
to have a very prominent impact. Similar to the discussion for multivariate trees in Section 5.6,
caution is required when interpreting the variable importance. Correlations among the covariates
can disturb the picture. Figure B.1 in the Appendix shows the presence of high correlations between
the original and additionally added covariates. For example Horse Power and Price are highly
[(a) Bagging (m90), variables by importance: Price, Age, Weight, Deduct., Region, Nat., Bonus, Car Age, Leasing, Car Class, Km; (b) Bagging (m80, l): Policy Age, Price, Weight, Office, Age, Deduct., Bonus, Horse Power, Car Age, Nat., Car Class, Region, Leasing, Km, Sex]
Figure 6.6 – Variable importance plots for the bagged tree predictors Bagging (m90) and Bagging (m80,l).
correlated; removing Horse Power may therefore significantly increase the importance of Price. Note that since the tree predictor is able to filter informative from non-informative covariates (see Section 5.3.6), the variable importance plot of a bagged tree predictor provides a useful tool for variable selection if a large number of explanatory variables is available.
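A possible way to obtain such an importance plot for a bagged predictor is sketched below; it assumes the varImp() function of Appendix C.1 and a bagged predictor storing its trees in bagged$trees, as in the wrapper sketched in Section 6.1.

# Variable importance of a bagged tree predictor: average the importance
# of each individual tree (zero for variables a tree does not use) and
# rescale to [0, 100] as in the plots.
bagged.varimp <- function(bagged) {
  imp.list <- lapply(bagged$trees, varImp)
  vars <- sort(unique(unlist(lapply(imp.list, rownames))))
  imp  <- sapply(imp.list, function(im) {
    out <- setNames(rep(0, length(vars)), vars)
    out[rownames(im)] <- im$Overall
    out
  })
  avg <- rowMeans(imp)
  sort(100 * avg / max(avg), decreasing = TRUE)
}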
Individual Conditional Expectation Plots:
Next we present the partial dependence and individual conditional expectation (ICE) plots (see Section 3.4.4) for the bagged tree predictor Bagging (m90). Figure 6.7 shows the numerical covariates and Figure 6.8 the categorical covariates. Overall we observe very similar characteristics as expected from the univariate analysis in Section 4.2.2, suggesting that the bagged tree predictor is able to capture all relevant features of the Motor Insurance Collision Data Set. The results are also in line with the variable importance plots shown in Figure 6.6. For example, the covariate kilometres yearly driven does not show much differentiation over the whole range of values, which supports the finding of Figure 6.6 that it has only a marginal influence.
Also note that the majority of lines for all covariates are parallel. This indicates that there are no dependencies of the form that the influence of a particular covariate changes based on changes in other covariates, see Section 3.4.4 for more details. For better visibility, Figures 6.7 and 6.8 show the individual conditional expectation curves for a random subsample (n = 10000) of the data. For the categorical covariates we have connected the predictions to maintain the interpretation of parallel curves.
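A minimal sketch of how the ICE curves (and the partial dependence function as their average) can be computed for a single covariate is given below; the covariate name age, the subsample size and the plotting commands are illustrative only.

# ICE curves for the covariate 'age': vary age over a grid for each
# observation of a random subsample while keeping all other covariates
# fixed, and record the prediction.
ice.curves <- function(model, data, grid, n.sample = 10000) {
  sub <- data[sample(nrow(data), min(n.sample, nrow(data))), ]
  curves <- sapply(grid, function(a) {
    tmp <- sub
    tmp$age <- a
    predict(model, newdata = tmp)
  })                                   # one row per observation
  list(grid = grid, ice = curves, pdp = colMeans(curves))
}

# ice <- ice.curves(bagged, data, grid = 18:90)
# matplot(ice$grid, t(ice$ice), type = "l", lty = 1, col = "grey")
# lines(ice$grid, ice$pdp, lwd = 2)    # partial dependence = average curve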
Figure 6.7 – Individual conditional expectation plots for numerical covariates.
Figure 6.8 – Individual conditional expectation plots for categorical covariates.
6.4 Summary
In the present chapter, we applied the bagged tree predictor to insurance claims data. In Section 6.2, we observed that maxnodes is the relevant parameter in terms of predictive capacity. We conclude that a reasonable choice is a value slightly larger than the size of the minimum cross-validation error tree. For the parameter ntree, we can say that the larger the better. However, in Section 6.2.2, we have seen that for our data ntree = 100 is already sufficiently large to obtain a good approximation of the bootstrap expectation. Although we could not prove that bagging positively affects the predictive capacity in terms of the Poisson deviance statistics, we observed an improvement of the bagged tree predictor compared to an individual tree predictor (see Figure 6.3 and Figure 6.5). When applying the method to the Motor Insurance Collision Data Set, we observed that, in contrast to the individual tree predictor, including additional variables does not negatively affect the predictive capacity. However, caution is required when interpreting the variable importance plots. In terms of out of sample performance, the bagged tree predictor was able to compete with the GLM model (see Figure 6.5). We conclude that bagging provides a competitive alternative to the well-known GLM model. In particular, it can also be used when little or nothing is known about the structure of the data. It is also able to deal with many covariates, including non-informative or correlated variables. In order to visualize the model, we have analyzed the ICE plots. The observed characteristics are in line with our observations of the univariate analysis.
Chapter 7
Data Analysis: Random Forest
In this chapter, we apply the Random Forest algorithm introduced in Section 3.4.3 to insurance claims data. In the beginning, we analyze the algorithm using simulated data and in a second step we apply the method to the Motor Insurance Collision Data Set.
7.1 Fitting a Random Forest Predictor in R
We use the R package randomForest to fit the random forest predictor. Note that the implementation
uses the mean squared error as the minimization criterion. The syntax to fit a random forest is very
similar to bagging. Let y = N/v be the response and x1, . . . , xp the covariates, then a random forest
can be fit by:
rf <- randomForest(y ~ x1+...+xp, data = data, mtry = floor(p/3), ntree = 100,
maxnodes = 30)
The additional parameter mtry specifies the number of randomly selected covariates among which the optimal split is searched for, see Algorithm 3.3.
Unfortunately, there currently exists no implementation of the random forest algorithm in R which is based on the weighted mean squared error or the Poisson deviance statistics. This is not optimal for an application to insurance claims data, which contain outliers due to small durations. A further drawback of the randomForest implementation is that it does not support surrogate splits and therefore cannot handle missing data.
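To make this concrete, a hedged sketch of how the claims data can be prepared for randomForest is given below; the column names nc (claims count) and jr (duration) follow Appendix C, the covariates are illustrative, and the restriction to full durations anticipates Section 7.3.1. The duration only enters through the constructed response, since, as noted above, the implementation provides no meaningful way to include the underlying volume.

library(randomForest)

# Response for randomForest: observed claims frequency per policy. Small
# durations jr produce outliers in y, see Section 7.3.
data$y <- data$nc / data$jr

# Restriction to observations with a full year at risk (Section 7.3.1):
data.full <- data[data$jr == 1, ]

rf <- randomForest(y ~ age + price + weight, data = data.full,
                   mtry = 3, ntree = 100, maxnodes = 90)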
7.2 Random Forest for Simulated Claims Frequency
In order to analyze the performance of the random forest algorithm, we perform a simulation study, where we simulate 10 training data sets L_k for k = 1, . . . , 10 based on Assumption 4.9. Additionally,
we generate a test data set with n = 100000000 observations. Next we fit the random forest predictor to each training data set and validate the performance out of sample on the test data set. Note that when simulating from Assumption 4.9, the data only contain two covariates. On the other hand, random forest is designed for high-dimensional problems and hence may not unfold all its potential in this simulation study. In particular, the model is not suitable to analyze the tuning parameter mtry. However, we can use it to test different choices of maxnodes ∈ {5, 10, 20, 30, 50, 70, 90, 110}. We set mtry = 1 and based on Section 6.2.2 we use ntree = 100. Figure 7.1 shows the performance of the random forest predictors and in addition the GLM model as well as the bagged tree predictor. We observe that the optimal choice of maxnodes is again obtained at 50. This is again explained by the fact that random forest is based on a variance reduction whilst only marginally changing the bias, for more details see Section 6.2.1. Although the performance of the random forest predictor is slightly worse than the bagged tree predictor, this may change when considering a problem of higher dimensionality as explained above. Compared to the GLM model, the random forest predictor performs consistently better. Also note that the slightly poorer performance of the random forest
[Figure: out of sample deviance boxplots for rf (m = 110), rf (m = 90), rf (m = 70), rf (m = 50), rf (m = 30), rf (m = 20), rf (m = 10), rf (m = 5), Bagging and GLM]
Figure 7.1 – Performance of the random forest for different choices of maxnodes (abbreviated by m), the optimal bagged tree predictor as well as the GLM model on a simulated claims frequency data set.
predictor compared to the bagged tree predictor may originate from the fact that the random forest
implementation is based on minimizing the mean squared error, whereas for our problem the Poisson
deviance statistics seems more appropriate. This issue becomes much more severe when the data
contains a large amount of observations without a full duration.
7.3 Random Forest for Real Data Claims Frequency
Next we apply the random forest algorithm to the Motor Insurance Collision Data Set. We include
the covariates used in AXA’s GLM model. The training data was preprocessed according to both
Data Set A and B, see Section 4.2.1. Since the random forest algorithm does not take the volume
explicitly into account, we expect the performance to be negatively affected by small durations. Therefore we perform the analysis on both data sets. We set the number of random forest trees to ntree = 100 and the size of a single tree to maxnodes = 90 based on the findings in Section 6.3.2 as well as Section 7.2. For the tuning parameter mtry we fit two random forest predictors with mtry ∈ {3, 8}, where 3 corresponds to the rule of thumb mtry = ⌊p/3⌋. The performance is evaluated out of sample using the mean out of sample deviance. Table 7.1 shows the obtained results including the Null model for comparison. We see that the random forest predictors are not even capable of competing with the Null model. In order to understand the reason for the bad performance, we have shown
Mean Out of Sample Deviance
Year                        2011       2012       2013       2014
randomForest (A, mtry = 3)  0.8093080  0.7900853  0.8774162  0.8656559
randomForest (A, mtry = 8)  0.8270472  0.8099446  0.9041067  0.8947512
randomForest (B, mtry = 3)  0.7889306  0.7663895  0.8546950  0.8434086
randomForest (B, mtry = 8)  0.8124648  0.7953953  0.8939658  0.8755047
Null                        0.7290833  0.6957688  0.7808736  0.7842622

Table 7.1 – Out of sample performance of various random forest predictors based on different training data sets and choices of the mtry parameter. For comparison the performance of the Null model is also shown.
the ICE plots of the covariates age and price in Figure 7.2, based on the random forest predictor randomForest(B, mtry = 3). They reveal that the predicted values are heavily distorted.
[Figure: partial dependence and ICE curves for Price (left) and Age (right) under the random forest predictor]
Figure 7.2 – ICE plot for the random forest predictor randomForest(B, mtry=3) showing heavily distorted predictions.
7.3.1 Restriction to Full Durations
Based on the findings of Figure 7.2, we suspect that the random forest algorithm is not able to deal with frequency outliers due to small durations. Since preparing the data according to Data Set B did not solve the issue, we fit the random forest predictor again, using only observations with a full year at risk. To assess the quality of the predictor, the mean out of sample deviance was evaluated. Table 7.2 shows the results for the random forest predictors as well as the Null model. We
Mean Out of Sample Deviance
Year                                2011       2012       2013       2014
randomForest (B, yr = 1, mtry = 3)  0.7184949  0.6834849  0.7704677  0.7745797
randomForest (B, yr = 1, mtry = 8)  0.7188954  0.6829554  0.7706336  0.7693152
Null                                0.7290833  0.6957688  0.7808736  0.7842622

Table 7.2 – Out of sample performance for random forest predictors based on the data set restricted to full durations only. The best method for each year is highlighted in bold face.
observe that when removing the small durations, the random forest predictor drastically improves. Note that so far we always fixed maxnodes = 90. However, we may conjecture that the optimal choice of maxnodes also depends on the parameter mtry. In order to better understand the impact of the tuning parameters, we have performed a grid search for the optimal parameters, varying mtry ∈ {2, 4, . . . , 10} and maxnodes ∈ {70, 80, 90, 100, 110, 120}. For the analysis we fit the random forest predictor on the data of 2010 (restricted to full durations) and evaluated the mean out of sample deviance on the data of 2011. The results are shown in Figure 7.3, where the performance is visualized graphically from best (white) to worst (dark blue). This suggests using the parameters
[Figure: mean out of sample deviance over the grid maxnodes ∈ {70, 80, 90, 100, 110, 120} (horizontal axis) and mtry ∈ {2, 4, 6, 8, 10} (vertical axis)]
Figure 7.3 – Improvement of the random forest predictors versus the Null model.
mtry = 6 and maxnodes = 120. We clearly see that mtry = 2 is too small; however, from Figure 7.3 we do not see a large difference in the performance when varying mtry ∈ {4, 6, 8, 10}. Note that
our problem is still rather low-dimensional with only 11 covariates. The benefit due to the choice of
mtry may increase when considering a larger set of explanatory variables.
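A sketch of this grid search, using the hypothetical column names of Appendix C, a character vector covariates of explanatory variable names and the pois.deviance() helper sketched in Section 6.2.3, could look as follows.

# Grid search over mtry and maxnodes: fit on the 2010 data (full durations
# only) and evaluate the mean out of sample deviance on the 2011 data.
grid <- expand.grid(mtry = seq(2, 10, by = 2),
                    maxnodes = seq(70, 120, by = 10))

grid$dev <- apply(grid, 1, function(par) {
  rf <- randomForest(y ~ ., data = train2010[, c("y", covariates)],
                     mtry = par["mtry"], ntree = 100,
                     maxnodes = par["maxnodes"])
  lambda.hat <- predict(rf, newdata = valid2011)
  pois.deviance(valid2011$nc, valid2011$jr, lambda.hat)
})

grid[which.min(grid$dev), ]   # parameter combination with smallest deviance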
In order to compare the random forest method to the other predictors, Table 7.3 shows the out of sample performance for the optimal random forest predictor, the optimal bagged tree predictor (see Section 6.3.3), the GLM and the Null model. In Figure 7.4 the improvements versus the Null model are shown. We conclude that the random forest predictor has a similar predictive capacity; however, there seems to be a larger variability between the years.
Mean Out of Sample Deviance
Year                           2011       2012       2013       2014
rf (maxnodes = 120, mtry = 8)  0.7179291  0.6829232  0.7698701  0.7738687
Bagging                        0.7174013  0.6843935  0.7678068  0.7715305
GLM                            0.7172632  0.6843318  0.7688065  0.7715159
Null                           0.7290833  0.6957688  0.7808736  0.7842622

Table 7.3 – Out of sample performance of the random forest predictor (based on the data restricted to full durations only) compared to the bagged tree predictor, the GLM and the Null model. The best method for each year is highlighted in bold face.
[Figure: improvement versus the Null model per accounting year (2011 to 2014) for randomForest (maxnodes = 120, mtry = 6), Bagging (maxnodes = 80, large) and GLM]
Figure 7.4 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis.
7.3.2 Visualization
In the previous section, we have fit a random forest predictor to the data restricted to observations with a full duration. The obtained predictor is able to compete with the bagged tree predictor as well as the GLM model in terms of mean out of sample deviance. However, both for the reasons explained in Section 6.3.4 and in order to understand the variability observed in Figure 7.4, we
7.4. SUMMARY 115
show the partial dependence resp. individual conditional expectation plots for the random forest
predictor. Figure 7.5 shows the numerical covariates and Figure 7.6 the categorical covariates for the
predictor rf(maxnodes = 120, mtry = 8). Looking at the estimated partial dependence function, we
can conclude that the random forest predictor is also able to capture the characteristics of the Motor
Insurance Collision Data Set. However the individual conditional expectation curves reveal that for
same covariate levels the random forest predictor fits unusual high frequencies, e.g. for young drivers
the frequencies go up to 60%. It is likely that these large values lead to the variability observed in
Figure 7.4. Hence restricting the data to full durations does not lead to a random forest predictor
which can compete with the bagged tree predictor.
Figure 7.5 – Individual conditional expectation plots for numerical covariates.
7.4 Summary
In the present chapter, we applied the random forest algorithm to insurance claims count data. Using simulated data, we showed that the tuning parameter maxnodes for the random forest algorithm can be chosen similarly to the maxnodes parameter of the bagged tree predictor. For choosing the parameter mtry, we used a grid search procedure to calibrate the model. We only observed small differences when varying mtry on our data set. With respect to modeling insurance claims data, the implementation unfortunately does not provide a meaningful way to include the underlying volumes of the observations into the model. The random forest implementation in the R package randomForest is based on minimizing the mean squared error without an option to modify the
Figure 7.6 – Individual conditional expectation plots for categorical covariates.
optimization criterion. For that reason, when applying the method to a real data set such as the Motor Insurance Collision Data Set, the predicted values are highly distorted by the large frequencies due to small durations. A possible solution to this issue is to restrict the data to observations with a full duration of one year at risk only. Although the predictive capacity measured by the out of sample deviance was in the range of a GLM model and the bagged tree predictor, the ICE plots revealed that the method predicts unusually large values for certain covariate values. Note that the benefits of random forest may only fully unfold if a larger number of explanatory variables is considered. Our multivariate analysis only contained 11 covariates, for which the improvements of using random forest were not sufficiently large to overcome the drawback of using the mean squared error instead of the Poisson deviance statistics for growing an individual random forest tree.
Chapter 8
Summary
In this master thesis we have analyzed tree-based algorithms to predict the expected claims frequency of a policyholder of a non-life insurance product. We have looked at three different methods: regression tree, bagging and random forest. From the basics of non-life insurance pricing, we observe that it is crucial to take the underlying volume into account. Therefore, we introduced the Poisson deviance statistics as a measure of goodness. Conceptually, the tree algorithm can be easily adapted to other performance measures. Fortunately, there already exists an efficient implementation within the rpart package for growing regression trees based on the deviance statistics derived from the Poisson distribution. This extension proved to be a reasonable approach to the problem of modeling claims frequencies. In order to deal with the degenerate case, the rpart implementation assumes a Poisson-Gamma model. However, the calibration of both the prior mean and the shrinkage parameter is not optimal from a credibility point of view. Furthermore, additional flexibility of the implementation would be desirable. In general the package does not provide detailed documentation and hence is less attractive to many potential users. Sections 3.2.1 and 5.1 provide a detailed introduction to Poisson regression trees with respect to the rpart implementation. In the empirical analysis, univariate regression trees provided an analytical tool to form subportfolios with similar risk characteristics. However, multivariate regression trees did not provide competitive alternatives to GLM models in terms of out of sample performance. In order to improve the performance, we have analyzed the variance reduction technique bagging. Although no decomposition of the Poisson deviance statistics into bias and variance is known, we have observed a significant performance improvement both on simulated and real world data. Overall the bagged tree predictor provides a competitive alternative to a GLM model, with the advantage that the underlying structure does not need to be provided in advance. In order to further improve the bagged tree predictor, we then evaluated the performance of the random forest algorithm, which is based on a further variance reduction by reducing the correlation between the individual trees within the ensemble. However, so far there does not exist an implementation using the Poisson deviance statistics rather than the mean squared error for growing the individual trees. As a consequence, we were not able to meaningfully
include the volume measure into the model and therefore random forest did not provide a competitive alternative. Since bagging and random forest are black box models, we considered several model visualization techniques. The results supported the fact that the bagged tree predictor is able to recover all characteristics of an insurance claims data set.
8.1 Future Work
This thesis put a strong focus on the empirical analysis of tree-based methods in the context of insurance claims data, based on the theoretical understanding described in the literature. The structure of the data led to a modification of the classical algorithm towards the Poisson deviance statistics rather than the mean squared error. However, the theoretical results on performance improvement due to bagging and random forest are crucially based on the use of the mean squared error. Nonetheless, when using the Poisson deviance statistics we have observed a significant improvement. Further research may be able to establish a theoretical foundation for our findings. From an implementation perspective, further improvements of the bagged tree predictor may be possible when implementing the random forest algorithm based on the Poisson deviance statistics as a measure of performance for growing the individual trees.
Regarding the regression tree method, the Poisson option assumes a Poisson-Gamma model to
avoid the problem of degeneracy. However further research is needed on how the Poisson-Gamma
model needs to be calibrated in a tree model for insurance claims data.
Furthermore, we have seen that the black box procedure bagging was able to compete with a GLM model in terms of predictive capacity. Therefore, one may also consider different methods such as neural networks or support vector machines for further research. However, our work suggests that being able to include the duration is a crucial component of a reasonable frequency model.
Finally, as described by equation (4.1), there are two components to the pricing of a non-life
insurance product: The expected claims frequency as well as the expected claim severity. This
work only focused on the frequency aspects. However we expect that large parts of the results are
also adaptable to the severity issue. A measure of performance may be derived from the Gamma
distribution, which is also commonly used within GLM models for the claim severity.
Appendices
Appendix A
Proofs
A.1 Proof of Equation (3.2)
Proof. We need to find $c_m$ such that
\[
\hat{c}_m = \underset{c_m \in \mathbb{R}}{\operatorname{arg\,min}} \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m)^2.
\]
Differentiation with respect to $c_m$ and setting the derivative to zero gives
\[
\frac{\partial}{\partial c_m} \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m)^2 = (-2) \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m) = 0.
\]
Solving for $c_m$ ends the proof.
A.2 Proof of Proposition 3.7
Proof. Let $T$ be a tree having $n$ leaves with root $T_0$ and primary branches $T_L$ and $T_R$ (see Figure A.1). We use the notation $T = T_0 + T_L + T_R$. Assuming existence of $T(\alpha)$, uniqueness follows by choosing the smallest one within a finite set of possible subtrees. The proof of existence follows by induction. The assertion is clear for $|T| = 1$. Suppose it is true for all trees having fewer than $n \geq 2$ leaves. Note that
\[
C_\alpha(T(\alpha)) = \min\{C_\alpha(T_0),\, C_\alpha(T_L(\alpha)) + C_\alpha(T_R(\alpha))\},
\]
where $T_L(\alpha)$ and $T_R(\alpha)$ are the smallest optimal subtrees of $T_L$ resp. $T_R$. If $C_\alpha(T_0) \leq C_\alpha(T_L(\alpha)) + C_\alpha(T_R(\alpha))$, then $T(\alpha) = T_0$, and otherwise $T(\alpha) = T_0 + T_L(\alpha) + T_R(\alpha)$. However, we have $|T_L| < n$ and $|T_R| < n$, hence by hypothesis $T_L(\alpha)$ and $T_R(\alpha)$ exist and are unique. It remains to show that for $\alpha_1 \leq \alpha_2$, $T(\alpha_2) \subset T(\alpha_1)$. We again show this by induction. Since for larger $\alpha$ we
Figure A.1 – Illustration of the primary branches TL and TR of T .
prune towards smaller trees, it follows that if $|T(\alpha_1)| = 1$ then also $|T(\alpha_2)| = 1$. Again we assume that the statement holds true for all trees having fewer than $n \geq 2$ leaves. Using the same argument as above, either $T(\alpha_2) = T_0$, in which case we are done, or $T(\alpha_2) = T_0 + T_L(\alpha_2) + T_R(\alpha_2)$. Since both $T_L$ and $T_R$ have fewer than $n$ leaves, it holds by hypothesis that $T_L(\alpha_2) \subset T_L(\alpha_1)$ and $T_R(\alpha_2) \subset T_R(\alpha_1)$. This implies that $T(\alpha_2) \subset T(\alpha_1)$.
A.3 Proof of Equation (3.7)
Proof. In order to calculate the deviance statistics for $y = (N_1, \ldots, N_n)$, with independent $N_i = N|_{X = x_i} \sim \mathrm{Poi}(\lambda(x_i) v_i)$, we need to express the log-likelihood as a function of the mean vector $\mu = (E(N_1), \ldots, E(N_n)) = (\lambda(x_1)v_1, \ldots, \lambda(x_n)v_n)$. It follows that
\[
\ell(\mu) = \sum_{i=1}^{n} \log\left( \frac{(\lambda(x_i)v_i)^{N_i}}{N_i!} \exp(-\lambda(x_i)v_i) \right)
= \sum_{i=1}^{n} N_i \log(\lambda(x_i)v_i) - \log(N_i!) - \lambda(x_i)v_i.
\]
Therefore the deviance statistics is given by
\[
\begin{aligned}
D(y, \mu) &= 2\,(\ell(y) - \ell(\mu)) \\
&= 2 \sum_{i=1}^{n} \Big( N_i \log(N_i) - \log(N_i!) - N_i - N_i \log(\lambda(x_i)v_i) + \log(N_i!) + \lambda(x_i)v_i \Big) \\
&= 2 \sum_{i=1}^{n} N_i \log\left( \frac{N_i}{\lambda(x_i)v_i} \right) - (N_i - \lambda(x_i)v_i).
\end{aligned}
\]
A.4 Proof of Equation (3.12)
Proof. We need to find $c_m$ such that
\[
\hat{c}_m = \underset{c_m \in \mathbb{R}}{\operatorname{arg\,min}} \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m)^2.
\]
Differentiation with respect to $c_m$ and setting the derivative to zero gives
\[
\frac{\partial}{\partial c_m} \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m)^2 = (-2) \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m) = 0.
\]
Solving for $c_m$ ends the proof.
A.5 Proof of Remark 4.6
Proof. Recall that $\bar{N}_i = \sum_{j=1}^{m_i} N_{i_j}$ and $\bar{v}_i = \sum_{j=1}^{m_i} v_j$. Then it follows that
\[
\begin{aligned}
&P\left(\frac{\bar{N}_i}{\bar{v}_i} \in \left[\lambda_i - \lambda_i\, \mathrm{Vco}\!\left(\frac{\bar{N}_i}{\bar{v}_i}\right),\; \lambda_i + \lambda_i\, \mathrm{Vco}\!\left(\frac{\bar{N}_i}{\bar{v}_i}\right)\right]\right) \\
&\quad= P\left(\frac{\bar{N}_i}{\bar{v}_i} \le \lambda_i + \lambda_i\, \mathrm{Vco}\!\left(\frac{\bar{N}_i}{\bar{v}_i}\right)\right) - P\left(\frac{\bar{N}_i}{\bar{v}_i} \le \lambda_i - \lambda_i\, \mathrm{Vco}\!\left(\frac{\bar{N}_i}{\bar{v}_i}\right)\right) \\
&\quad= P\left(\frac{\bar{N}_i - \bar{v}_i \lambda_i}{\bar{v}_i \lambda_i\, \mathrm{Vco}(\bar{N}_i)} \le 1\right) - P\left(\frac{\bar{N}_i - \bar{v}_i \lambda_i}{\bar{v}_i \lambda_i\, \mathrm{Vco}(\bar{N}_i)} \le -1\right) \\
&\quad= P\left(\frac{\sum_{j=1}^{m_i} N_{i_j} - E\big(\sum_{j=1}^{m_i} N_{i_j}\big)}{\sqrt{\mathrm{Var}\big(\sum_{j=1}^{m_i} N_{i_j}\big)}} \le 1\right) - P\left(\frac{\sum_{j=1}^{m_i} N_{i_j} - E\big(\sum_{j=1}^{m_i} N_{i_j}\big)}{\sqrt{\mathrm{Var}\big(\sum_{j=1}^{m_i} N_{i_j}\big)}} \le -1\right) \\
&\quad\approx \Phi(1) - \Phi(-1) \approx 70\%,
\end{aligned}
\]
where we used the central limit theorem in the fourth step.
A.6 Proof of Lemma 5.3
Proof. Recall for $N_{i_j} \sim \mathrm{NegBin}(\lambda_i v_j, \gamma)$ we have that $E(N_{i_j}) = \lambda_i v_j$ and $\mathrm{Var}(N_{i_j}) = \lambda_i v_j (1 + \lambda_i v_j / \gamma)$. Then the expression for the coefficient of variation in Lemma 5.3 follows by a direct calculation,
\[
\begin{aligned}
\mathrm{Vco}\!\left(\frac{\bar{N}_i}{\bar{v}_i}\right)
&= \sqrt{\mathrm{Var}\!\left(\frac{\sum_{j=1}^{m_i} N_{i_j}}{\sum_{j=1}^{m_i} v_j}\right)} \Bigg/ \, E\!\left(\frac{\sum_{j=1}^{m_i} N_{i_j}}{\sum_{j=1}^{m_i} v_j}\right) \\
&= \sqrt{\sum_{j=1}^{m_i} \mathrm{Var}(N_{i_j})} \Bigg/ \sum_{j=1}^{m_i} E(N_{i_j}) \\
&= \sqrt{\frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\gamma}\, \frac{\sum_{j=1}^{m_i} (\lambda_i v_j)^2}{\big(\lambda_i \sum_{j=1}^{m_i} v_j\big)^2}} \\
&= \sqrt{\frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\gamma}\, \frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2}}, \qquad (A.1)
\end{aligned}
\]
where we used the independence of the claims counts in the second equality. Next we show that
\[
\frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2} \to 0 \qquad \text{as } m_i \to \infty.
\]
Since there exist $\delta > 0$ and $M < \infty$ such that $0 < \delta \le v_j \le M < \infty$, we have that
\[
0 \le \frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2} \le \frac{\sum_{j=1}^{m_i} M^2}{\big(\sum_{j=1}^{m_i} \delta\big)^2} = \frac{m_i M^2}{m_i^2 \delta^2} \to 0 \qquad \text{for } m_i \to \infty.
\]
Together with $\sum_{j=1}^{m_i} v_j \to \infty$ as $m_i \to \infty$, we get that equation (A.1) converges to 0.
Appendix B
Correlation Plot for Additional Covariates
[Figure B.1: correlation plot for the covariates Horse Power, Price, Weight, Age and Policy Age on a colour scale from -1 to 1; the displayed pairwise correlations are 0.89, 0.63, -0.04, -0.09, 0.59, -0.02, -0.09, -0.03, -0.07 and 0.58]
Figure B.1 – Correlations between original and additionally included covariates.
Appendix C
R Code
C.1 Variable Importance
# Variable importance for a single rpart tree: for every covariate we sum the
# improvements of all primary and surrogate splits in which it is used
# (competing splits are discarded). Covariates that are never used get importance 0.
varImp <- function(object, ...){
  tmp <- rownames(object$splits)
  allVars <- colnames(attributes(object$terms)$factors)
  if(is.null(tmp)){
    # tree without any split: all variables get importance 0
    out <- NULL
    zeros <- data.frame(x = rep(0, length(allVars)),
                        Variable = allVars)
    out <- rbind(out, zeros)
  } else {
    rownames(object$splits) <- 1:nrow(object$splits)
    splits <- data.frame(object$splits)
    splits$var <- tmp
    splits$type <- ""
    frame <- as.data.frame(object$frame)
    index <- 0
    # label every row of the splits table as primary, competing or surrogate
    for(i in 1:nrow(frame)){
      if(frame$var[i] != "<leaf>"){
        index <- index + 1
        splits$type[index] <- "primary"
        if(frame$ncompete[i] > 0){
          for(j in 1:frame$ncompete[i]){
            index <- index + 1
            splits$type[index] <- "competing"
          }
        }
        if(frame$nsurrogate[i] > 0){
          for(j in 1:frame$nsurrogate[i]){
            index <- index + 1
            splits$type[index] <- "surrogate"
          }
        }
      }
    }
    splits$var <- factor(as.character(splits$var))
    splits <- subset(splits, type != "competing")
    # sum the split improvements per variable
    out <- aggregate(splits$improve,
                     list(Variable = splits$var),
                     sum,
                     na.rm = TRUE)
    allVars <- colnames(attributes(object$terms)$factors)
    if(!all(allVars %in% out$Variable)){
      missingVars <- allVars[!(allVars %in% out$Variable)]
      zeros <- data.frame(x = rep(0, length(missingVars)),
                          Variable = missingVars)
      out <- rbind(out, zeros)
    }
  }
  # return the importances sorted in decreasing order
  out2 <- data.frame(Overall = out$x)
  or <- order(out2$Overall, decreasing = TRUE)
  out3 <- data.frame(Overall = out2[or, ])
  rownames(out3) <- out$Variable[or]
  out3
}
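For illustration, the function can be applied to a Poisson regression tree fitted with rpart; the call below simply reuses the tree specification of Section C.2.

# Example use of varImp (the tree specification is the one from Section C.2).
library(rpart)
tree <- rpart(cbind(jr, nc) ~ age + price, data = data, method = "poisson",
              parms = list(shrink = 3),
              control = rpart.control(minbucket = 4000, cp = 0.00005))
varImp(tree)   # data frame of summed split improvements, sorted decreasingly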
C.2 Stratified Cross-Validation
# Order the data by observed frequency first. The cross-validation groups are
# then assigned along this ordering (stratification), so we do not need to
# switch between different orderings.
data <- data[order(data$freq), ]

tree <- rpart(cbind(jr, nc) ~ age + price, data = data, method = "poisson",
              parms = list(shrink = 3),
              control = rpart.control(minbucket = 4000, cp = 0.00005))

# Stratified K-fold cross-validation: for every subtree in the cp-table the
# Poisson deviance is evaluated on each of the K validation groups.
cv_customized <- function(K = 10, tree, data){
  stopifnot(is.unsorted(data$freq) == FALSE)   # check that the data is sorted
  set.seed(2)
  xgroup <- rep(1:K, length = nrow(data))      # every K-th record forms a group
  xfit <- xpred.rpart(tree, xgroup)            # cross-validated predictions per subtree
  n_subtrees <- dim(tree$cptable)[1]
  std <- numeric(n_subtrees)
  err <- numeric(n_subtrees)
  err_group <- numeric(K)
  for(i in 1:n_subtrees){
    for(k in 1:K){
      ind_group <- which(xgroup == k)
      # Poisson deviance of validation group k for subtree i
      err_group[k] <- 2 * sum(log((data[ind_group, "nc"] /
                                     (xfit[ind_group, i] * data[ind_group, "jr"]))^
                                    data[ind_group, "nc"]) -
                                (data[ind_group, "nc"] - xfit[ind_group, i] *
                                   data[ind_group, "jr"]))
    }
    std[i] <- sd(err_group) / sqrt(K)
    err[i] <- mean(err_group)
  }
  return(list("err" = err, "std" = std))
}
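A possible way of using the returned error estimates is sketched below; the selection of the complexity parameter via the one-standard-error rule is shown purely as an illustration.

# Illustrative use of cv_customized: choose the subtree by the one-standard-error rule.
cv <- cv_customized(K = 10, tree = tree, data = data)
best <- which.min(cv$err)
ind_1se <- which(cv$err <= cv$err[best] + cv$std[best])[1]   # smallest subtree within one SE
tree_pruned <- prune(tree, cp = tree$cptable[ind_1se, "CP"])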
C.3 Significance Test Pruning
# Significance test pruning: for every pair of sibling leaves, a two-sample
# Wilcoxon test compares the observed claims counts on the test data. If the
# difference is not significant (p-value > 5%), the corresponding split is pruned.
test.prune <- function(tree, data.test){
  pred.test <- predict(tree, newdata = data.test)
  pred.class <- factor(pred.test)
  ind_leaves <- which(tree$frame$var == '<leaf>')
  leaf_nr <- as.numeric(rownames(tree$frame)[ind_leaves])
  # Identifying relevant leaves for pruning (pairs of sibling leaves):
  left <- which(leaf_nr %% 2 == 0)
  leaf_nr_left <- leaf_nr[left]
  leaf_nr_left <- leaf_nr_left[which(leaf_nr[left + 1] - leaf_nr[left] == 1)]
  leaf_nr_right <- leaf_nr_left + 1
  bool_prune <- numeric(length(leaf_nr_right))
  for(i in 1:length(leaf_nr_left)){
    yval_left <- tree$frame$yval[which(rownames(tree$frame) ==
                                         as.character(leaf_nr_left[i]))]
    yval_right <- tree$frame$yval[which(rownames(tree$frame) ==
                                          as.character(leaf_nr_right[i]))]
    ind_left <- which(pred.test == yval_left)
    ind_right <- which(pred.test == yval_right)
    test <- wilcox.test(data.test$nc[ind_left],
                        data.test$nc[ind_right],
                        alternative = 'two.sided',
                        paired = FALSE, exact = TRUE)
    bool_prune[i] <- test$p.value > 0.05
  }
  # parent nodes whose two child leaves are not significantly different
  nodes_prune <- rownames(tree$frame)[which(rownames(tree$frame) %in%
                                              leaf_nr_left[which(bool_prune == 1)]) - 1]
  tree_prune <- tree
  if(length(nodes_prune) >= 1){
    for(k in 1:length(nodes_prune)){
      tree_prune <- snip.rpart(tree_prune,
                               toss = as.numeric(nodes_prune[k]))
    }
  }
  tree <- tree_prune
  cat('Number of pruned leaves: ', sum(bool_prune), '\n')
  return(list('tree' = tree, 'n_prune' = sum(bool_prune)))
}
# Minimal example for the use of the test.prune function,
# based on drawing bootstrap samples from the training data set.
tree <- rpart(cbind(jr, nc) ~ age + price, data = data, method = "poisson",
              parms = list(shrink = 3.5),
              control = rpart.control(minbucket = 1000, cp = 0.00001))
set.seed(222)
tree.stat <- tree
B <- 0
# keep pruning on fresh bootstrap samples until no leaf is pruned in 10 consecutive rounds
while(B < 10){
  data.test <- data[sample.int(n = nrow(data), size = nrow(data),
                               replace = TRUE), ]
  prune <- test.prune(tree.stat, data.test = data.test)
  tree.stat <- prune[[1]]
  if(prune[[2]] == 0) B <- B + 1
  if(prune[[2]] > 0) B <- 0
}
C.4 Bagging
# Bagging for rpart trees: fit ntree trees on bootstrap samples of the data and
# prune each of them back to at most 'maxnodes' leaves. Additional arguments in
# '...' are passed on to rpart (e.g. method, parms, control).
bagging <- function(formula, data, maxnodes, ntree, ...){
  require(rpart)
  splits <- maxnodes - 1
  n <- nrow(data)
  trees <- list()
  for(b in 1:ntree){
    cat('Number of bagged Tree: ', b, '\n')
    # bootstrap sample of the same size as the training data
    b.ind <- sample.int(n, size = n, replace = TRUE)
    data.b <- data[b.ind, ]
    tmp <- rpart(formula, data = data.b, ...)
    # cp value of the largest subtree with at most 'maxnodes' leaves
    cp <- tmp$cptable[tail(which(tmp$cptable[, "nsplit"] <= splits),
                           n = 1), "CP"]
    cat('Number of Leaves: ',
        tmp$cptable[tail(which(tmp$cptable[, "nsplit"] <= splits),
                         n = 1), "nsplit"] + 1, '\n')
    trees[[b]] <- prune(tmp, cp = cp)
  }
  return(trees)
}
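The returned list of trees can then be combined by averaging the individual tree predictions. The following usage sketch is illustrative; the chosen values of maxnodes and ntree are assumptions, and the rpart settings are the ones used in Section C.2.

# Illustrative use of the bagging function; maxnodes and ntree are chosen arbitrarily.
trees <- bagging(cbind(jr, nc) ~ age + price, data = data,
                 maxnodes = 20, ntree = 50,
                 method = "poisson", parms = list(shrink = 3),
                 control = rpart.control(minbucket = 4000, cp = 0.00005))
# bagged frequency prediction: average of the individual tree predictions
pred_bag <- rowMeans(sapply(trees, predict, newdata = data))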
Appendix D
Variable Description
Variable Name    Type          Description
ac               Numerical     Claim amount
age              Numerical     Age of driver
beg              Date          Start date of record
bonus            Categorical   Bonus protection
car age          Numerical     Years since first licensing of car
car class        Categorical   Car classification
deductible       Numerical     Deductible of policy
end              Date          End date of record
kg               Numerical     Weight of car
km               Numerical     Kilometres driven per year
leasing          Categorical   0: not leasing, 1: leasing
nat              Categorical   Nationality
nc               Numerical     Number of claims
office           Categorical   Regional office to which policy is assigned
plz              Categorical   Zip code of place of residence of driver
pol age          Numerical     Age of policy
pol              Categorical   Identifier for every policyholder at AXA
price            Numerical     Price of car
ps               Numerical     Horse power of car
region           Categorical   Region classification used by AXA
sex              Categorical   Sex of driver
yr               Numerical     Duration (years at risk)

Table D.1 – List of variables used in this thesis.
Bibliography
R. A. Berk. Statistical Learning from a Regression Perspective. Springer Series in Statistics. Springer
New York, 2008.
P. Bühlmann and M. Mächler. Lecture Notes on Computational Statistics. Seminar für Statistik. ETH
Zurich, 2014. URL https://stat.ethz.ch/education/semesters/ss2015/CompStat/sk.pdf.
L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.
L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.
L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees.
Wadsworth, 1984.
H. Bühlmann and A. Gisler. A Course in Credibility Theory and its Applications. Universitext.
Springer Berlin Heidelberg, 2005.
B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1–26, 1979.
A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing
statistical learning with plots of individual conditional expectation. 2014. URL
http://arxiv.org/abs/1309.6392.
T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in
Statistics. Springer New York, 2001.
G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with
Applications in R. Springer Texts in Statistics. Springer New York, 2014.
R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection.
Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–
1143, 1995.
M. LeBlanc and J. Crowley. Relative risk trees for censored survival data. Biometrics, 48(2):411–425,
1992.
A. Negassa, A. Ciampi, M. Abrahamowicz, S. Shapiro, and J.-F. Boivin. Tree-structured prognostic
classification for censored survival data: validation of computationally inexpensive model selection
criteria. Journal of Statistical Computation and Simulation, 67(4):289–317, 2000.
E. Ohlsson and B. Johansson. Non-Life Insurance Pricing with Generalized Linear Models. EAA
Series. Springer Berlin Heidelberg, 2010.
C. Shalizi. Lecture notes on data mining. 2009. URL http://www.stat.cmu.edu/~cshalizi/350.
T. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive Partitioning and Regression Trees, 2015.
URL http://CRAN.R-project.org/package=rpart. R package version 4.1-9.
M. V. Wüthrich. Non-Life Insurance: Mathematics & Statistics. SSRN, 2015. URL
http://ssrn.com/abstract=2319328.