
Data Science in Non-Life Insurance Pricing

Predicting Claims Frequencies using Tree-Based Models

Master's Thesis

Submitted for the degree of Master of Science ETH in Mathematics

Author
Patrick Zöchbauer

Supervisor
Prof. Dr. Mario V. Wüthrich
Department of Mathematics, ETH Zürich

Co-Supervisor
Dr. Christoph Buser
AXA Winterthur

Date of Submission: 22 July 2016


Abstract

Today, generalized linear models (GLM) are the standard method for pricing non-life insurance products. However, they require detailed knowledge about the structure of the data, which needs to be provided a priori by the pricing actuary. In this thesis, we present more flexible tree-based methods that do not suffer from this drawback. In the first part, we discuss their theoretical properties. In the second part, we apply the methods to the problem of predicting the claims frequency of a policyholder. The empirical analysis is based on the Motor Insurance Collision Data Set of AXA Winterthur. We conclude that the regression tree methodology provides a useful tool for building homogeneous subportfolios with similar risk characteristics. The more advanced tree ensemble methods such as bagging and random forest provide competitive alternatives to GLMs in terms of predictive capacity.


Acknowledgement

I would like to express my profound gratitude to Prof. Dr. Mario V. Wüthrich for supervising this master's thesis. Our numerous and inspiring discussions truly deepened my understanding of mathematics as well as its applications to actuarial science. I am also very thankful to Dr. Christoph Buser for co-supervising this thesis and providing the possibility to write this work in cooperation with AXA Winterthur. In that spirit, my gratitude also extends to AXA Winterthur, in particular to Peter Reinhard and the whole pricing team, for allowing me to be their guest and providing a delightful atmosphere as well as invaluable support.

Finally, I thank my family and friends for their confidence and constant support during all my time at ETH Zürich.


Contents

1 Introduction

Part I: Theory

2 Statistical Learning Theory
2.1 Mathematical Framework and Terminology
2.2 Prediction and Inference
2.2.1 Prediction
2.2.2 Inference
2.3 Generalization Performance
2.3.1 Cross-Validation
2.3.2 Bootstrap
2.4 Maximum Likelihood Method

3 Tree-Based Models
3.1 Basics of Tree-Based Models
3.1.1 Tree Terminology
3.1.2 Growing a Regression Tree
3.1.3 Cost-Complexity Pruning
3.2 Modifications
3.2.1 Poisson Regression Tree
3.2.2 Weighted Least Squares
3.2.3 Classification Tree
3.3 Additional Features
3.3.1 Categorical Covariates
3.3.2 Missing Data and Surrogate Splits
3.3.3 Variable Importance
3.4 Random Forest
3.4.1 Bagging
3.4.2 Bagged Tree Predictor
3.4.3 Random Forest
3.4.4 Model Visualization

Part II: Data Analysis

4 Motor Insurance Collision Data
4.1 Non-Life Insurance Pricing
4.2 Motor Insurance Collision Data
4.2.1 Data Management
4.2.2 Descriptive Statistics
4.2.3 Multivariate Analysis
4.3 Simulated Insurance Claims Count Data
4.3.1 Negative-Binomial Distribution
4.3.2 Choice of Parameter

5 Data Analysis: Tree Model
5.1 Fitting a Tree Model in R
5.2 Regression Tree for Claims Frequency
5.3 Regression Tree for Simulated Claims Frequency
5.3.1 Choice of Parameter for Poisson Tree Predictor
5.3.2 Stratified Cross-Validation
5.3.3 Optimal Tree Size
5.3.4 Tree Size Selection via Statistical Significance Pruning
5.3.5 Variability of Data
5.3.6 Variable Selection
5.3.7 Stability
5.4 Univariate Regression Tree for Real Data Claims Frequency
5.5 Bivariate Regression Tree for Real Data Claims Frequency
5.5.1 Tree Size Selection
5.5.2 Comparison: Tree and GLM
5.5.3 Out of Sample Performance
5.6 Multivariate Regression Tree for Real Data Claims Frequency
5.6.1 Out of Sample Performance
5.7 Summary

6 Data Analysis: Bagging
6.1 Fitting a Bagged Tree Predictor in R
6.2 Bagging for Simulated Claims Frequency
6.2.1 Tree Size of Individual Trees
6.2.2 Number of Trees
6.2.3 Out of Sample Performance
6.3 Bagging for Real Data Claims Frequency
6.3.1 Bivariate Bagged Tree Predictor
6.3.2 Multivariate Bagged Tree Predictor
6.3.3 Out of Sample Performance
6.3.4 Visualization
6.4 Summary

7 Data Analysis: Random Forest
7.1 Fitting a Random Forest Predictor in R
7.2 Random Forest for Simulated Claims Frequency
7.3 Random Forest for Real Data Claims Frequency
7.3.1 Restriction to Full Durations
7.3.2 Visualization
7.4 Summary

8 Summary
8.1 Future Work

Appendices

A Proofs
A.1 Proof of Equation (3.2)
A.2 Proof of Proposition 3.7
A.3 Proof of Equation (3.7)
A.4 Proof of Equation (3.12)
A.5 Proof of Remark 4.6
A.6 Proof of Lemma 5.3

B Correlation Plot for Additional Covariates

C R Code
C.1 Variable Importance
C.2 Stratified Cross-Validation
C.3 Significance Test Pruning
C.4 Bagging

D Variable Description


Chapter 1

Introduction

The amount of data available in the world has been exploding over the last decade. Being able to process this flood of data is one of the major challenges people and businesses are faced with today. The potential benefits of data analytics are present in any industry, and so they are in the insurance sector. The core of an insurance company's business is underwriting risks: actuaries need to evaluate the potential losses transferred to the insurance company and set a premium accordingly. The data used to support these decisions is growing both in the number of explanatory variables and in the number of data points available. The standard tools to process this data are generalized linear models (GLM). The GLM methodology requires detailed a priori knowledge about the structure of the data. If the structure is known, GLMs provide high predictive capacity. However, the recent trend towards collecting not only more granular data but also more and more variables raises the need for more flexible methods. In this thesis, alternative approaches, called tree models, are discussed in detail. Tree models do not suffer from the limitation of requiring a priori knowledge of the structure and can also be used when a large number of potential variables is available. We use the Motor Insurance Collision Data Set of AXA Winterthur to analyze and compare the models. Since the data has already been extensively studied and is well understood by AXA Winterthur, their GLM model provides an optimal benchmark. This thesis is organized in two parts: Theory and Data Analysis. We begin with a discussion of the basic notions of statistical learning in Chapter 2. Chapter 3 introduces the tree-based models: regression tree, bagging and random forest. The second part starts with an introduction to the fundamentals of non-life insurance pricing and a description of the Motor Insurance Collision Data Set in Chapter 4. We define the pure premium of an insurance product as

Pure Premium = Expected Claims Frequency · Expected Claim Severity.

Based on this equation, actuaries model the two quantities claims frequency and claim severity

separately. In this thesis, we only focus on the claims frequency models. Chapters 5 to 7 are

dedicated to analyzing and applying the models introduced in the first part to insurance claims


data. Thereby a strong focus is put on understanding the individual tree predictor, since it serves

as the key building block for both bagging and random forest. To fully understand the benefits and

drawbacks of the various models, the data analysis was conducted both on simulated data and the

data set of AXA Winterthur.


Part I

Theory


Chapter 2

Statistical Learning Theory

We start with a short introduction to the main statistical concepts that we will encounter in this thesis. The chapter is based on Bühlmann and Mächler (2014), James et al. (2014) as well as Wüthrich (2015).

2.1 Mathematical Framework and Terminology

Let $(\Omega, \mathcal{F}, P)$ be a probability space containing the random variables $X \in \mathbb{R}^p$ and $\varepsilon \in \mathbb{R}$ with $\mathrm{E}(\varepsilon) = 0$. Suppose the data point $(X, Y) \in \mathbb{R}^p \times \mathbb{R}$ is modeled by the relationship
\[
Y = f(X) + \varepsilon, \tag{2.1}
\]
where $f : \mathbb{R}^p \to \mathbb{R}$ is a deterministic, but unknown function of $X = (X_1, \ldots, X_p)$. The variables $X_1, \ldots, X_p$ are called covariates, predictors or explanatory variables. The function $f$ is interpreted to capture all systematic effects of the covariates $X_1, \ldots, X_p$ on the response $Y$. The term $\varepsilon$ can be understood as random noise accounting for measurement error or the inability of the model to capture all underlying non-systematic effects. Using $\mathrm{E}(\varepsilon) = 0$, it follows that $f(x) = \mathrm{E}(Y \mid X = x)$. Given a training/learning set of i.i.d. previously observed data points $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, n\}$ drawn from $(X, Y)$, our goal is to find an estimate $\hat{f}(x, \mathcal{L})$ of the unknown function $f$ based on $\mathcal{L}$. We drop the argument $\mathcal{L}$ and write $\hat{f}_n(x)$ if there is no ambiguity regarding the training set $\mathcal{L}$. If the response $Y$ is continuous, we refer to the problem of determining an optimal estimate $\hat{f}_n$ as regression.

We use the convention that random variables are denoted by capital letters and their realizations by small letters. To simplify notation, vectors $x = (x_1, \ldots, x_p) \in \mathbb{R}^p$ are frequently written in bold face.

Example 2.1. The best known example of regression is linear regression, where we assume that
\[
f(x_1, \ldots, x_p) = \beta_0 + \beta_1 x_1 + \ldots + \beta_p x_p,
\]
for unknown parameters $\beta_0, \ldots, \beta_p \in \mathbb{R}$. Given the training set $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, n\}$ with $x_i = (x_{i1}, \ldots, x_{ip})$, an estimate $\hat{f}_n(x_1, \ldots, x_p) = \hat{\beta}_0 + \hat{\beta}_1 x_1 + \ldots + \hat{\beta}_p x_p$ can be obtained by minimizing the squared error
\[
\min_{\beta_0, \ldots, \beta_p} \sum_{i=1}^{n} \big(y_i - \hat{f}_n(x_{i1}, \ldots, x_{ip})\big)^2. \tag{2.2}
\]
Figure 2.1 shows an example of linear regression for $p = 1$. The training data set $\mathcal{L} = \{(x_i, y_i) : i = 1, \ldots, 25\}$ was generated from
\[
Y = 1 + \tfrac{1}{2} X + \varepsilon,
\]
where $X \sim \mathcal{U}[0, 100]$ and $\varepsilon \sim \mathcal{N}(0, 100)$.

[Figure 2.1 – An example of linear regression with $p = 1$. The true underlying function $f(x) = 1 + 0.5x$ is indicated by the red line. The blue line shows the linear regression estimate $\hat{f}_{25}$ determined by (2.2) from $n = 25$ i.i.d. observations $(x_i, y_i)$.]

?
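To make Example 2.1 concrete, the following R snippet (a minimal sketch, not taken from the thesis) reproduces a data set of the same form as in Figure 2.1 and fits the linear regression estimate with lm:

```r
# Sketch reproducing the setting of Example 2.1 / Figure 2.1 (illustration only).
set.seed(1)                         # arbitrary seed, not from the thesis
n   <- 25
x   <- runif(n, min = 0, max = 100) # X ~ U[0, 100]
eps <- rnorm(n, mean = 0, sd = 10)  # epsilon ~ N(0, 100), i.e. variance 100
y   <- 1 + 0.5 * x + eps            # Y = 1 + X/2 + epsilon

fit <- lm(y ~ x)                    # minimizes the squared error (2.2)
coef(fit)                           # estimates of beta_0 and beta_1

plot(x, y)
abline(a = 1, b = 0.5, col = "red") # true function f(x) = 1 + 0.5x
abline(fit, col = "blue")           # fitted regression line
```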

2.2 Prediction and Inference

2.2.1 Prediction

In prediction we face the problem of forecasting the outcome of a random response variable $Y \sim P$ for given covariates $x_1, \ldots, x_p$. For this problem, we specify a predictor $\hat{Y}$ of $Y = f(x_1, \ldots, x_p) + \varepsilon$, e.g. by setting $\hat{Y} = \hat{f}_n(x_1, \ldots, x_p)$. In order to assess the quality of our prediction, we need a measure of goodness. The most commonly and widely used measure, called mean squared error, is defined by
\[
\mathrm{E}\Big[\big(Y - \hat{Y}\big)^2\Big].
\]

Proposition 2.2. Let $(\Omega, \mathcal{F}, P)$ be a probability space containing the i.i.d. training data set $\mathcal{L} = \{(X_i, Y_i) : i = 1, \ldots, n\}$ as well as an i.i.d. data point $(X, Y)$. Assume both the data point $(X, Y)$ and the learning set $\mathcal{L}$ follow the regression model (2.1). For any predictor $\hat{Y} = \hat{f}(X, \mathcal{L})$, we have the following decomposition of the conditional mean squared error
\[
\mathrm{E}\Big[\big(Y - \hat{Y}\big)^2 \,\Big|\, X = x\Big] = \operatorname{Bias}\big(\hat{f}(x, \mathcal{L})\big)^2 + \operatorname{Var}\big(\hat{f}(x, \mathcal{L})\big) + \operatorname{Var}(\varepsilon), \tag{2.3}
\]
where $\operatorname{Bias}\big(\hat{f}(x, \mathcal{L})\big) := f(x) - \mathrm{E}\big(\hat{f}(x, \mathcal{L})\big)$.

Proof. Using the regression model equation (2.1), we get that $\operatorname{Var}(Y \mid X = x) = \operatorname{Var}(\varepsilon)$. Furthermore, conditionally given $X$, $Y$ and $\hat{f}(X, \mathcal{L})$ are independent. Using this, the proof follows by a direct calculation,
\begin{align*}
\mathrm{E}\Big[\big(Y - \hat{Y}\big)^2 \,\Big|\, X = x\Big]
&= \mathrm{E}\Big[Y^2 + \hat{f}(X, \mathcal{L})^2 - 2 Y \hat{f}(X, \mathcal{L}) \,\Big|\, X = x\Big] \\
&= \mathrm{E}\big[Y^2 \mid X = x\big] - \mathrm{E}\big[Y \mid X = x\big]^2 + \mathrm{E}\big[\hat{f}(x, \mathcal{L})^2\big] - \mathrm{E}\big[\hat{f}(x, \mathcal{L})\big]^2 \\
&\quad + \mathrm{E}\big[Y \mid X = x\big]^2 + \mathrm{E}\big[\hat{f}(x, \mathcal{L})\big]^2 - 2\, \mathrm{E}\big[Y \mid X = x\big]\, \mathrm{E}\big[\hat{f}(x, \mathcal{L})\big] \\
&= \operatorname{Var}(Y \mid X = x) + \operatorname{Var}\big(\hat{f}(x, \mathcal{L})\big) + \Big(\mathrm{E}\big[Y \mid X = x\big] - \mathrm{E}\big[\hat{f}(x, \mathcal{L})\big]\Big)^2 \\
&= \operatorname{Bias}\big(\hat{f}(x, \mathcal{L})\big)^2 + \operatorname{Var}\big(\hat{f}(x, \mathcal{L})\big) + \operatorname{Var}(\varepsilon),
\end{align*}
where we used the conditional independence of $Y$ and $\hat{f}(X, \mathcal{L})$ in the second and $\operatorname{Var}(Y \mid X = x) = \operatorname{Var}(\varepsilon)$ in the fourth equality.

The mean squared error is obtained by taking the expectation with respect to the covariates $X$, i.e.
\begin{align*}
\mathrm{E}\Big[\big(Y - \hat{Y}\big)^2\Big]
&= \mathrm{E}\Big[\mathrm{E}\Big[\big(Y - \hat{Y}\big)^2 \,\Big|\, X\Big]\Big] \\
&= \mathrm{E}\Big[\operatorname{Bias}\big(\hat{f}(X, \mathcal{L}) \mid X\big)^2\Big] + \mathrm{E}\Big[\operatorname{Var}\big(\hat{f}(X, \mathcal{L}) \mid X\big)\Big] + \operatorname{Var}(\varepsilon), \tag{2.4}
\end{align*}
where $\operatorname{Bias}\big(\hat{f}(X, \mathcal{L}) \mid X\big) = f(X) - \mathrm{E}\big(\hat{f}(X, \mathcal{L}) \mid X\big)$. Proposition 2.2 gives more detailed insight into the behavior of the mean squared error. Unfortunately, we cannot control the variance $\operatorname{Var}(\varepsilon)$ of the noise term. Hence, our goal will be to minimize the remaining two terms. Very often it is not possible to minimize both simultaneously. This problem is commonly known as the bias-variance trade-off. Since in practice we neither know the distribution of $Y$ nor the distribution of $X$, the mean squared error cannot be calculated directly. Instead, we need to estimate it. Suppose we are given $n$ observations $(x_i, y_i)$ for $i = 1, \ldots, n$. We use the following estimator for the (in-sample) mean squared error
\[
\mathrm{MSE} = \frac{1}{n} \sum_{i=1}^{n} \big(y_i - \hat{f}_n(x_i)\big)^2. \tag{2.5}
\]

Note that we should carefully distinguish between prediction error on data that was used during the

estimation procedure and previously unseen data. For more details, see Section 2.3 below.
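The decomposition (2.3) can also be checked numerically. The following R sketch (illustration only, not part of the thesis) estimates bias, variance and noise at a fixed covariate value $x_0$ in the setting of Example 2.1 by repeatedly drawing training sets:

```r
# Sketch: Monte Carlo check of the bias-variance decomposition (2.3)
# at a fixed point x0, in the setting of Example 2.1 (illustration only).
set.seed(2)
n <- 25; x0 <- 50; B <- 5000
f <- function(x) 1 + 0.5 * x           # true regression function

pred <- numeric(B)
for (b in 1:B) {                       # draw B independent training sets L
  x <- runif(n, 0, 100)
  y <- f(x) + rnorm(n, sd = 10)        # Var(eps) = 100
  pred[b] <- predict(lm(y ~ x), newdata = data.frame(x = x0))
}

bias2    <- (f(x0) - mean(pred))^2     # Bias(f^(x0, L))^2
variance <- var(pred)                  # Var(f^(x0, L))
noise    <- 100                        # Var(eps)

# left-hand side of (2.3), estimated with fresh responses at x0
y_new    <- f(x0) + rnorm(B, sd = 10)
mse_cond <- mean((y_new - pred)^2)

c(bias2 + variance + noise, mse_cond)  # the two numbers should be close
```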

2.2.2 Inference

Whereas the objective of prediction was accuracy, the aim of inference is to understand the distribution $P$ of the random response $Y$. Very often $P$ involves unknown parameters which need to be estimated. For example, given the covariates $X = x$ we can estimate the (unknown) mean of the response $Y$ in model (2.1) by an estimator $\hat{\mu}_Y$. If we choose $\hat{\mu}_Y = \hat{f}_n(x)$, then $\hat{f}_n(x)$ serves both as a predictor for $Y$ and as an estimate for $\mathrm{E}(Y \mid X = x)$, the expectation of $Y \mid X = x$. In the regression problem (2.1), the unknown distribution $P$ of $Y$ depends on the covariates $X$. In practice, often only few of the $p$ predictor variables are relevant for the response $Y$. Understanding the structure and dependencies, we can reduce the complexity of the model and therefore the variance of the predictor $\hat{f}(X, \mathcal{L})$. By (2.4) this eventually also improves the mean squared error.

2.3 Generalization Performance

The generalization performance is a measure of the predictive capacity of a statistical method on new, previously unseen test data. In general, to fit a method to a data set, one aims to minimize the expected error for a given loss function, e.g.
\[
\mathrm{E}\big[L\big(Y, \hat{Y}\big)\big] = \mathrm{E}\Big[\big(Y - \hat{Y}\big)^2\Big],
\]
with loss function $L(Y, \hat{Y}) = (Y - \hat{Y})^2$, previously called mean squared error. In the sequel, we use the squared error loss to illustrate all concepts. In general, the loss function $L$ can be chosen according to the problem, see e.g. Section 2.4. Since most methods are designed to minimize the in-sample error, it is overoptimistic to use the in-sample error as an estimate for the generalization performance as well. Caring about the performance on previously unseen data, we should use a training set $\mathcal{L}$ to fit the predictor and evaluate the performance on new i.i.d. test data $(X_{\text{new}}, Y_{\text{new}})$. Based on that, we are interested in the expected loss given by
\[
\mathrm{GE} := \mathrm{E}\Big[\big(Y_{\text{new}} - \hat{f}(X_{\text{new}}, \mathcal{L})\big)^2\Big] = \mathrm{E}\Big[\mathrm{E}\Big[\big(Y_{\text{new}} - \hat{f}(X_{\text{new}}, \mathcal{L})\big)^2 \,\Big|\, \mathcal{L}\Big]\Big], \tag{2.6}
\]
where we use $\hat{f}(X_{\text{new}}, \mathcal{L})$ as a predictor for $Y_{\text{new}}$. The right-hand side of equation (2.6) states that we first fix the training set $\mathcal{L}$ and consider the expected quadratic error of $Y_{\text{new}}$ for the given predictor $\hat{f}(X_{\text{new}}, \mathcal{L})$. In a second step, we average over all training sets $\mathcal{L}$, i.e. over all predictors $\hat{f}(X_{\text{new}}, \mathcal{L})$.

2.3.1 Cross-Validation

In practice, we need to estimate the generalization error since we usually neither know the distribution of $Y$ nor the distribution of $X$. Motivated by (2.6) we may partition the data into two independent, i.e. disjoint, sets: a training set $\mathcal{L}$ as well as a testing set $\mathcal{T}$. However, there is usually only a limited amount of data available, such that it is not possible to split the data into two reasonably large disjoint sets. Cross-validation (CV) is a general method to estimate the generalization performance whilst making maximal use of the data available. For our purpose, we are going to describe $K$-fold CV: assume we are given i.i.d. observations $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ of $(X, Y)$. In order to get both training and test data sets, we randomly partition the observations $\mathcal{L}$ into $K$ roughly equally sized subsets $B_k \subset \mathcal{L}$ such that $B_i \cap B_j = \emptyset$ for $i \neq j$. We then use the $k$-th set for testing and the remaining $K - 1$ sets to train the predictor, denoted by $\hat{f}(x, \mathcal{L} \setminus B_k)$. We do this for all $k = 1, \ldots, K$ and evaluate the performance on the set $B_k$. This leads to the generalization error estimate
\[
\widehat{\mathrm{GE}} = \frac{1}{K} \sum_{k=1}^{K} \frac{1}{|B_k|} \sum_{\{i : (x_i, y_i) \in B_k\}} \big(y_i - \hat{f}(x_i, \mathcal{L} \setminus B_k)\big)^2. \tag{2.7}
\]
Equation (2.7) can be seen as an estimate of the right-hand side of equation (2.6). There are several possible choices to obtain the partition $(B_k)_{k=1,\ldots,K}$. In the remainder, we will use the following two:

Classical: Most commonly, the partition $(B_k)_{k=1,\ldots,K}$ is obtained by drawing without replacement from the data $\mathcal{L}$.

Stratified: For stratified cross-validation, we first order the data according to a stratification variable (e.g. the response). Then we form bins with the $K$ smallest observations (according to the stratification variable), then the $K$ next smallest, and so on. In the end, we draw from these bins without replacement, resulting in a partition of the data for which all subsets $B_k$ have a similar distribution of the stratification variable. For more details see Algorithm 5.1 below.
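As an illustration of the classical $K$-fold scheme and the estimate (2.7), the following R sketch (not from the thesis; the thesis' own stratified implementation is given in Appendix C.2) computes $\widehat{\mathrm{GE}}$ for a generic placeholder predictor:

```r
# Minimal sketch of classical K-fold cross-validation, estimate (2.7).
# 'data' is a data.frame with response y and covariates; illustration only.
cv_generalization_error <- function(data, K = 10) {
  n    <- nrow(data)
  fold <- sample(rep(1:K, length.out = n))  # classical random partition into K folds
  cv_k <- numeric(K)
  for (k in 1:K) {
    train <- data[fold != k, ]              # L \ B_k
    test  <- data[fold == k, ]              # B_k
    fit   <- lm(y ~ ., data = train)        # placeholder predictor f^(., L \ B_k)
    pred  <- predict(fit, newdata = test)
    cv_k[k] <- mean((test$y - pred)^2)      # CV_k, the performance on the k-th fold
  }
  c(GE.hat = mean(cv_k),                    # estimate (2.7)
    sd.hat = sd(cv_k) / sqrt(K))            # fold-wise standard error, cf. (2.8) below
}

# Example usage with simulated data as in Example 2.1:
# d <- data.frame(x = runif(100, 0, 100)); d$y <- 1 + 0.5 * d$x + rnorm(100, sd = 10)
# cv_generalization_error(d, K = 5)
```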

Note that in general the resulting estimators have a different distribution. Consider for example the situation where the response variable $y$ is used for stratification and the data $\mathcal{L}$ contains two outliers, i.e. for all $i \in \{1, \ldots, n\} \setminus \{k_1, k_2\}$, $y_i < y_{k_1}$ and $y_i < y_{k_2}$. In stratified cross-validation, these outliers are assigned to two different sets $B_k$ with probability one. On the other hand, in classical cross-validation, the outliers may fall in the same subset with positive probability. Although it is difficult to assess the estimate (2.7) in terms of bias and variance, empirical results by Kohavi (1995) based on simulated as well as real data suggest that stratified cross-validation is a better scheme, both in terms of bias and variance, when compared to classical cross-validation.

For describing the uncertainty of the estimate of the generalization error, we consider the standard deviation of $\widehat{\mathrm{GE}}$. Define for $k = 1, \ldots, K$
\[
\mathrm{CV}_k := \frac{1}{|B_k|} \sum_{\{i : (x_i, y_i) \in B_k\}} \big(y_i - \hat{f}(x_i, \mathcal{L} \setminus B_k)\big)^2,
\]
the performance on the $k$-th set. Assuming that $|B_1| = \ldots = |B_K|$, i.e. the $B_k$ are equally sized, and using that the data points $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d., the $\mathrm{CV}_k$ are identically distributed as well. Unfortunately, they are in general not independent. However, assuming that the $\mathrm{CV}_k$ are uncorrelated, it follows for the standard deviation of $\widehat{\mathrm{GE}}$,
\[
\operatorname{sd}(\widehat{\mathrm{GE}}) = \sqrt{\operatorname{Var}\Bigg(\frac{1}{K} \sum_{k=1}^{K} \mathrm{CV}_k\Bigg)} = \frac{1}{K} \sqrt{\sum_{k=1}^{K} \operatorname{Var}(\mathrm{CV}_k)} = \frac{\operatorname{sd}(\mathrm{CV}_1)}{\sqrt{K}}.
\]
This motivates the estimator
\[
\widehat{\operatorname{sd}}(\widehat{\mathrm{GE}}) = \frac{\widehat{\operatorname{sd}}(\mathrm{CV}_1, \ldots, \mathrm{CV}_K)}{\sqrt{K}}, \tag{2.8}
\]
where $\widehat{\operatorname{sd}}$ denotes the sample standard deviation. For observations $z_1, \ldots, z_n$, the sample standard deviation is given by
\[
\widehat{\operatorname{sd}}(z_1, \ldots, z_n) = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (z_i - \hat{\mu})^2}, \quad \text{with } \hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} z_i.
\]

2.3.2 Bootstrap

Introduced by Efron (1979), the bootstrap is another general tool to assess the accuracy of a statistical method. Consider the situation where the data $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ are i.i.d. realizations of an unknown probability distribution $P$. Let $\hat{f}(x, \mathcal{L})$ denote a predictor obtained from the data $\mathcal{L}$. If we want to compute any statistics of $\hat{f}(x, \mathcal{L})$, we need to know its distribution. Deriving the exact distribution is typically impossible, unless $\hat{f}$ is a simple function and $P$ a convenient distribution. The idea of Efron (1979) is as follows: if we knew the true $P$, we could simulate the distribution of $\hat{f}$ with arbitrary accuracy. However, since we do not know the true underlying $P$, we use the empirical distribution $P_n$ instead. The empirical distribution $P_n$ gives probability mass $1/n$ to every data point $(x_i, y_i)$ for $i = 1, \ldots, n$. We can simulate i.i.d. data $\mathcal{L}^* = \{(x^*_1, y^*_1), \ldots, (x^*_n, y^*_n)\}$ from $P_n$ and compute the bootstrapped predictor $\hat{f}(x, \mathcal{L}^*)$. Repeating this many times approximates the distribution of $\hat{f}(x, \mathcal{L}^*)$, which we use as an estimate of the distribution of $\hat{f}(x, \mathcal{L})$. Algorithm 2.1 shows how bootstrapping can be used to estimate the generalization error. Note that the quality of the bootstrap approximation heavily depends on the quality and richness of the data $\mathcal{L}$.

Algorithm 2.1 Out-of-bag bootstrap algorithm to estimate the generalization error

for $b = 1, \ldots, B$ do

1. Generate a bootstrap sample $\mathcal{L}^*_b := \{(x^*_{1,b}, y^*_{1,b}), \ldots, (x^*_{n,b}, y^*_{n,b})\}$ by drawing with replacement from the data $\mathcal{L}$. Define the out-of-bootstrap sample
\[
\mathcal{L}^*_{b,\text{out}} = \bigcup_{i=1}^{n} \{(x_i, y_i) : (x_i, y_i) \notin \mathcal{L}^*_b\}
\]
of all observations not included in the bootstrap sample.

2. Compute the bootstrapped predictor $\hat{f}(x, \mathcal{L}^*_b)$.

end for

Calculate the generalization error of the bootstrap distribution by
\[
\mathrm{E}_{P_n}\Big[\big(Y^* - \hat{f}(X^*, \mathcal{L}^*)\big)^2\Big] \approx \frac{1}{B} \sum_{b=1}^{B} \frac{1}{|\mathcal{L}^*_{b,\text{out}}|} \sum_{\{i : (x_i, y_i) \in \mathcal{L}^*_{b,\text{out}}\}} \big(y_i - \hat{f}(x_i, \mathcal{L}^*_b)\big)^2, \tag{2.9}
\]
where $(X^*, Y^*) \sim P_n$ and we set $|\mathcal{L}^*_{b,\text{out}}| = 1$ if $\mathcal{L}^*_{b,\text{out}} = \emptyset$. Note that for $B \to \infty$, equation (2.9) becomes an equality. Furthermore, using the out-of-bootstrap sample for testing reduces over-optimism due to data points included in both training and testing.
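The following R sketch mirrors Algorithm 2.1 for a generic placeholder predictor (illustration only, not the thesis' implementation):

```r
# Minimal sketch of the out-of-bag bootstrap estimate (2.9); illustration only.
# 'data' is a data.frame with response y and covariates.
oob_bootstrap_ge <- function(data, B = 100) {
  n   <- nrow(data)
  err <- numeric(B)
  for (b in 1:B) {
    idx <- sample(n, replace = TRUE)           # bootstrap sample L*_b
    oob <- setdiff(1:n, idx)                   # out-of-bootstrap sample L*_{b,out}
    if (length(oob) == 0) { err[b] <- 0; next }  # convention |L*_{b,out}| = 1 if empty
    fit  <- lm(y ~ ., data = data[idx, ])      # placeholder predictor f^(., L*_b)
    pred <- predict(fit, newdata = data[oob, ])
    err[b] <- mean((data$y[oob] - pred)^2)
  }
  mean(err)                                    # approximation of (2.9)
}
```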

2.4 Maximum Likelihood Method

A common procedure to obtain a predictor from previously observed data is the minimization of the in-sample sum of squares $\sum_{i=1}^{n} (y_i - \hat{f}_n(x_i))^2$ or, equivalently, the mean squared error (2.5), see Example 2.1. However, if we knew more about the underlying distribution of the response variable, we might be able to derive a better minimization criterion. This idea leads to a generalization of the squared error loss function.

In this section, our goal is to find the model maximizing the probability that the observed data originates from that particular model. This principle is called the maximum likelihood (ML) method. Assume that we have independent data $y_1, \ldots, y_n$, where the $y_i$ are drawn from $Y_i$ described by density functions $g_{\theta_i}$ that depend on a common unknown parameter $\theta \in \mathbb{R}^k$ for some $k \in \mathbb{N}$. Due to independence, the joint density, called likelihood, of our observations is given by
\[
L(\theta) = \prod_{i=1}^{n} g_{\theta_i}(y_i),
\]
and the log-likelihood by
\[
\log L(\theta) = \sum_{i=1}^{n} \log g_{\theta_i}(y_i).
\]

In the sequel we will use the log-likelihood since the logarithm is an increasing transformation and


therefore does not change the location of the maxima, but it is more convenient to work with. The classical approach is to choose the parameter $\theta$ such that the probability of observing the data $y_1, \ldots, y_n$ is maximized, i.e.
\[
\hat{\theta} = \operatorname*{argmax}_{\theta \in \mathbb{R}^k} L(\theta) = \operatorname*{argmax}_{\theta \in \mathbb{R}^k} \log L(\theta), \tag{2.10}
\]
assuming that this maximum is well-defined.

Example 2.3. Let $N_1, \ldots, N_n$ be independent and $\mathrm{Poi}(\lambda v_i)$-distributed for given $v_i > 0$; then $\theta = \lambda$ and
\[
g_{\theta_i}(k) = \frac{(\lambda v_i)^k}{k!} e^{-\lambda v_i}.
\]
The maximum likelihood estimator of $\lambda$ is given by
\[
\hat{\lambda} = \frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i.
\]
Proof. Using the independence and Poisson assumption, we get the log-likelihood function
\[
\log L(\lambda) = \sum_{i=1}^{n} \log\left(\frac{(\lambda v_i)^{N_i}}{N_i!} \exp(-\lambda v_i)\right).
\]
Differentiating with respect to $\lambda$ and setting the derivative equal to zero gives
\[
\frac{\partial}{\partial \lambda} \log L(\lambda) = \sum_{i=1}^{n} \left(\frac{N_i}{\lambda} - v_i\right) = 0.
\]
Solving for $\lambda$ ends the proof. $\square$

?
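A quick numerical check of Example 2.3 (a sketch, not part of the thesis): the closed-form estimator $\hat{\lambda} = \sum_i N_i / \sum_i v_i$ coincides with the fit of an intercept-only Poisson GLM with log-exposure offset:

```r
# Sketch: numerical check of the Poisson MLE of Example 2.3; illustration only.
set.seed(3)
n      <- 1000
v      <- runif(n, 0.1, 1)            # exposures v_i > 0
lambda <- 0.1                         # true frequency
N      <- rpois(n, lambda * v)        # N_i ~ Poi(lambda * v_i)

lambda_hat <- sum(N) / sum(v)         # closed-form MLE from Example 2.3

# Same estimate via an intercept-only Poisson GLM with offset log(v):
fit <- glm(N ~ 1 + offset(log(v)), family = poisson())
c(closed_form = lambda_hat, glm = exp(coef(fit)[[1]]))
```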

In order to apply this methodology to the regression problem (2.1), consider the conditional response $Y \mid X = x$. Assume there exists a density function $g_{\theta(x)}$ depending on an unknown parameter function $\theta : \mathbb{R}^p \to \mathbb{R}^k$ of the covariates $x$ with $Y \mid X = x \sim g_{\theta(x)}$. Given i.i.d. observations $\mathcal{L} = \{(x_1, y_1), \ldots, (x_n, y_n)\}$ drawn from $(X, Y)$ following the structure of (2.1), it follows that the $y_i$ are independently drawn from $Y \mid X = x_i$. Therefore the likelihood of the observations $y_1, \ldots, y_n$ for given covariates $x_1, \ldots, x_n$ is given by
\[
L(\theta) = \prod_{i=1}^{n} g_{\theta(x_i)}(y_i).
\]
Analogously to equation (2.10), we try to find $\hat{\theta}(\cdot)$ such that
\[
\hat{\theta} = \operatorname*{argmax}_{\theta \in \mathcal{A}} L(\theta) = \operatorname*{argmax}_{\theta \in \mathcal{A}} \log L(\theta), \tag{2.11}
\]
where $\mathcal{A} \subset C(\mathbb{R}^p, \mathbb{R}^k)$ denotes a subset of all admissible estimators for $\theta(\cdot)$ and $C(\mathbb{R}^p, \mathbb{R}^k)$ the set of all functions from $\mathbb{R}^p$ to $\mathbb{R}^k$. In the regression problem, we are not primarily interested in the function $\theta(\cdot)$ but rather in $f(\cdot)$. Since $Y \mid X = x \sim g_{\theta(x)}$, we can write the expectation of $Y \mid X = x$ as a function of $\theta$,
\[
f(x) = \mathrm{E}(Y \mid X = x) = \int y \, g_{\theta(x)}(y) \, dy =: h(\theta(x)),
\]
where $h : \mathbb{R}^k \to \mathbb{R}$ maps the unknown parameter $\theta(x)$ to the expectation of $Y \mid X = x$.

Example 2.4. We extend Example 2.3 by assuming that, for given $v > 0$, $N \mid X = x \sim \mathrm{Poi}(\lambda(x) v)$, and define the response
\[
Y = \frac{N}{v} = f(X) + \varepsilon.
\]
Then the conditional response is given by
\[
Y \mid X = x = \frac{N}{v} \Big|_{X = x} = f(x) + \varepsilon,
\]
where the law of $\varepsilon$ is determined by $N$. Therefore, the parameter function $\theta(\cdot)$ is given by $\theta(x) = \lambda(x) v$ and the regression function $f(\cdot)$ by
\[
f(x) = \mathrm{E}(Y \mid X = x) = \frac{1}{v} \mathrm{E}(N \mid X = x) = \frac{1}{v} \theta(x).
\]

?

Denote by $\ell : \mathbb{R}^n \to \mathbb{R}$ the log-likelihood as a function of the expected values of the conditional responses $Y_i = Y \mid X = x_i$ for $i = 1, \ldots, n$, given by
\[
\ell(\boldsymbol{\mu}) := \log L(f(x_1), \ldots, f(x_n)) = \log L(h(\theta(x_1)), \ldots, h(\theta(x_n))),
\]
where $\boldsymbol{\mu} := (\mathrm{E}(Y_1), \ldots, \mathrm{E}(Y_n)) = (f(x_1), \ldots, f(x_n))$. Then we can rewrite the optimization problem (2.11) with respect to the unknown function $f(\cdot)$ as
\[
\hat{f} = \operatorname*{argmax}_{f \in \mathcal{A}} \ell(\boldsymbol{\mu}) = \operatorname*{argmax}_{f \in \mathcal{A}} \ell(f(x_1), \ldots, f(x_n)), \tag{2.12}
\]
where $\mathcal{A}$ denotes a set of admissible estimators for $f(\cdot)$. Alternatively to maximizing $\ell(\boldsymbol{\mu})$, it is equivalent to minimize the deviance statistics
\[
D(\boldsymbol{y}, \boldsymbol{\mu}) := 2 \big(\ell(\boldsymbol{y}) - \ell(\boldsymbol{\mu})\big),
\]
where $\boldsymbol{y} = (y_1, \ldots, y_n)$ and $\boldsymbol{\mu} := (f(x_1), \ldots, f(x_n))$. Here $\ell(\boldsymbol{y})$ denotes the log-likelihood of the saturated model, where we choose an own parameter for every observation.

Remark 2.5. Note that the set A is usually determined by the method considered to fit a model,

e.g. set of all tree predictors (see Chapter 3). J

Example 2.6. Assume that the data $(x_i, y_i)$ for $i = 1, \ldots, n$ are drawn from the regression model (2.1) with conditional response
\[
Y \mid X = x = f(x) + \varepsilon \sim \mathcal{N}(f(x), 1).
\]
Therefore, given $x_i$, the $y_i$ are independent and $\mathcal{N}(f(x_i), 1)$-distributed for $i = 1, \ldots, n$. The log-likelihood as a function of $\boldsymbol{\mu} = (f(x_1), \ldots, f(x_n))$ for the observations $y_1, \ldots, y_n$ is given by
\[
\ell(\boldsymbol{\mu}) = \sum_{i=1}^{n} \log\left(\frac{1}{\sqrt{2\pi}} \exp\left\{-\frac{1}{2}\big(y_i - f(x_i)\big)^2\right\}\right) = c(n) - \frac{1}{2} \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2,
\]
where $c(n)$ is a constant with respect to $\boldsymbol{\mu}$. Hence the deviance statistics is given by
\[
D(\boldsymbol{y}, \boldsymbol{\mu}) = \sum_{i=1}^{n} \big(y_i - f(x_i)\big)^2.
\]

?

In Example 2.6 the deviance statistics is proportional to the mean squared error, implying that for a normally distributed response $Y$, minimizing the mean squared error is equivalent to minimizing the deviance statistics and hence to obtaining the maximum likelihood estimator. In general, maximum likelihood estimation provides a new procedure to obtain an optimal estimate for $f$. However, the method requires additional knowledge about the distribution of the response such that the deviance statistics can be calculated. In the context of insurance data, the Poisson distribution is a common model for describing insurance claims counts. In Section 3.2.1, we will use the deviance statistics derived from the Poisson distribution to improve the optimization criterion of the tree algorithm, which we will study in the next chapter.


Chapter 3

Tree-Based Models

The considerations made in the previous section are fully general in the sense that they can be

applied to any set A of attainable predictors. However, the No Free Lunch Theorem states that

there is not one model that works best for every problem. In this thesis, we explore so-called tree

models as the set of attainable predictors. In this chapter, we introduce the concept of tree-based

models by describing the Classification and Regression Tree (CART) algorithm, which is the most

popular method to construct tree predictors. We then discuss extensions of the tree model such as

Bagging and Random Forest. The literature used in this chapter is largely based on Hastie et al.

(2001), Breiman et al. (1984), Breiman (1996), Breiman (2001) as well as Buhlmann and Machler

(2014).

3.1 Basics of Tree-Based Models

Let us assume that the data consists of $p$ predictor variables $X \in \mathbb{R}^p$ and a response $Y$. Suppose we are given $n$ i.i.d. observations $(x_i, y_i)$ for $i = 1, \ldots, n$ drawn from $(X, Y)$. The main idea of a tree-based model is to partition the covariate space $R$ (assume for simplicity that $R = \mathbb{R}^p$; a treatment of categorical covariates can be found in Section 3.3.1) into a set of regions $R_1, \ldots, R_M$ using recursive binary splitting. That is, we start with the whole predictor space $R$ and choose a splitting variable $j$ and split-point $s \in \mathbb{R}$ such that $R = R_1(j, s) \cup R_2(j, s)$ with
\begin{align*}
R_1(j, s) &= \{x \in \mathbb{R}^p : x_j \le s\} = \mathbb{R} \times \mathbb{R} \times \cdots \times (-\infty, s] \times \mathbb{R} \times \cdots \times \mathbb{R}, \\
R_2(j, s) &= \{x \in \mathbb{R}^p : x_j > s\} = \mathbb{R} \times \mathbb{R} \times \cdots \times (s, \infty) \times \mathbb{R} \times \cdots \times \mathbb{R}.
\end{align*}
Then one or both regions are again split into two more regions. The process is repeated until a stopping rule is applied. Hence, we end up with a partition $\mathcal{R} = \{R_1, \ldots, R_M\}$ of $R$. A crucial implication of the recursive binary partitioning is the fact that one can identify the partition $\mathcal{R}$ with a binary tree $T$, see Section 3.1.1 as well as Figure 3.1 for an illustrative example. The tree-based model is then given by the piecewise constant function
\[
\hat{f}(x) = \sum_{m=1}^{M} \hat{c}_m \, 1_{\{x \in R_m\}}. \tag{3.1}
\]
It remains to give an estimator $\hat{c}_m$ for $c_m$ and an algorithm to find the splitting points. In Section 3.1.2 we describe the classical algorithm based on minimizing the sum of squares. In Section 3.2 we discuss how this can be adapted to classification problems or to the use of the deviance statistics as a measure of goodness.

[Figure 3.1 – A tree-based predictor fit to the response variable claims frequency (number of claims per year) using the predictor variables age of driver and price of car. The left panel shows the partition of the two-dimensional covariate space obtained by recursive binary splitting (splits: price < 18550, age < 26.5, price < 58750). The blue dots show the covariates $(x_{i1}, x_{i2})_{i=1,\ldots,n}$ and the numbers within each region represent the fitted constants $\hat{c}_m$ (0.04594, 0.14120, 0.08202, 0.11690). The right panel shows the corresponding tree representation.]

3.1.1 Tree Terminology

In Figure 3.1 we have identified the partition $\mathcal{R}$ obtained by binary partitioning with a binary tree. Before going into details on the tree growing procedure, we give a formal definition of a tree $T$ and introduce the common terminology used in the context of tree models.

Definition 3.1 (Node). A node $t$ is a triple $(N_t, j_t, s_t)$ with $N_t \subset \{1, \ldots, n\}$, $j_t \in \{1, \ldots, p\}$ and $s_t \in \mathbb{R}$.

Therefore each node consists of a splitting variable $j_t$ and a split-point $s_t$ as well as a subset $N_t$ of all observations. Given a node $t$, we can define the function left, mapping
\[
t \mapsto \operatorname{left}(t) = \{i \in N_t : x_{i j_t} \le s_t\} \subset N_t,
\]
as well as the function right, mapping
\[
t \mapsto \operatorname{right}(t) = \{i \in N_t : x_{i j_t} > s_t\} = N_t \setminus \operatorname{left}(t) \subset N_t.
\]
The functions left and right define the two regions into which the observations at node $t$ are split. The node containing all observations $i = 1, \ldots, n$ will be of special importance and hence is given a separate name.

Definition 3.2 (Root). A node $t$ is called root if $N_t = \{1, \ldots, n\}$.

Finally, the index sets of the regions $R_m \in \mathcal{R} = \{R_1, \ldots, R_M\}$ are called leaves in the tree terminology.

Definition 3.3 (Leaf). A leaf $L$ is a subset of $\{1, \ldots, n\}$.

Based on the definitions of a root, nodes and leaves, we can define a tree as a set of nodes containing a root node and leaves.

Definition 3.4 (Tree). A set of nodes and leaves $T = \{t_1, \ldots, t_K, L_1, \ldots, L_M\}$ is called a tree if

1. there exists a root node $t \in \{t_1, \ldots, t_K\}$, write $t = \operatorname{root}(T)$,

2. for every non-root node $t \in \{t_1, \ldots, t_K\} \setminus \{\operatorname{root}(T)\}$ there exists a node $s \in \{t_1, \ldots, t_K\}$ such that $N_t = \operatorname{left}(s)$ or $N_t = \operatorname{right}(s)$,

3. for every leaf $L \in \{L_1, \ldots, L_M\}$ there exists a node $s \in \{t_1, \ldots, t_K\}$ such that $L = \operatorname{left}(s)$ or $L = \operatorname{right}(s)$.
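To relate Definitions 3.1-3.4 to code, the following small R sketch (purely illustrative, not the internal representation used by rpart) stores a node as the triple $(N_t, j_t, s_t)$ and implements the left/right index mappings:

```r
# Sketch: a node t = (N_t, j_t, s_t) as a list, with left()/right() as in Def. 3.1.
# x is the n x p covariate matrix; purely illustrative, not how rpart stores trees.
make_node <- function(N_t, j_t, s_t) list(N = N_t, j = j_t, s = s_t)

left  <- function(t, x) t$N[x[t$N, t$j] <= t$s]   # {i in N_t : x_{i j_t} <= s_t}
right <- function(t, x) setdiff(t$N, left(t, x))  # N_t \ left(t)

# Example: a root node splitting all n = 10 observations on variable j = 1 at s = 0.5.
set.seed(4)
x    <- matrix(runif(20), nrow = 10, ncol = 2)
root <- make_node(N_t = 1:10, j_t = 1, s_t = 0.5)
left(root, x); right(root, x)
```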

3.1.2 Growing a Regression Tree

Having introduced the common terminology, we next focus on how to obtain the partition $\mathcal{R}$. For regression, the most common minimization criterion is the sum of squares $\sum_{i=1}^{n} (y_i - \hat{f}_n(x_i))^2$, which is equivalent to minimizing the mean squared error (2.5). Given a partition $\mathcal{R}$, the estimate $\hat{c}_m$ minimizing the sum of squares in each region $R_m$ is given by
\[
\hat{c}_m = \frac{1}{n_m} \sum_{\{i : x_i \in R_m\}} y_i, \tag{3.2}
\]
where $n_m := |\{i : x_i \in R_m\}|$ is the number of observations in region $R_m$ (see the proof in Appendix A.1). Unfortunately, finding the best binary partition $\mathcal{R}$ which minimizes the overall sum of squares is computationally infeasible; therefore we are going to describe a greedy procedure, called the CART algorithm:

Algorithm 3.1 Classification and Regression Tree (CART) Algorithm

Step 1 - Initialization: Initialize the pairs of half-planes
\[
R_1(j, s) = \{x : x_j \le s\} \quad \text{and} \quad R_2(j, s) = \{x : x_j > s\}.
\]
Find the splitting variable $j$ and split-point $s$ of these half-planes such that
\[
\min_{j, s} \Bigg[ \min_{c_1} \sum_{\{i : x_i \in R_1(j, s)\}} (y_i - c_1)^2 + \min_{c_2} \sum_{\{i : x_i \in R_2(j, s)\}} (y_i - c_2)^2 \Bigg]. \tag{3.3}
\]
Fortunately, the inner minimizations are simply given by
\[
\hat{c}_1 = \frac{1}{|\{i : x_i \in R_1(j, s)\}|} \sum_{\{i : x_i \in R_1(j, s)\}} y_i \quad \text{and} \quad \hat{c}_2 = \frac{1}{|\{i : x_i \in R_2(j, s)\}|} \sum_{\{i : x_i \in R_2(j, s)\}} y_i.
\]
Hence the minimization problem (3.3) can be solved very quickly, leading to a partition $\mathcal{R} = R_1 \cup R_2$ with regions $R_1 = R_1(\hat{j}, \hat{s})$ and $R_2 = R_2(\hat{j}, \hat{s})$, where $(\hat{j}, \hat{s})$ denotes the argmin of equation (3.3).

Step 2 - Recursion: To get the next splitting variable as well as split-point, repeat the above procedure on each region contained in $\mathcal{R}$, resulting in $|\mathcal{R}|$ different partitions $\mathcal{R}^k$ for $k = 1, \ldots, |\mathcal{R}|$. In order to get one new region, choose the one that minimizes the overall sum of squares
\[
K^+ = \operatorname*{argmin}_{k \in \{1, \ldots, |\mathcal{R}|\}} \sum_{i=1}^{n} \big(y_i - \hat{f}^{\,k}_n(x_i)\big)^2,
\]
where $\hat{f}^{\,k}_n$ denotes the tree predictor based on the partition $\mathcal{R}^k$. Update the partition $\mathcal{R} = \mathcal{R}^{K^+}$.

Step 3 - Stop: Repeat Step 2 until a stopping rule is fulfilled, see Remark 3.9.

A tree predictor obtained by the above procedure will be denoted an anova regression tree. In R, the package rpart provides an efficient implementation, see Section 5.1. A generalization of Algorithm 3.1 to classification can be found in Section 3.2.3.
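As a concrete illustration of Algorithm 3.1 (a sketch only; the thesis' own fitting code is discussed in Section 5.1), an anova regression tree can be grown and inspected with rpart roughly as follows. The data set and variable names (freq, age, price) are hypothetical:

```r
# Sketch: growing an anova regression tree with rpart; the data and variable
# names (freq, age, price) are hypothetical, for illustration only.
library(rpart)

set.seed(5)
d <- data.frame(
  age   = sample(18:80, 500, replace = TRUE),
  price = runif(500, 5000, 80000)
)
d$freq <- rpois(500, 0.05 + 0.10 * (d$age < 27))   # toy claims count response

tree <- rpart(freq ~ age + price,
              data    = d,
              method  = "anova",                   # squared error splitting, Algorithm 3.1
              control = rpart.control(minsplit = 20, cp = 0.01))

print(tree)                      # splits, node sizes and fitted constants c_m per leaf
plot(tree); text(tree, use.n = TRUE)
```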

3.1.3 Cost-Complexity Pruning

Repeating Step 2 of Algorithm 3.1 may produce a very large tree. In an extreme case, we have as many regions as observations, i.e. $M = n$, and the (in-sample) mean squared error is equal to zero. However, it is highly likely that the predictor overfits the observations, implying poor performance for predicting unseen data. On the other hand, if the tree size is too small, we may not be able to capture the true underlying structure. Hence the tree size is the crucial tuning parameter, which needs to be chosen from the data. In order to control the tree size, we describe cost-complexity pruning. Assume that we grew a large tree $T_{\max}$ such that no further splits were possible. We call $T \subset T_{\max}$ a subtree if it can be obtained by pruning $T_{\max}$, i.e. collapsing any number of nodes.

Definition 3.5 (Subtree). Let $T_1 = \{t^1_1, \ldots, t^1_{K_1}, L^1_1, \ldots, L^1_{M_1}\}$ and $T_2 = \{t^2_1, \ldots, t^2_{K_2}, L^2_1, \ldots, L^2_{M_2}\}$ be trees. Then $T_1$ is a subtree of $T_2$, write $T_1 \subset T_2$, if

1. $\operatorname{root}(T_1) = \operatorname{root}(T_2)$,

2. $\{t^1_1, \ldots, t^1_{K_1}\} \subset \{t^2_1, \ldots, t^2_{K_2}\}$.

See Figure 3.2 for an illustrative example. Let $|T| = M$ denote the number of leaves in $T$. For given $\alpha \ge 0$, we define the cost-complexity criterion
\[
C_\alpha(T) = R(T) + \alpha |T|, \tag{3.4}
\]
where
\[
R(T) = \sum_{m=1}^{|T|} \sum_{\{i : x_i \in R_m\}} (y_i - \hat{c}_m)^2
\]
is the total sum of squares of the tree $T$. The goal is to find the subtree $T(\alpha) \subset T_{\max}$ that minimizes $C_\alpha(T)$. The tuning parameter $\alpha$ controls how strongly we penalize large trees. For $\alpha = 0$, we obtain $T(\alpha) = T_{\max}$, whereas if $\alpha \to \infty$, the resulting subtree consists only of the root, which fits the overall mean
\[
\hat{f}_n(x) = \frac{1}{n} \sum_{i=1}^{n} y_i
\]
to all observations. Therefore minimizing $C_\alpha(T)$ leads to a subtree that can be seen as an optimal trade-off between a good fit and keeping the tree as small as possible.

Definition 3.6 (Optimally Pruned Subtree). Given $\alpha \ge 0$, a subtree $T(\alpha)$ of $T$ is called optimally pruned subtree if the following holds true:

1. $C_\alpha(T(\alpha)) = \min_{T' \subset T} C_\alpha(T')$,

2. if $C_\alpha(T') = C_\alpha(T(\alpha))$, then $T(\alpha) \subset T'$.

By definition, if such a subtree exists, it is unique, since we choose it to be the smallest one. The question arising is whether or not such an optimally pruned subtree exists. Note that there are only finitely many subtrees of $T$, hence we can always find at least one which fulfills the first condition. What if there exist two minimizers $T_1$ and $T_2$ of (3.4) such that neither is a subtree of the other? We show in the next proposition that this situation will not occur, i.e. we can establish both uniqueness and existence of the optimally pruned subtree.

Proposition 3.7. Let $T$ be a tree. For every $\alpha \ge 0$, there exists a unique optimally pruned subtree $T(\alpha)$. Additionally, if $\alpha_1 \le \alpha_2$, it holds that $T(\alpha_2) \subset T(\alpha_1)$.

The proof is based on induction and can be found in Appendix A.2. It remains to give an efficient algorithm to determine the optimally pruned subtrees. Let $\#T$ be the number of subtrees of $T$. If $T$ is a root tree, then $\#T = 1$. Otherwise, we have
\[
\#T = \#T_L \cdot \#T_R + 1,
\]
where $T_L$ and $T_R$ are the two primary branches of $T$, see Figure A.1. This indicates that the maximum number of subtrees grows rapidly in the number of leaves. However, there is no need to search through all possible subtrees to find $T(\alpha)$. When increasing $\alpha$, the optimal subtree $T(\alpha)$ becomes smaller until it only consists of the root tree, see Figure 3.3. Since there are only finitely many subtrees, we obtain an increasing sequence $\alpha_k$ such that for $\alpha_k \le \alpha < \alpha_{k+1}$, $T(\alpha) = T(\alpha_k)$. By Proposition 3.7, we know that for $\alpha_k \le \alpha < \alpha_{k+1}$, $T(\alpha_{k+1}) \subset T(\alpha_k) = T(\alpha)$.

[Figure 3.2 – The figure shows the tree $T_1$ and all its subtrees. The response variable predicted by $T_1$ is the claims frequency based on the covariates nationality group (nat gr) and age of driver (age). The dotted arrows correspond to the pruning process described in Figure 3.3.]

In order to find the sequence $\alpha_k$ with the corresponding optimally pruned subtrees, we start with $\alpha_1 = 0$ and, by construction, $T(\alpha_1) = T_{\max}$. To find $\alpha_2$ and $T(\alpha_2)$, we denote for any node $t \in T_{\max}$ by $T_t$ the subtree with root $t$ and by $\{t\}$ the tree consisting of the single node $t$. As long as
\[
C_\alpha(T_t) < C_\alpha(\{t\}) \tag{3.5}
\]
the branch $T_t$ has a smaller cost-complexity than the single node $\{t\}$. However, solving the inequality for $\alpha$ leads to a critical value
\[
\alpha < \frac{R(\{t\}) - R(T_t)}{|T_t| - 1} =: \bar{\alpha},
\]
at which the inequality (3.5) becomes an equality and therefore $\{t\}$ is preferable. Next we define the function $g(t)$ by
\[
g(t) := \begin{cases} \dfrac{R(\{t\}) - R(T_t)}{|T_t| - 1} & \text{if } t \in T_{\max}, \\[1.5ex] +\infty & \text{else}, \end{cases}
\]
where $T_{\max}$ denotes the set of all nodes. Then the weakest link $\bar{t}$ of $T_{\max}$ is given by
\[
\bar{t} = \operatorname*{argmin}_{t \in T_{\max}} g(t),
\]
meaning that when increasing $\alpha$, $\bar{t}$ is the first node such that $C_\alpha(\{\bar{t}\}) = C_\alpha(T_{\bar{t}})$. Define $\alpha_2 := g(\bar{t}) = \min_{t \in T_{\max}} g(t)$ and $\bar{T}$ the tree obtained by cutting away the branch $T_{\bar{t}}$. Then $\bar{T}$ minimizes $C_{\alpha_2}(T)$ over all subtrees of $T_{\max}$. Hence, by uniqueness, we have that $\bar{T} = T(\alpha_2)$. Since we know that $T(\alpha_3) \subset T(\alpha_2)$, we can simply set $T_{\max} = T(\alpha_2)$ and repeat the above procedure to obtain $\alpha_3$ as well as $T(\alpha_3)$. Therefore the problem of finding the optimally pruned subtree can be implemented as a computationally rapid recursive pruning. Figure 3.3 shows the cost-complexity criteria for the subtrees shown in Figure 3.2. The slopes represent the tree sizes and the y-intercepts the measure of goodness of the respective trees. It illustrates how the optimally pruned subtree becomes smaller when increasing $\alpha$. It remains to determine the optimal choice of $\alpha$. An estimate is usually obtained by cross-validation, see Section 2.3.1 and Section 5.1.

[Figure 3.3 – Each line shows the cost-complexity criterion $C_\alpha(T)$ for a subtree $T$ of $T_1$ of Figure 3.2 (in this example $\alpha_1 = 0$, $\alpha_2 = 6$, $\alpha_3 = 10$, $\alpha_4 = 11.75$, $\alpha_5 = 20.5$). The plot illustrates the pruning process as a function of $\alpha$. For a fixed $\alpha$ there always exists a unique subtree minimizing $C_\alpha(T)$. On the x-axis we see the sequence $\alpha_k$ of cost-complexity parameters. Note that $T_2$ and $T_3$ have the same slope, i.e. the same number of leaves. Since $R(T_2) < R(T_3)$, we always prefer $T_2$ over $T_3$. In the figure, $C_\alpha(T_3)$ is shown by the dotted line.]

Remark 3.8. In the rpart package the following scaled version of the cost-complexity criterion (3.4) is used,
\[
C_{cp}(T) = R(T) + cp \, R(T(\infty)) \, |T|, \quad \text{with} \quad cp := \frac{\alpha}{R(T(\infty))}.
\]
This has the advantage that for $cp = 1$, the optimally pruned tree is given by the root tree. To show this, let $T$ be a tree. Then the root tree $T(\infty)$ is preferred over $T$ for all $\alpha \ge 0$ with
\[
C_\alpha(T(\infty)) \le C_\alpha(T),
\]
which is equivalent to
\[
\alpha \ge \frac{R(T(\infty)) - R(T)}{|T| - 1}.
\]
Since this holds for any tree $T$ with $R(T) \ge 0$ and $|T| \ge 2$, we can take the maximum over all $T$ on both sides, which is attained at $R(T) = 0$ and $|T| = 2$. Hence we get that $\alpha \ge R(T(\infty))$, which is equivalent to $cp \ge 1$. Choosing $cp = 0$ is equivalent to $\alpha = 0$, hence the optimally pruned subtree coincides with the fully grown tree. J
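In rpart, this parametrization is exposed through the cp argument, and the pruning sequence together with cross-validated errors through printcp. A short sketch, reusing the hypothetical tree example from Algorithm 3.1 above:

```r
# Sketch: cost-complexity pruning with rpart, continuing the toy example above.
library(rpart)

big_tree <- rpart(freq ~ age + price, data = d, method = "anova",
                  control = rpart.control(cp = 0, minsplit = 20, xval = 10))

printcp(big_tree)   # table of cp values (alpha_k scaled by R(T(infinity))) with
                    # cross-validated errors (xerror) and their standard errors

# choose the cp with the smallest cross-validated error and prune the tree
best_cp <- big_tree$cptable[which.min(big_tree$cptable[, "xerror"]), "CP"]
pruned  <- prune(big_tree, cp = best_cp)
```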

Remark 3.9. In practice it may take a large amount of computing time to fully grow a tree until no further split is possible. Therefore, let $T$ be the current tree. Denote by $T(\alpha)$ the largest optimally pruned subtree of $T$ with $|T| > |T(\alpha)|$, i.e. a strict subtree. Let $\bar{\alpha} > \alpha$ be such that $T(\bar{\alpha}) \subset T(\alpha)$ denotes the next (strictly) smaller optimally pruned subtree. We stop the growing procedure if the improvement of fit from the optimally pruned subtree $T(\bar{\alpha})$ to the next larger optimally pruned subtree $T(\alpha)$ is smaller than a user-specified constant $c \in [0, 1]$. According to the cost-complexity pruning procedure, $T(\bar{\alpha})$ is preferred over $T(\alpha)$ for all $\alpha' \ge 0$ such that
\[
C_{\alpha'}(T(\bar{\alpha})) \le C_{\alpha'}(T(\alpha)).
\]
This is equivalent to
\[
\alpha' \ge \frac{R(T(\bar{\alpha})) - R(T(\alpha))}{|T(\alpha)| - |T(\bar{\alpha})|},
\]
or, using the parametrization of the rpart package,
\[
cp \ge \frac{1}{|T(\alpha)| - |T(\bar{\alpha})|} \left( \frac{R(T(\bar{\alpha}))}{R(T(\infty))} - \frac{R(T(\alpha))}{R(T(\infty))} \right). \tag{3.6}
\]
The rpart implementation contains the option to bound the right-hand side of (3.6) by the constant $c$, therefore imposing a condition on the maximum tree size. In combination with Remark 3.8, we see that choosing $c = 0$ results in growing a tree as large as possible, whereas $c = 1$ only grows the root tree. Also note that when using the squared error loss (as in Algorithm 3.1) the right-hand side of equation (3.6) is equal to the improvement in the coefficient of determination $R^2$ scaled by the difference in the number of leaves (see Chapter 7.2.2 in Wüthrich (2015)). J

Remark 3.10. We assumed that $C_\alpha(T) = R(T) + \alpha |T|$, where $R(T) = \sum_{m=1}^{|T|} \sum_{\{i : x_i \in R_m\}} (y_i - \hat{c}_m)^2$. However, we never used the explicit form of $R(T)$. Therefore, the results on cost-complexity pruning remain valid in more generality for any measure of goodness $R(T)$. J

3.2 Modifications

In this section, we discuss several extensions of Algorithm 3.1. The modifications are chosen in view of the insurance data set we analyze in Part II of this thesis. Some of the proposed extensions were originally used in cancer research to model survival data of cancer patients. The material of this section is based on LeBlanc and Crowley (1991).

3.2.1 Poisson Regression Tree

Based on the idea of Section 2.4, we generalize Algorithm 3.1 to measures of goodness derived from the deviance statistics. In this section, we describe the modification for a Poisson-distributed count variable $N$. Given $v > 0$, we define the response, often called frequency,
\[
Y = \frac{N}{v} = f(X) + \varepsilon,
\]
with $N \mid X = x \sim \mathrm{Poi}(\lambda(x) v)$. For i.i.d. observations $(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)$ with $v_i > 0$, the deviance statistics associated to the Poisson-distributed counts $N_i$ is given by
\[
2 \sum_{i=1}^{n} \left( N_i \log\left(\frac{N_i}{\lambda(x_i) v_i}\right) - \big(N_i - \lambda(x_i) v_i\big) \right), \tag{3.7}
\]
where the function $N_i \mapsto N_i \log N_i$ can be continuously extended to $\mathbb{N}_0$ by setting
\[
N_i \mapsto \begin{cases} N_i \log N_i & \text{if } N_i > 0, \\ 0 & \text{if } N_i = 0. \end{cases}
\]
This extension will be used throughout this text. See Appendix A.3 for the proof of equation (3.7).

Assuming that the data originates from a Poisson-distributed count variable $N$, we can adapt the minimization criterion in Algorithm 3.1 to find a splitting variable $j$ and a split-point $s$ such that
\begin{align*}
\min_{j, s} \Bigg[ &\min_{\lambda_1} \sum_{\{i : x_i \in R_1(j, s)\}} \left( N_i \log\left(\frac{N_i}{\lambda_1 v_i}\right) - (N_i - \lambda_1 v_i) \right) \\
+ &\min_{\lambda_2} \sum_{\{i : x_i \in R_2(j, s)\}} \left( N_i \log\left(\frac{N_i}{\lambda_2 v_i}\right) - (N_i - \lambda_2 v_i) \right) \Bigg],
\end{align*}
where the inner minimizations are given by
\[
\hat{\lambda}_1 = \frac{1}{\sum_{\{i : x_i \in R_1(j, s)\}} v_i} \sum_{\{i : x_i \in R_1(j, s)\}} N_i, \quad \text{resp.} \quad \hat{\lambda}_2 = \frac{1}{\sum_{\{i : x_i \in R_2(j, s)\}} v_i} \sum_{\{i : x_i \in R_2(j, s)\}} N_i. \tag{3.8}
\]

Equation (3.8) follows by Example 2.3, since minimizing the deviance statistics is equivalent to maximizing the likelihood. The procedure leads to a maximum likelihood model for each leaf of the tree. However, we need to be careful when using the minimizer (3.8), since it can happen that either $\hat\lambda_1 = 0$ or $\hat\lambda_2 = 0$, e.g. if all $N_i$ are zero within the chosen group. In this case, the Poisson model is not well defined (degenerate). A possible solution is the use of a Bayesian estimator for $\lambda$. The idea is to assume the data originates from a Poisson-Gamma model. The following results on Poisson-Gamma models are based on Chapter 8 of Wuthrich (2015).

Definition 3.11 (Poisson-Gamma model). Given fixed values $v_i > 0$ for $i = 1, \ldots, m$, assume that:

1. Conditionally, given $\Lambda$, the components of $N = (N_1, \ldots, N_m)$ are independent with $N_i \sim \text{Poi}(\Lambda v_i)$.

2. $\Lambda \sim \Gamma(\gamma, c)$ with prior parameters $\gamma > 0$ and $c > 0$.

We denote by $\lambda_0 = E(\Lambda) = \gamma/c$ the a priori expected frequency. However, having observed $N_1, \ldots, N_m$, the aim is to determine the posterior estimate for the expectation of $\Lambda$ given by $\lambda^{\text{post}} := E(\Lambda \mid N_1, \ldots, N_m)$.

Proposition 3.12. Assume $N = (N_1, \ldots, N_m)$ follows a Poisson-Gamma model of Definition 3.11. Then the posterior estimator $\lambda^{\text{post}}$ has the following form

\lambda^{\text{post}} = \alpha \hat\lambda + (1-\alpha)\lambda_0,

with credibility weight $\alpha$ and observation based estimator $\hat\lambda$ given by

\alpha = \frac{\sum_{i=1}^m v_i}{c + \sum_{i=1}^m v_i} \quad \text{and} \quad \hat\lambda = \frac{1}{\sum_{i=1}^m v_i}\sum_{i=1}^m N_i.

A proof for Proposition 3.12 can be found in Chapter 8 of Wuthrich (2015). Assume that the data $N_l = (N_1, \ldots, N_{m_l})$ in each node l, where $m_l$ denotes the number of observations in node l, is described by a Poisson-Gamma model. Setting $\tau^2 := \text{Var}(\Lambda) = \gamma/c^2$, the coefficient of variation is given by $\text{Vco}(\Lambda) = \tau/\lambda_0 = 1/\sqrt{\gamma} =: k$. We can write the prior distribution parameter c as

c = \sqrt{\frac{\gamma}{\tau^2}} = \frac{1}{k\tau} = \frac{1}{k^2 \lambda_0}.

Denote by

\hat\lambda_0 = \frac{1}{\sum_{i=1}^{n} v_i}\sum_{i=1}^{n} N_i

the overall maximum likelihood estimator (over all observations) as an estimate for the prior mean $\lambda_0$. Using Proposition 3.12, we get the estimator

\hat\lambda^{\text{post}}_l = \alpha_l \hat\lambda_l + (1 - \alpha_l)\hat\lambda_0,    (3.9)

where the weights $\alpha_l$ can be written as

\alpha_l = \frac{\sum_{i=1}^{m_l} v_i}{\frac{1}{k^2 \hat\lambda_0} + \sum_{i=1}^{m_l} v_i}.    (3.10)

Comparing the estimators (3.8) and (3.9), we see that the latter is a credibility weighted average of the overall average and the Poisson MLE estimator. Note that estimator (3.9) is non-optimal in a credibility context. One could instead use the homogeneous credibility estimator, where we would choose

\hat\lambda_0 = \frac{1}{\sum_{l=1}^{M} \alpha_l} \sum_{l=1}^{M} \alpha_l \hat\lambda_l,

see Buhlmann and Gisler (2005).
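The posterior estimator (3.9)-(3.10) is simple enough to be computed directly. The following R sketch shrinks the leaf MLE towards the prior mean; the leaf data, the prior frequency lambda0 and the coefficient of variation k are illustrative assumptions, not values taken from the data set.

```r
# Observations in one leaf: claim counts N_i and volumes (years at risk) v_i
N <- c(0, 1, 0, 0, 2)
v <- c(1, 1, 0.5, 1, 1)

lambda0 <- 0.10   # assumed prior expected frequency (overall estimate)
k       <- 1      # assumed coefficient of variation Vco(Lambda) of the prior

lambda_leaf <- sum(N) / sum(v)                           # leaf MLE, cf. (3.8)
alpha_leaf  <- sum(v) / (1 / (k^2 * lambda0) + sum(v))   # credibility weight (3.10)
lambda_post <- alpha_leaf * lambda_leaf + (1 - alpha_leaf) * lambda0   # estimator (3.9)
```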

Remark 3.13. Note that the coefficient of variation k of $\Lambda$ plays the role of a tuning parameter for the credibility weights, i.e. $\alpha_l = \alpha_l(k)$. However, there may be a different $\Lambda_l$ associated to each leaf l, hence k should be estimated separately for each possible split. The implementation available within the rpart package does not take this into account and uses an a priori estimate of k for all leaves. Estimating k from data is further discussed in Section 5.3.1. ∎

The interpretation of the credibility weights $\alpha_l$ is as follows:

1. If $k \to 0$, then $\alpha_l \to 0$, i.e. when the variation of the prior tends to zero, we only believe in the prior expectation $\lambda_0$.

2. If $k \to \infty$, then $\alpha_l \to 1$, i.e. for an uninformative prior, we only believe in the observation based estimator $\hat\lambda_l$.

3. If $\sum_{i=1}^{m_l} v_i \to \infty$, then $\alpha_l \to 1$, i.e. we only believe in the observation based estimator.


Remark 3.14. Regarding an application to insurance data, the third property is rather interesting. Since $\sum_{i=1}^{m_l} v_i$ usually denotes the volume of a subportfolio, the property formalizes that for large portfolios, we believe more in the observation based estimator. ∎

Remark 3.15. Note that using estimator (3.9) instead of estimator (3.8), we cannot guarantee that a further split decreases the deviance statistics. Consider a splitting variable j and a split-point s with corresponding regions $R_1(j,s)$ and $R_2(j,s)$. Let $\mu(\lambda) = (\lambda v_1, \ldots, \lambda v_n)$, $\mu_1(\lambda) = (\lambda v_i : i \in R_1(j,s))$ and $\mu_2(\lambda) = (\lambda v_i : i \in R_2(j,s))$. Then it follows that

D(y, \mu(\hat\lambda)) = \min_{\lambda} D(y, \mu(\lambda)) = D(y_1, \mu_1(\hat\lambda)) + D(y_2, \mu_2(\hat\lambda)) \geq \min_{\lambda} D(y_1, \mu_1(\lambda)) + \min_{\lambda} D(y_2, \mu_2(\lambda)) = D(y_1, \mu_1(\hat\lambda_1)) + D(y_2, \mu_2(\hat\lambda_2)),

where D denotes the Poisson deviance statistics, $y_1 = (N_i : i \in R_1(j,s))$, $y_2 = (N_i : i \in R_2(j,s))$ and $\hat\lambda, \hat\lambda_1, \hat\lambda_2$ are derived from estimator (3.8). Using the Bayesian estimator (3.9), this does not hold true in general. Hence it may occur that at some point the best split does not lead to a decrease in the deviance statistics. Note that the rpart implementation only considers splits which decrease the deviance statistics. This may be problematic if there is a strong shrinkage to the prior mean. However, a close inspection of the rpart source code (see poisson.c file, lines 199-200 of the rpart source code, Therneau et al. (2015)) reveals that whilst growing the tree, the MLE estimator (3.8) is used. Only the predictions in the leaves are obtained by the Bayesian estimator (3.9) (see poisson.c file, lines 119-128), where the parameter k ensures that $\hat\lambda > 0$ and that the Poisson model is well-defined. Allowing $\hat\lambda = 0$ during the growing procedure is not problematic, since it implies that the deviance statistics (3.7) becomes infinite. Such a split will not decrease the deviance statistics and is therefore not considered by the algorithm anyway. ∎

Due to Remark 3.10, the pruning process described in Section 3.1.3 adapts directly to the use of the

deviance statistics as the new measure of goodness.

3.2.2 Weighted Least Squares

Alternatively to the likelihood based deviance statistics, the tree performance can also be assessed by a weighted least squares measure. This method is very similar to Algorithm 3.1, except for the use of weights $w_i$ in the minimization criterion,

\min_{j,s}\left[ \min_{c_1} \sum_{\{i:\, x_i \in R_1(j,s)\}} w_i (y_i - c_1)^2 + \min_{c_2} \sum_{\{i:\, x_i \in R_2(j,s)\}} w_i (y_i - c_2)^2 \right].    (3.11)


The inner minimizations are given by

\hat c_1 = \frac{1}{\sum_{\{i:\, x_i \in R_1(j,s)\}} w_i} \sum_{\{i:\, x_i \in R_1(j,s)\}} w_i y_i, \quad \text{resp.} \quad \hat c_2 = \frac{1}{\sum_{\{i:\, x_i \in R_2(j,s)\}} w_i} \sum_{\{i:\, x_i \in R_2(j,s)\}} w_i y_i.    (3.12)

See Appendix A.4 for a proof of equation (3.12).

Remark 3.16. If $w_i = 1$ for $i = 1, \ldots, n$, we obtain the classical case described in Section 3.1.2. ∎

Example 3.17. Analogously to Section 3.2.1, assume that for given v > 0 the response originates from a count variable N such that

Y = \frac{N}{v} = f(X) + \varepsilon \quad \text{with} \quad \text{Var}(Y \mid X = x) = g(v),    (3.13)

for a strictly decreasing function $g: [0,\infty) \to [0,\infty)$. This means that the variance decreases in v, which is called heteroscedasticity. Since a strictly decreasing function is injective, there exists an inverse function $h$ of $g(\cdot)$ on its range, which is again strictly decreasing. Assume we are given the i.i.d. observations $(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)$ with $y_i = N_i/v_i$. When choosing the weights $w_i = v_i$, the weighted sum of squares is given by

\sum_{i=1}^n w_i (y_i - c)^2 = \sum_{i=1}^n v_i (y_i - c)^2 = \sum_{i=1}^n h\big(\text{Var}(Y \mid X = x_i)\big)(y_i - c)^2,

i.e. observations with large variance receive less weight within the optimization criterion. By (3.13) and $w_i = v_i$, the minimizer (3.12) is given by

\hat\lambda = \frac{1}{\sum_{i=1}^n v_i}\sum_{i=1}^n N_i,

which coincides with the estimator used for Poisson regression trees, without making any distributional assumptions. However, in general the resulting trees do not coincide, since the optimization criteria differ. Moreover, there is no difficulty with $\hat\lambda = 0$, although this does not make sense in the context of modeling claims frequencies.

Alternatively, we could use $w_i = 1/\text{Var}(Y \mid X = x_i) = 1/g(v_i)$; then the minimization criterion becomes

\sum_{i=1}^n w_i (y_i - c)^2 = \sum_{i=1}^n \left(\frac{y_i - c}{\sqrt{g(v_i)}}\right)^2.

This means that we first normalize the variance of the residuals to one and then consider the squared error. However, to obtain the function $g(\cdot)$ we need to make distributional assumptions such that we are able to derive $g(\cdot) = \text{Var}(Y \mid X = \cdot)$. Also note that $g(v_i)$ may depend on $f(x_i)$, therefore in general this approach is not compatible with the rpart implementation. ⋄
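The weights $w_i$ of (3.11) correspond to the weights argument of rpart. A minimal sketch with the years at risk as weights and the empirical frequency as response follows; the data set and covariate names are hypothetical.

```r
library(rpart)

# Weighted least squares tree: response is the observed frequency y = nc / yr,
# each observation is weighted by its volume (years at risk).
train$freq <- train$nc / train$yr
fit_wls <- rpart(freq ~ age + price + km + nationality,
                 data    = train,
                 weights = train$yr,
                 method  = "anova",
                 control = rpart.control(cp = 0.001, minbucket = 1000))
```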


3.2.3 Classification Tree

So far we have always considered continuous response variables. However, in many applications we may face the situation that the response Y only takes finitely many discrete values $\{0, \ldots, J-1\}$, e.g. the number of claims of an insurance policy. When dealing with predicting a discrete response, one usually refers to classification instead of regression. The model (2.1) needs to be slightly modified, requiring $f: \mathbb{R}^p \to \{0, 1, \ldots, J-1\}$ and the error term to be discrete as well. We then call the function f a classifier and the problem of determining f a classification problem. Fortunately, the tree building algorithm described in Section 3.1 can easily be adapted to classification problems. The only changes we have to make are the choice of the splitting criterion and the corresponding estimator for the constant in each leaf. For a leaf m, recall that $R_m$ denotes its region in the covariate space and $n_m$ the number of observations in $R_m$. The idea is to find splits that partition the data into subsets which are more pure, measured by an impurity function, than before the splitting.

Definition 3.18 (Classification Rate). For each node m and class k the classification rate p(m, k) is given by

p(m, k) = \frac{1}{n_m} \sum_{\{i:\, x_i \in R_m\}} 1_{\{y_i = k\}}.

It holds that

p(m, 0) + \ldots + p(m, J-1) = 1.

Definition 3.19 (Impurity Function). An impurity function $\varphi$ is defined on all J-tuples $(p_0, \ldots, p_{J-1})$ with $p_j \geq 0$ and $\sum_{j=0}^{J-1} p_j = 1$, satisfying the properties:

1. $\varphi$ is only maximal at the point $(\frac{1}{J}, \ldots, \frac{1}{J})$.

2. $\varphi$ is minimal in the points $(1, 0, \ldots, 0), \ldots, (0, \ldots, 0, 1)$.

3. $\varphi$ is symmetric in $p_0, \ldots, p_{J-1}$, i.e.

\varphi(p_0, \ldots, p_{J-1}) = \varphi(p_{\pi(0)}, \ldots, p_{\pi(J-1)}),

for any permutation $\pi: \{0, \ldots, J-1\} \to \{0, \ldots, J-1\}$.

Given an impurity function $\varphi$, we can determine the impurity of a leaf m by

\varphi(p(m, 0), \ldots, p(m, J-1)).

By definition of the impurity function, it is maximal if in a leaf all classes occur equally often, resp. minimal if there is only one class in a leaf. The impurity function $\varphi$ most commonly used in the context of classification trees is called Gini index and defined by

G(p(m,0), \ldots, p(m,J-1)) := \sum_{k \neq l} p(m,k)\, p(m,l) = \sum_{k=0}^{J-1} p(m,k)\big(1 - p(m,k)\big).

The Gini index can be interpreted as follows: Assume that each element in a leaf m is classified to class k with probability p(m, k). Denote by $\hat Y$ a predictor for Y; then the training error, i.e. the probability of misclassification, is given by

P(\hat Y \neq Y) = E\big(1_{\{\hat Y \neq Y\}}\big) = E\big(E\big(1_{\{\hat Y \neq Y\}} \mid Y\big)\big) = \sum_{k=0}^{J-1} P(Y = k)\, E\big(1_{\{\hat Y \neq Y\}} \mid Y = k\big) = \sum_{k=0}^{J-1} P(Y = k) \sum_{l \in \{0,\ldots,J-1\}\setminus\{k\}} p(m,l) = \sum_{k \neq l} P(Y = k)\, p(m,l).

Hence the Gini index denotes an estimate of the training error. Using the impurity function as a measure of goodness, in each step of the recursion the tree algorithm chooses the splitting variable j and split-point s such that the impurity is improved the most. Let m be a leaf; for each pair (j, s), we get two child leaves $m_L(j,s)$ and $m_R(j,s)$. We modify the algorithm such that we aim to find (j, s) with

(j, s) = \arg\max_{j,s}\Big[ G(m) - \big[ G(m_L(j,s)) + G(m_R(j,s)) \big] \Big],

where $G(m) = G(p(m,0), \ldots, p(m,J-1))$. Having determined the splits, we use the majority vote

\hat k(m) = \arg\max_{k}\, p(m, k)

within each leaf to determine the classifier, with a deterministic rule if there are two maximal classes with the same vote.
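As a small sketch, the Gini index of a leaf can be computed directly from its classification rates, and a classification tree is obtained in rpart with method = "class", where Gini is the default split criterion. The data set and covariates are hypothetical, and treating the number of claims as a class label is only for illustration.

```r
library(rpart)

# Gini index of a leaf from its classification rates p(m, 0), ..., p(m, J-1).
gini <- function(p) sum(p * (1 - p))
gini(c(0.7, 0.2, 0.1))   # impure leaf
gini(c(1, 0, 0))         # pure leaf: 0

# Classification tree for a discrete response (number of claims as class label).
fit_class <- rpart(factor(nc) ~ age + price + km, data = train,
                   method = "class",
                   parms  = list(split = "gini"))
```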

3.3 Additional Features

3.3.1 Categorical Covariates

Often in practice, we encounter both numerical and categorical covariates. When splitting a categorical covariate having q possibly unordered values, there are $2^{q-1}-1$ possible partitions of the q values into two disjoint groups. Unfortunately, the computational burden of an exhaustive search through all partitions becomes immense if q is large. However, we can substitute the categorical covariate by a numerical covariate containing the mean of the response for the respective covariate levels and apply the tree algorithm to the new numerical covariate. According to Hastie et al. (2001), one can show that this leads to the optimal split under the squared error loss function.


Example 3.20. Assume that an insurance company stores the number of claims (nc), the duration of the insurance contract (jr) and the sex of each policyholder (pol). The aim of the insurer is to predict the claims frequency (claims / duration) using the sex of the policyholder. In order to apply the binary splitting algorithm to the covariate sex, a temporary (numerical) covariate fsex is generated, storing the mean of the response for the respective covariate levels, as illustrated in Table 3.1. The tree algorithm is then applied to the covariate fsex.

pol    1     2     3     4     5     6     7     8     9     10
nc     1     0     0     1     0     0     0     1     1     0
jr     1     1     1     1     1     1     1     1     1     1
sex    m     m     w     m     w     w     m     w     m     m
fsex   0.5   0.5   0.25  0.5   0.25  0.25  0.5   0.25  0.5   0.5

Table 3.1 – An example on how categorical covariates (e.g. sex) get transformed to numerical covariates (e.g. fsex) when fitting a regression tree. The data contains six male drivers, who have observed three claims in total, hence the mean of the outcome is 0.5; analogously, the four female drivers have one claim, giving a mean of 0.25.

⋄
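The transformation of Table 3.1 amounts to replacing each level of the categorical covariate by the mean response of that level, which can be sketched in R as follows (column names follow Table 3.1):

```r
# Toy data of Table 3.1
d <- data.frame(nc  = c(1, 0, 0, 1, 0, 0, 0, 1, 1, 0),
                jr  = rep(1, 10),
                sex = c("m", "m", "w", "m", "w", "w", "m", "w", "m", "m"))

# Mean observed frequency per level of the categorical covariate
d$fsex <- ave(d$nc / d$jr, d$sex, FUN = mean)
# males: 3 claims / 6 years = 0.5, females: 1 claim / 4 years = 0.25
```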

3.3.2 Missing Data and Surrogate Splits

In this section we are not concerned with how we choose the splits, but rather with how we deal with missing data. Many methods in statistics deal with missing data by refusing to deal with them, i.e. deleting observations with missing covariates in advance. We usually have two objectives in mind when considering missing data: first, make maximum use of the given observations even if some covariates are missing; second, be able to make predictions for cases with some covariate values missing. Having the first objective in mind, it is not satisfying to discard any non-complete observation during the training process. Possible solutions are:

1. Filling in the missing values, e.g. using the mean of the non-missing predictor variables.

2. For categorical variables, we can introduce a new category "missing".

3. Make use of surrogate variables.

In the following, we explain surrogate splits in more detail. When considering a predictor variable for a split, we only use the observations for which that covariate is not missing. This leads to a primary splitting variable and split-point. Before continuing with the binary splitting algorithm, we figure out which other covariate and corresponding split would best mimic the primary split, see Algorithm 3.2.


Algorithm 3.2 Calculation of the surrogate variables

Precondition: Let t = (N, j, s) be a node according to Definition 3.1.

function surrogate(node = (N, j, s))
    # First check whether the primary split sends an observation to the left or to the right:
    y_k := 1_{x_{jk} ≤ s} for all k ∈ N.
    # Calculate the surrogate splits:
    for i ∈ {1, . . . , p} \ {j} do
        surrogate_split[i] := class_tree(response = y, covariate = x_i, size = 2)
        ŷ_k := predict(surrogate_split[i], x_{ik}), for all k ∈ N
        misclass[i] := ( Σ_{k ∈ N} 1_{ŷ_k ≠ y_k} ) / |N|
    end for
end function

For example, assume that at the current stage the algorithm decided to split the variable age of driver at the splitting point (age of driver ≤ 40, age of driver > 40). After that we use the tree algorithm to predict the two categories {age of driver ≤ 40} and {age of driver > 40}. For that purpose we fit a univariate classification tree with two leaves, modeling the discrete response Y = 1_{age of driver ≤ 40}, for each covariate except the one used in the primary split. This leads to a sequence of surrogate variables, which can be ordered according to their misclassification errors. If we need to evaluate the tree predictor during the training or testing phase at a point where we have missing data, we can use the surrogate splits instead. The concept of surrogate splits was introduced by Breiman et al. (1984), trying to reduce the effect of missing data by exploiting correlations between the predictor variables.
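In rpart, the use of surrogate splits is controlled via rpart.control. The sketch below (hypothetical data sets train and test_with_missing) keeps up to five surrogate splits per node and uses them when covariate values are missing; summary() lists the primary and surrogate splits of each node.

```r
library(rpart)

fit_sur <- rpart(cbind(yr, nc) ~ age + price + km + nationality,
                 data = train, method = "poisson",
                 control = rpart.control(maxsurrogate = 5,  # surrogates kept per node
                                         usesurrogate = 2,  # use surrogates for missing values
                                         cp = 0.001))

# The summary reports primary and surrogate splits for each node.
summary(fit_sur)

# Predictions also work for observations with missing covariates,
# as long as a surrogate split can be evaluated.
predict(fit_sur, newdata = test_with_missing)
```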

3.3.3 Variable Importance

In this section, we address the question of which covariates are the most important in a tree predictor. In the literature, there are slightly different notions of variable importance related to regression and classification trees. The definition we use is based on Chapter 5.3.4 of Breiman et al. (1984).

A critical issue regarding variable importance is how to deal with variables which do not give the best split for a node but may be second or third best. For example, covariate x1 may never occur as a split in the final tree. However, if a second covariate x2 is removed and a new tree is grown, x1 may occur very prominently in the splits, with a tree almost as accurate as the original. In such situations, we require that a measure of variable importance is capable of detecting the importance of x1. The solution proposed by Breiman et al. (1984) is based on using both primary and surrogate splits, see Section 3.3.2. Let T be the final optimal subtree, and let R be the measure of goodness used to grow the tree. At each node $t \in T$, we define $\Delta R(t, x)$ as the decrease in the loss due to splitting variable x, where x may be a primary or surrogate splitting variable.


Definition 3.21 (Variable Importance). The measure of importance of variable x is defined as

M(x) = \sum_{t \in T} \Delta R(x, t).

In Appendix C.1, we provide an implementation of the variable importance function according to Definition 3.21, which is compatible with the rpart package. The measure is often normalized to 100 for the most important covariate. Unfortunately, variable importance does not contain any information about the direction of influence of a particular covariate on the response.

Example 3.22. Analogously to Example 3.20, we assume that an insurance company wants to

predict the claims frequency based on the covariates age of driver (age), price of car (price) as well

as policynumber (pol). Figure 3.4 shows the variable importance for an example tree predictor. As

expected, we see that both age and price seem to occur prominently in the tree. The policynumber

is not used at all and hence results in a variable importance of zero.

Figure 3.4 – Variable importance plot for Example 3.22 (variable importance, normalized to 100, for the covariates age, price and pol).

⋄

Remark 3.23. Note that the rpart package also provides a built-in function for variable importance. However, the implementation multiplies the improvement $\Delta R(x, t)$ by an agreement factor if x was used as a surrogate split in node t. The agreement factor takes into account to what extent the surrogate split is capable of mimicking the primary split. For our purpose, we will work with the original definition given in Breiman et al. (1984). ∎
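For reference, rpart's built-in measure of Remark 3.23 is available as the component variable.importance of a fitted tree. A common normalization to 100 for the most important covariate is sketched below; fit_pois refers to the hypothetical Poisson tree of the earlier sketch.

```r
# rpart's built-in variable importance (improvements, with surrogate splits
# scaled by their agreement, cf. Remark 3.23), normalized to 100.
vi <- fit_pois$variable.importance
vi_normalized <- 100 * vi / max(vi)
barplot(sort(vi_normalized, decreasing = TRUE), las = 2,
        ylab = "Variable Importance")
```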


3.4 Random Forest

In practice, classification and regression trees have proven to be rather unstable, in the sense that small changes in the training data may lead to large changes in the predictions, see also Section 3.4.2. In this section we present the random forest algorithm introduced by Breiman (2001), which tries to reduce the variability of an individual tree predictor. Conceptually, random forest is based on combining many tree predictors. For that purpose, we first introduce bagging, a key building block of random forest. We call methods derived from many individual tree predictors tree ensembles.

3.4.1 Bagging

Bagging was introduced by Breiman (1996) and consists of the two elements bootstrapping and

aggregating. We first discuss the concept of aggregation.

Aggregating

Assume we have a training data set $L = \{(x_i, y_i) : i = 1, \ldots, n\}$, where $(x_1, y_1), \ldots, (x_n, y_n)$ are i.i.d. drawn from a distribution P. Using this training data we obtain the predictor $\hat f(x, L)$. Now suppose that we are given a sequence of training data sets $\{L_k\}_{k \geq 1}$, each consisting of n observations drawn from the same distribution. Instead of one single predictor $\hat f(x, L_k)$, the idea of aggregation is to use the average over the training sets $\{L_k\}_{k \geq 1}$, i.e.

\hat f_{\text{agg}}(x) = \lim_{K \to \infty} \frac{1}{K} \sum_{k=1}^{K} \hat f(x, L_k) = E\big(\hat f(x, L)\big), \quad P\text{-a.s.},

as a new predictor. Note that the convergence follows by the law of large numbers. An important property of aggregating is that it does not change the bias, since

\text{Bias}(\hat f_{\text{agg}}(x)) = f(x) - E(\hat f_{\text{agg}}(x)) = f(x) - \hat f_{\text{agg}}(x) = f(x) - E(\hat f(x, L)) = \text{Bias}(\hat f(x, L)).    (3.14)

However, aggregating can substantially increase the predictive capacity measured by the mean squared error. Fix a data point (x, Y) independent from the learning set L, then

E\big((Y - \hat f(x,L))^2 \mid Y\big) = Y^2 - 2Y\, E\big(\hat f(x,L) \mid Y\big) + E\big(\hat f(x,L)^2 \mid Y\big).

Due to independence of Y and L, $E(\hat f(x,L) \mid Y) = E(\hat f(x,L)) = \hat f_{\text{agg}}(x)$, and by Jensen's inequality it follows that $E(\hat f(x,L)^2) \geq E(\hat f(x,L))^2$. Together this implies that

E\big((Y - \hat f(x,L))^2 \mid Y\big) \geq \big(Y - \hat f_{\text{agg}}(x)\big)^2.


Taking the expectation with respect to Y on both sides gives

E\big((Y - \hat f(x,L))^2\big) \geq E\big((Y - \hat f_{\text{agg}}(x))^2\big),    (3.15)

i.e. aggregating reduces the conditional mean squared error. The magnitude of the decrease depends on how unequal the two sides of

E\big(\hat f(x,L)\big)^2 \leq E\big(\hat f(x,L)^2\big)

are, or equivalently: the bigger the variance $\text{Var}(\hat f(x,L))$, the larger the reduction in the conditional mean squared error. Combining Proposition 2.2 and equation (3.15), it follows that

E\big((Y - \hat f(x,L))^2\big) = \text{Bias}\big(\hat f(x,L)\big)^2 + \text{Var}\big(\hat f(x,L)\big) + \text{Var}(\varepsilon) \geq E\big((Y - \hat f_{\text{agg}}(x))^2\big) = \text{Bias}\big(\hat f_{\text{agg}}(x)\big)^2 + \text{Var}\big(\hat f_{\text{agg}}(x)\big) + \text{Var}(\varepsilon).

However, since $\text{Bias}(\hat f(x,L)) = \text{Bias}(\hat f_{\text{agg}}(x))$, we conclude that the improvement in the conditional mean squared error purely originates from an improvement of the variance of the aggregated predictor. By equation (3.15), we also obtain a decrease in the mean squared error, since

E\big((Y - \hat f(X,L))^2\big) = E\big(E\big((Y - \hat f(X,L))^2 \mid X\big)\big) \geq E\big(E\big((Y - \hat f_{\text{agg}}(X))^2 \mid X\big)\big) = E\big((Y - \hat f_{\text{agg}}(X))^2\big).

We can conclude that aggregating improves the predictive capacity of an estimation procedure in terms of the mean squared error by reducing its variance.

Bagging

Unfortunately, we usually do not have the luxury of arbitrarily many duplicates of L. Instead, the idea of bagging is to take bootstrap samples $\{L^{(b)}\}_{b=1,\ldots,B}$ from L to determine a bootstrapped predictor for $\hat f_{\text{agg}}$, analogously to Section 2.3.2. The bootstrapped predictor is obtained by taking the average over the bootstrap samples, resulting in the predictor

\hat f_{\text{bagg}}(x) = \frac{1}{B} \sum_{b=1}^{B} \hat f(x, L^{(b)}).    (3.16)

For $B \to \infty$, we have that

\lim_{B \to \infty} \frac{1}{B} \sum_{b=1}^{B} \hat f(x, L^{(b)}) = E_{\hat P_n}\big(\hat f(x, L^{(1)})\big), \quad \hat P_n\text{-a.s.}


This procedure is called bagging, an acronym for bootstrap aggregating. Whether bagging improves accuracy depends largely on the stability of the procedure on which $\hat f(x,L)$ is based. If changes in the training data L produce small changes in $\hat f(x,L)$, i.e. a small variance $\text{Var}(\hat f(x,L))$, then $\hat f_{\text{agg}}(x)$ will be close to $\hat f(x,L)$. However, large improvements will occur if the base procedure is unstable, where small changes in L can result in large changes in $\hat f(x,L)$. On the other hand, the impact of bootstrapping will also depend on the stability of the base procedure and the richness of the data. If we consider a stable method, the loss in accuracy due to the transition from $\hat f(x,L)$ to the estimate based on the bootstrap sample $\hat f(x,L^{(1)})$ will be small, since the bootstrapped sample $L^{(1)}$ is expected to be similar to L. On the other hand, for unstable methods, even small changes result in large changes of the estimate, and so we expect $\hat f(x,L)$ to be better than $\hat f(x,L^{(1)})$. In Chapter 4 of Breiman (1996), an example for the cross-over point between instability and stability, beyond which the bagged predictor $\hat f_{\text{bagg}}(x)$ does worse than $\hat f(x,L)$, is given.

Remark 3.24. As shown in the present section, aggregation in general reduces the mean squared error. Since the bias is left untouched, this results in a reduction of the variance. When considering different loss functions, e.g. the deviance statistics, there does not exist a similar decomposition into bias and variance. Hence we cannot draw the same conclusion of an improvement of the predictive capacity due to variance reduction. However, empirical results for simulated insurance claims data show that the variance reduction due to aggregating also positively affects the deviance statistics, see Section 6.2.3. ∎

3.4.2 Bagged Tree Predictor

So far, all considerations on bagging are fully general and applicable to any estimation procedure. However, as shown above, the improvements of aggregation are in particular notable for procedures with high variance $\text{Var}(\hat f(x,L))$. In this section, we discuss two different sources of variability contained in the tree algorithm. Based on these considerations, we conclude that bagging has a large potential of improving the generalization performance of classification and regression trees. The literature of this section is based on Berk (2008) as well as Shalizi (2009).

The first source of uncertainty of a tree predictor lies in the variability of the estimates within a given partition. This issue can be solved by ensuring that the leaves have a minimal size. In the rpart function this is controlled by the parameter minbucket. In Section 5.3.1 we show how a reasonable choice for insurance claims data can be derived. The second source lies in how splitting variables are chosen. In each node the sample size may be significantly reduced, which very often leads to the situation where several splitting variables result in a similar reduction of the loss function. Then already small changes in the training data may lead to choosing a different splitting variable. This effect is amplified for highly correlated covariates. Since the algorithm proceeds in a stage-wise manner, a chosen split affects all succeeding splits. If changes occur early in the tree, the impacts are even bigger. Based on these arguments, we have conducted a simulation study in the second part of the thesis (see Section 5.3.7) comparing the variability of the tree predictor to a generalized linear regression model. The findings in Section 5.3.7 support the intuition of high variance related to tree models.

3.4.3 Random Forest

For bagging we saw that the improvement comes purely from variance reduction. Therefore, we further elaborate on the variance of the tree predictor. Since the $L^{(b)}$ are identically distributed, denote for all $b = 1, \ldots, B$ the variance of a bootstrapped tree by

\sigma^2(x) = \text{Var}\big(\hat f(x, L^{(b)})\big).

The next lemma gives more detailed insight into the variance of the bagged tree predictor.

Lemma 3.25. Assume that the sampling correlation between any pair of tree predictors is equal and given by

\rho(x) = \text{corr}\big(\hat f(x, L^{(i)}), \hat f(x, L^{(j)})\big), \quad \text{for } i \neq j.

Then the variance of the bagged tree predictor can be written as

\text{Var}\big(\hat f_{\text{bagg}}(x)\big) = \rho(x)\,\sigma^2(x) + \frac{1 - \rho(x)}{B}\,\sigma^2(x).    (3.17)

Proof. The proof follows by a direct calculation,

\text{Var}\big(\hat f_{\text{bagg}}(x)\big) = \text{Var}\Big(\frac{1}{B}\sum_{b=1}^{B}\hat f(x,L^{(b)})\Big) = \frac{1}{B^2}\sum_{i,j=1}^{B}\text{Cov}\big(\hat f(x,L^{(i)}),\hat f(x,L^{(j)})\big)
= \frac{1}{B^2}\Big(\sum_{i,j=1,\ i\neq j}^{B}\text{Cov}\big(\hat f(x,L^{(i)}),\hat f(x,L^{(j)})\big) + \sum_{i=1}^{B}\text{Var}\big(\hat f(x,L^{(i)})\big)\Big)
= \frac{1}{B^2}\big((B^2 - B)\,\sigma^2(x)\rho(x) + B\,\sigma^2(x)\big) = \rho(x)\,\sigma^2(x) + \frac{1 - \rho(x)}{B}\,\sigma^2(x),

where in the third equality we used that $\text{Cov}(\hat f(x,L^{(i)}),\hat f(x,L^{(j)})) = \sigma^2(x)\rho(x)$, as well as the fact that the first sum has $B^2 - B$ summands.

As $B \to \infty$, the second term tends to zero, but the first term remains. Hence the size of the correlation between a pair of tree predictors limits the benefits of averaging.

The idea of Breiman (2001) is to decouple the single trees used for aggregation in order to decrease the correlation and hence further improve the predictor. This is achieved by random covariate selection: when growing a tree, in each step of the recursion we randomly choose m of the p covariates. We then only search for the optimal split within these m covariates. We denote by $\hat f(x, \Theta(L))$ a single random forest tree, where $\Theta(L)$ characterizes the random forest tree in terms of splitting variables, split-points at each node and leaf values. Analogously to bagging, the random forest predictor is obtained by aggregation, see Algorithm 3.3.

Algorithm 3.3 Random Forest for Regression

Denote by L a learning sample.
for b = 1, . . . , B do
    1. Generate a bootstrap sample L* by drawing with replacement from the data L.
    2. Grow a tree on the bootstrapped data by recursively repeating the following steps for each leaf of the tree, until the maximum number of leaves (maxnodes) is reached or no further split is possible:
        (a) Randomly select m of the p variables.
        (b) Split the leaf according to the best variable/split-point among the m covariates.
    This results in a random forest tree $\hat f(x, \Theta_b(L))$, where $\Theta_b(L)$ characterizes the random forest tree in terms of splitting variables, split-points at each node and leaf values.
end for
Based on the ensemble of trees $\{\hat f(x, \Theta_b(L))\}_{b=1,\ldots,B}$, output the random forest predictor

\hat f^{B}_{\text{rf}}(x) = \frac{1}{B}\sum_{b=1}^{B}\hat f(x, \Theta_b(L)).    (3.18)

For an individual random forest tree,

the sampling correlation $\rho_m(x)$ and sampling variance $\sigma_m^2(x)$ are given by

\rho_m(x) = \text{corr}\big(\hat f(x, \Theta_i(L)), \hat f(x, \Theta_j(L))\big), \quad \text{for } i \neq j,

resp.

\sigma_m^2(x) = \text{Var}\big(\hat f(x, \Theta(L))\big),

with $\rho_m(x) = \rho(x)$ resp. $\sigma_m^2(x) = \sigma^2(x)$ for m = p. By equation (3.18), in the limiting case $B \to \infty$, the random forest predictor is given by

\hat f_{\text{rf}}(x) = E\big(\hat f(x, \Theta(L)) \mid L\big), \quad P_n\text{-a.s.}

Analogously to equation (3.17), the variance of the random forest predictor is given by

\text{Var}\big(\hat f_{\text{rf}}(x)\big) = \rho_m(x)\,\sigma_m^2(x).

When decreasing m we expect a decrease in the sample correlation $\rho_m(x)$, since the randomized variable selection decouples the individual random forest trees (see Figure 3.5). For the variance


of a single random forest tree $\sigma_m^2(x)$, the effects are less obvious. Applying the tower property for conditional variance, we get for the variance of a single tree predictor

\sigma_m^2(x) = \text{Var}\big(E(\hat f(x,\Theta(L)) \mid L)\big) + E\big(\text{Var}(\hat f(x,\Theta(L)) \mid L)\big) = \text{Var}\big(\hat f_{\text{rf}}(x)\big) + E\big(\text{Var}(\hat f(x,\Theta(L)) \mid L)\big),

where the first term is the variance of the random forest predictor, which we hope is decreasing when decreasing m. The second term denotes the variance due to randomization, which is going to increase when decreasing m, since there is more randomization. Overall, the parameter m should therefore be seen as a tuning parameter which controls the possible improvements due to randomization.

In order to illustrate the effects of the tuning parameter m upon the random forest predictor, we have simulated from

Y = \frac{1}{\sqrt{50}}\sum_{i=1}^{50} X_i + \varepsilon,

where all $X_i$ and $\varepsilon$ are i.i.d. normally distributed with mean zero and variance one. We have used 500 training sets of size 100 and one test set with 600 observations. Figure 3.5 shows the sample correlation $\rho_m(x)$, the tree variance $\sigma_m^2(x)$ as well as their product $\text{Var}(\hat f_{\text{rf}}(x)) = \rho_m(x)\sigma_m^2(x)$ as a function of the number of randomly selected splitting variables m. The boxplots represent the respective values at the 600 testing data points.

Figure 3.5 – Simulation study for reducing the variance due to a reduction of the sample correlation between individual random forest trees. The three panels show, as a function of m, the sample correlations between trees, the variances of a single random forest tree, and the variance of the random forest.

This shows that we can reduce the variance of the random forest predictor drastically when reducing m. In order to eventually improve the predictive capacity, we also need to keep the bias in mind. The randomization and the reduction of the sample space when growing a random forest tree typically increase the bias. In Figure 3.6, we show the average squared bias, the average variance and the mean squared error (see equation (2.4)) as a function of m. We observe the mentioned increase in bias when decreasing m, i.e. reducing the sample space and increasing the randomization. Therefore we face a classical bias-variance trade-off in the choice of m. In the literature it is often suggested to use $m = \lfloor p/3 \rfloor$, where p denotes the number of covariates.


Figure 3.6 – Bias-variance trade-off of the random forest predictor as a function of the tuning parameter m (mean squared error, average squared bias and average variance).

However, m should be seen as a tuning parameter, and different values should be tested in practice. So far we have not discussed the parameter maxnodes, which limits the size of the individual random forest trees. Based on the findings in Part 2 on simulated and real data, we suggest choosing maxnodes slightly larger than the size of an individual tree predictor chosen according to the minimum cross-validation error rule, see Section 5.3.3 below.
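With the randomForest package, the two tuning parameters discussed here correspond to the arguments mtry (the number m of randomly selected covariates) and maxnodes. The numerical values below are purely illustrative and the data set and covariates are hypothetical.

```r
library(randomForest)

rf <- randomForest(freq ~ age + price + km + nationality,
                   data     = train,
                   ntree    = 500,   # number B of random forest trees
                   mtry     = 2,     # m: covariates sampled at each split
                   maxnodes = 30)    # upper bound on the number of leaves per tree

pred_rf <- predict(rf, newdata = test)

# Different values of mtry should be compared, e.g. on a validation set.
```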

3.4.4 Model Visualization

Since trees allow for very complex interactions between the predictor variables, it is difficult to understand the individual influence of each covariate. For single tree predictors, we can use the tree representation to visualize the predictor as well as the interactions. For tree ensembles, there does not exist an extension of the tree representation. In the literature, such models are often referred to as Black Box Models. In this section, we discuss three tools for black box visualization: Variable Importance, Partial Dependence Plots and Individual Conditional Expectation Plots, the latter introduced in Goldstein et al. (2014).

Variable Importance

For black box models a measure of variable importance is even more desired. Fortunately, we can generalize the idea of variable importance for single tree predictors introduced in Section 3.3.3 to tree ensembles. We proceed by calculating the variable importance for each covariate of every tree predictor used in the tree ensemble and output the average. Formally, let $T_1, \ldots, T_m$ be the single tree predictors contained in the ensemble. For each tree, we can calculate the variable importance M(x, T) according to Definition 3.21, where $x \in \{x_1, \ldots, x_p\}$ and $T \in \{T_1, \ldots, T_m\}$. The variable importance for the tree ensemble is given by

M^{\text{ens}}(x) = \frac{1}{m}\sum_{i=1}^{m} M(x, T_i).


Partial Dependence and Individual Conditional Expectation Plots

The partial dependence plot shows the influence of a covariate $X_i$ for $i \in \{1, \ldots, p\}$ when all other covariates are integrated out. Formally, fix a covariate index $i \in \{1, \ldots, p\}$ and define the set of remaining covariates $C := \{1, \ldots, p\}\setminus\{i\}$. Denote by $X_C := (X_j : j \in C) \in \mathbb{R}^{p-1}$ the vector of the corresponding covariates. Then the partial dependence function $f_i: \mathbb{R} \to \mathbb{R}$ is given by

f_i(x) = E\big(f(X_i, X_C) \mid X_i = x\big) = \int f(x, X_C)\, dP(X_C),    (3.19)

where f denotes the model function (2.1) and $P(X_C)$ denotes the distribution of $X_C$. As neither f nor $P(X_C)$ are known, we estimate equation (3.19) by

\hat f_i(x) = \frac{1}{n}\sum_{j=1}^{n} \hat f_n(x, x_{j,C}),    (3.20)

where $x_{j,C}$, for $j = 1, \ldots, n$, denote the observed values of the covariates $X_C$. Plotting x against $\hat f_i(x)$ displays the average value of $\hat f_n(\cdot)$ as a function of the i-th covariate. However, the results must be interpreted with caution.
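Estimator (3.20) can be computed directly by averaging predictions over the observed values of the remaining covariates. A generic R sketch follows, using the hypothetical random forest rf and data set train of the previous sketches, with "age" as an example covariate.

```r
# Estimated partial dependence function (3.20) for one covariate of a fitted model.
partial_dependence <- function(model, data, covariate, grid) {
  sapply(grid, function(x) {
    data_mod <- data
    data_mod[[covariate]] <- x                 # set the covariate of interest to x everywhere
    mean(predict(model, newdata = data_mod))   # average over the observed x_{j,C}
  })
}

grid <- seq(min(train$age), max(train$age), length.out = 50)
pd   <- partial_dependence(rf, train, "age", grid)
plot(grid, pd, type = "l", xlab = "age", ylab = "partial dependence")
```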

Example 3.26. This example is taken from Goldstein et al. (2014). Consider the data generating process

Y = 0.2 X_1 - 5 X_2 + 10 X_2\, 1_{\{X_3 \geq 0\}} + \varepsilon,    (3.21)

where $\varepsilon \sim \mathcal{N}(0,1)$ and $X_1, X_2, X_3$ are i.i.d. $\sim U[-1,1]$. We have simulated 1'000 observations from model (3.21) and fit a random forest predictor using the R package randomForest. Figure 3.7 shows the true values of $f(x_1, x_2, x_3) = 0.2 x_1 - 5 x_2 + 10 x_2 1_{\{x_3 \geq 0\}}$ versus $x_2$ for the observed data.

Figure 3.7 – Example of a misleading partial dependence plot for dependent covariates. (a) Scatter plot of $f(x_1, x_2, x_3)$ versus $x_2$; (b) partial dependence plot for $X_2$.


The right panel of Figure 3.7 shows the corresponding partial dependence plot for covariate $X_2$. It suggests that on average $X_2$ is not associated with Y. However, from model equation (3.21) we see that $X_2$ has either a positive effect on Y if $X_3 \geq 0$ or a negative effect if $X_3 < 0$. In the partial dependence function these effects cancel each other out due to symmetry. Therefore the partial dependence plot may be misleading if the model contains interactions. ⋄

Apart from dependencies between the covariates, their marginal distributions also play a crucial role in the quality and interpretability of the partial dependence plot.

Example 3.27. Note that the estimation in (3.20) is twofold: first of all, it is an estimate of the expected value in equation (3.19), but it is also an estimate of the unknown model function $f(\cdot)$. This should be kept in mind when using the partial dependence plot. Consider the conceptual model shown in Figure 3.8a, which attains the values 0 or 1 according to the covariates $X_1$ and $X_2$. The blue lines denote the marginal distributions of the covariates $X_1$ and $X_2$. In the top left corner, we have particularly few observations. On the other hand, in the bottom right corner we have the majority of the observed data points. When fitting a model, this may largely distort the predictor. In Figure 3.8b a conceptual example of a fitted model is shown, which is distorted by the marginal distribution of the covariates. Therefore the quality of the estimated partial dependence function also depends on the quality of the fitted predictor.

Figure 3.8 – Example of a distorted fit due to skewed marginal distributions. (a) Model with marginal distributions; (b) distorted model fit.

However, note that the marginal distributions are also of relevance for the interpretation of the partial dependence plot if the fitted model successfully recovers the characteristics shown in Figure 3.8a. In Figure 3.9 the partial dependence function for the covariate $X_1$ corresponding to the model in Figure 3.8a is shown. We see that the resulting function is heavily distorted by the skewed marginal distribution of the covariate $X_2$. For small values of $X_1$, the partial dependence function is nearly zero, since the contribution of covariate values for which the model attains 1 is very small due to the skewness of $X_2$; analogously for large values of $X_1$.


Figure 3.9 – Distorted partial dependence function due to skewed marginal distributions (partial dependence function for $X_1$).

⋄

The individual conditional expectation (ICE) plot disaggregates the output of the partial dependence plot. Rather than plotting the average, we may also plot all n estimated curves $\hat f_n(x, x_{j,C})$ for $j = 1, \ldots, n$. The terminology individual conditional expectation plot comes from the fact that $\hat f_n(x)$ is an estimate for the conditional expectation $f(x) = E(Y \mid X = x)$, see Section 2.1. Each curve $\hat f_n(x, x_{j,C})$ shows the predicted response as a function of the i-th covariate with the remaining covariates $x_{j,C}$ held fixed. Figure 3.10 shows the ICE plot for Example 3.26. The black line denotes the estimated partial dependence function. Parallel curves indicate that the effect of the i-th covariate on the response Y does not depend on other covariates. Figure 3.10, on the other hand, shows that the effect of $X_2$ clearly depends on further covariates.
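ICE curves are obtained by keeping the individual predictions instead of their average. The sketch below follows the structure of Example 3.26 (simulated data, randomForest fit) and plots one curve per observation, with the partial dependence function as their mean; it is an illustrative reconstruction, not the code used for Figure 3.10.

```r
library(randomForest)
set.seed(1)

# Simulate from model (3.21)
n  <- 1000
X1 <- runif(n, -1, 1); X2 <- runif(n, -1, 1); X3 <- runif(n, -1, 1)
Y  <- 0.2 * X1 - 5 * X2 + 10 * X2 * (X3 >= 0) + rnorm(n)
dat <- data.frame(Y, X1, X2, X3)

rf_sim <- randomForest(Y ~ X1 + X2 + X3, data = dat)

# ICE curves for X2: one predicted curve per observation (rows), grid values in columns.
grid <- seq(-1, 1, length.out = 41)
ice  <- sapply(grid, function(x) {
  d <- dat
  d$X2 <- x
  predict(rf_sim, newdata = d)
})

matplot(grid, t(ice), type = "l", lty = 1, col = "grey",
        xlab = "x2", ylab = "prediction")
lines(grid, colMeans(ice), lwd = 2)   # estimated partial dependence function
```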

Figure 3.10 – ICE plot corresponding to Example 3.26 (predictions as a function of $x_2$). The intersecting lines suggest the presence of an interaction between $X_2$ and further covariates.


Part II

Data Analysis


Chapter 4

Motor Insurance Collision Data

In this chapter, we introduce the general objectives in pricing of non-life insurance contracts based

on Wuthrich (2015). We then describe the Motor Insurance Collision Data Set of AXA Winterthur

as well as a data generating process derived from the characteristics observed in the collision data

set.

4.1 Non-Life Insurance Pricing

"Non-life insurance pricing is the art of setting the price of an insurance policy, taking into consideration various properties of the insured object and the policyholder. The main source on which to base the decision is the insurance company's own historical data on policies and claims. In a tariff analysis, an actuary uses this data to find a model which describes how the claim cost of an insurance policy depends on a number of explanatory variables." (see preface, page vii, of Ohlsson and Johansson (2010)).

A non-life insurance policy may cover damages on a car, house or other property, or losses caused to another person (third party liability). For a company, the insurance may cover property damages, cost of business interruption and more. In effect, any insurance that is not a life insurance product is classified as non-life insurance, sometimes also called property and casualty (P&C) insurance. Since not every policy has the same expected loss, there is a need for statistical methods to capture this feature. Historically, the pricing of insurance contracts has been strongly regulated and did not involve much statistical analysis. In fact, Switzerland was the last European country to get rid of the cartel rate regulations in 1996. Since then non-life insurance pricing has changed towards charging a fair premium in the sense that each policyholder pays a premium that corresponds to the expected losses transferred to the insurance company. Before we start with the statistics, we define a mathematical framework suitable for modeling insurance claims associated to a policyholder.

Definition 4.1 (Total Claim Amount). The total claim amount $S_m$ of policy m is given by the compound distribution

S_m = Y_1^{(m)} + \ldots + Y_{N_m}^{(m)} = \sum_{i=1}^{N_m} Y_i^{(m)},

where

1. the claims count $N_m$ is a discrete random variable supported on $A \subset \mathbb{N}_0$,

2. the claim sizes $(Y_i^{(m)})_{i \geq 1}$ are i.i.d. $\sim G_m$ with $G_m(0) = 0$,

3. $(Y_i^{(m)})_{i \geq 1}$ and $N_m$ are independent.

Since $N_m$ plays the role of the number of claims of policyholder m, it is natural to assume that it may only take integer values. In the context of insurance pricing, the number of claims should always be understood in relation to an underlying volume $v_m > 0$. For our purpose, we use the amount of time that an insurance company is liable for the risk of the policy, usually measured in years. We will use the terminology years at risk for the underlying volume $v_m$. The ratio $N_m/v_m$ is called claims frequency, with expected claims frequency denoted by

E\left(\frac{N_m}{v_m}\right) = \lambda_m.

The second assumption in Definition 4.1 specifies the individual claim sizes. The condition $G_m(0) = 0$ implies that the claims $Y_i^{(m)}$ are strictly positive P-a.s. Within this framework we can give more detailed insight into the expected total claim amount $E(S_m)$.

Proposition 4.2. Let $S_m$ be the total claim amount of policyholder m. Then the expected total claim amount is given by

E(S_m) = E(N_m)\, E\big(Y_1^{(m)}\big) = \lambda_m v_m\, E\big(Y_1^{(m)}\big).

Proof. The proof is an application of the tower property for conditional expectation,

E(S_m) = E\Big(\sum_{i=1}^{N_m} Y_i^{(m)}\Big) = E\Big(E\Big(\sum_{i=1}^{N_m} Y_i^{(m)} \,\Big|\, N_m\Big)\Big) = E\Big(\sum_{i=1}^{N_m} E\big(Y_i^{(m)} \mid N_m\big)\Big) = E\Big(\sum_{i=1}^{N_m} E\big(Y_i^{(m)}\big)\Big) = E(N_m)\, E\big(Y_1^{(m)}\big),

where we used the tower property in the second, linearity of the expectation in the third, independence of $N_m$ and $(Y_i^{(m)})_{i \geq 1}$ in the fourth and the identically distributed property in the last equality.


As mentioned above, a fair premium (without loadings) is seen to be the expected loss $E(S_m)$ transferred to the insurance company with respect to the amount of time the insurer was liable for policy m. We define the pure premium as the expected total claim amount per duration or, using Proposition 4.2,

\text{Pure Premium} = E\left(\frac{S_m}{v_m}\right) = E\left(\frac{N_m}{v_m}\right) E\big(Y_1^{(m)}\big) = \lambda_m\, E\big(Y_1^{(m)}\big)    (4.1)
= \text{Expected Claims Frequency} \cdot \text{Expected Claim Severity}.

Motivated by (4.1), insurance companies set up two separate models: claims frequency and claim severity. For the remainder, we focus on claims frequency models.

4.2 Motor Insurance Collision Data

The Motor Insurance Collision Data Set contains all information collected by AXA Winterthur on

policies with a fully comprehensive (Vollkasko) car insurance with collision cover included from 2010

until 2014. For the data analysis we focus on collision claims only. In total the data consists of

approximately 450’000 policies. A list of all covariates we consider in this thesis can be found in the

Appendix in Table D.1. We begin with a few comments on the data structure (see Table 4.1 for an

illustration of the structure):

rec  pol  beg       end       yr     nc      ac    x1    ...  xp
1    1    01.01.09  12.04.09  0.279  1.0043  3487  a1    ...  b1
2    1    13.04.09  31.12.09  0.721  0       0     a1    ...  b1
3    2    01.01.09  31.12.09  1      0       0     a2    ...  b2
4    3    01.01.09  17.07.09  0.543  0       0     a3    ...  b3
5    3    18.07.09  15.12.09  0.416  0       0     a3'   ...  b3
6    3    16.12.09  31.12.09  0.041  1.0043  2676  a3'   ...  b3

Table 4.1 – Record structure of the data provided by AXA Winterthur.

AXA generates a new record for every policy at the end of each accounting year or if there has been a change in the data of the policy. The latter might occur when the client has reported a claim, but also simply when the client has moved to a neighbouring village, i.e. a change in the zip code (plz). Using this system, we can capture the development of all covariates for each policy. For example, from record 4 to record 5 the covariate x1 changed from a3 to a3'.

In order to account for incurred but not reported (IBNR) claims, the missing number of claims is estimated and uniformly added to the number of claims of observations where a claim has been reported. This explains why the number of claims (nc) is not integer-valued. When fitting a GLM with a Poisson assumption (see Ohlsson and Johansson (2010) or Chapter 7.3 in Wuthrich (2015)) for the number of claims, we need to be aware of this fact and adjust the data accordingly.

If a policyholder encounters two claims on the same day, the dataset has an additional obser-

vation containing the second claim with duration set to zero. This will lead to problems when

calculating the claims frequency. In order to avoid this problem, the duration is manually set

to 0.001.

4.2.1 Data Management

Based on the particular structure of the Motor Insurance Collision Data Set, we need to prepare the data before we can perform a statistical analysis. As shown in Table 4.1, every time a covariate has changed, AXA generates a new record in the database. This might artificially reduce the durations of the policies and therefore increase their frequencies. Hence we manipulated the data, resulting in two types of data sets:

Variations of the Data Set:

Data Set A:

1. Identify all relevant covariates considered in the statistical analysis.

2. Merge all records of the same policy where no changes in any relevant covariates have occurred. For example, in Table 4.1, if the covariate x1 is irrelevant for our purpose, then records (rec) 4 and 5 can be merged. Analogously, since no relevant covariates have changed, records 1 and 2 can also be merged.

When merging records the claim amounts (ac), number of claims (nc) and the durations (yr)

need to be added. The changes are only due to data management and no information gets lost.

However, there still remain policies with durations smaller than one. The problem becomes more severe if we consider many covariates in step 1 as relevant.

Data Set B:

1. Identify all relevant covariates considered in the statistical analysis.

2. Merge all records of the same policy.

The resulting data set will mostly contain observations with duration one. Only policies that were cancelled or signed during the year, or policies which were not active during the whole year (e.g. vintage cars), do not have a full year at risk.


Example 4.3. Figure 4.1 shows the proportions of policies with duration one contained in Data

Sets A and B, when considering the covariates age of driver, nationality, price of car and kilometres

yearly driven as relevant. The blue area denotes the total amount of observations in each data set

(normalized to 100 for Data Set A) and the grey area shows the amount of policies with a full year

at risk. For the analysis we have used the data of one accounting year.

Figure 4.1 – Proportions of observations with a full duration according to the variation of data set (Data Set A and Data Set B).

⋄

4.2.2 Descriptive Statistics

To get an understanding of the claims count distribution introduced in Definition 4.1, we examine a few descriptive statistics first. In this section, we restrict the data to one accounting year. In Figure 4.2, we see that the number of claims in Data Set A lies between zero and five.

Figure 4.2 – Histograms of the number of claims (nc) for Data Set A, on the original and on the logarithmic count scale.


From Figure 4.3, we see that the observed frequencies go up to 170 claims per year. Comparing with

Figure 4.2, we conclude that these outliers are due to small durations.

Figure 4.3 – Histograms of the claims frequency and duration for Data Set A.

The same analysis can be conducted for Data Set B. Due to the exhaustive merging, we expect fewer observations with small volumes. A histogram of the observed claims counts and frequencies is shown in Figure 4.4.

Figure 4.4 – Histograms of the number of claims (nc) and claims frequency for Data Set B.

Univariate Analysis

In order to better understand the data, we describe a few univariate statistics to get a first impression.

Since we cannot visualize a high-dimensional covariate space, we mainly focus on variables that are


likely to have an impact on the expected claims frequency. Some of them are also included in the pricing model used by AXA Winterthur. Explicitly, we consider the tariff criteria age of driver, nationality, price of car as well as kilometres yearly driven for the univariate analysis.

In order to make sound statistical statements, we always want to use some version of the law of large numbers based on multiple observations of the same underlying stochastic nature. Unless we have observed several years of data for each policy, it is difficult to treat each policyholder separately.

Denote by $N_i^{(t)}$ for $t \in \{2010, \ldots, 2014\}$ the claims count of policy $i$ in year $t$ with expected claims frequency $\lambda(x_i)$, where $x_i \in \mathbb{R}^p$ are the covariates of policyholder $i$. Assuming that the claims counts of policy $i$ are independent with $N_i^{(t)} \sim \text{Poi}(\lambda(x_i) v_i^{(t)})$, we may use the average frequency over the observed 5 year time period as a predictor for the expected claims frequency
$$\widehat{\lambda}(x_i) = \frac{1}{\sum_{t=2010}^{2014} v_i^{(t)}} \sum_{t=2010}^{2014} N_i^{(t)}.$$
The one standard error confidence bounds for $\widehat{\lambda}(x_i)$ are given by
$$E\big(\widehat{\lambda}(x_i)\big) \pm Vco\big(\widehat{\lambda}(x_i)\big)\, E\big(\widehat{\lambda}(x_i)\big) = \lambda(x_i) \pm Vco\big(\widehat{\lambda}(x_i)\big)\, \lambda(x_i). \tag{4.2}$$
Using the Poisson assumption, the coefficient of variation is given by
$$Vco\big(\widehat{\lambda}(x_i)\big) = \frac{1}{\sqrt{\lambda(x_i) \sum_{t=2010}^{2014} v_i^{(t)}}}.$$

Considering the Motor Insurance Collision Data Set, the expected claims frequencies lie in $\lambda(x_i) \in [5\%, 25\%]$. Assuming that the policyholder is a fairly good driver with an expected claims frequency of 10% and that he was insured over the whole period, i.e. $v_i^{(t)} = 1$ for all $t \in \{2010, \ldots, 2014\}$, we get a fairly large one standard error confidence interval of $\widehat{\lambda}(x_i) \in [-4\%, 24\%]$.

Therefore, instead of treating every policyholder separately, actuaries form subportfolios with similar risk characteristics. This means we partition the covariate space into different groups such that we still capture the characteristics of the covariates whilst ensuring enough observations per group

to be able to make statistical inference. Let $N_1^i, \ldots, N_{m_i}^i$ denote the claims counts of policies within risk class $i$. Assume that the $N_j^i \sim \text{Poi}(\lambda_i v_j)$ are independent for $j = 1, \ldots, m_i$ and for all $i$. Then the total number of claims of risk class $i$ is given by
$$N^i := \sum_{j=1}^{m_i} N_j^i \sim \text{Poi}\Big(\lambda_i \sum_{j=1}^{m_i} v_j\Big), \tag{4.3}$$
and denote by $v^i := \sum_{j=1}^{m_i} v_j$ the volume of risk class $i$. Then the coefficient of variation of the sample


claims frequency $N^i/v^i$ of risk class $i$ is given by
$$Vco\Big(\frac{N^i}{v^i}\Big) = \frac{1}{\sqrt{\lambda_i \sum_{j=1}^{m_i} v_j}}.$$

Again consider a class of fairly good drivers with expected claims frequency of $\lambda_i = 10\%$. In order to achieve a confidence interval for our estimate $\widehat{\lambda}_i = N^i/v^i \in [9.5\%, 10.5\%]$, we need to solve
$$\big| Vco(\widehat{\lambda}_i)\, \lambda_i \big| < 0.5\%.$$
This reveals that $v^i > 4000$, i.e. we need a volume of 4000 years at risk in risk class $i$. This shows how the uncertainty of a predictor can be reduced by forming homogeneous subportfolios, but also that in an insurance context we need rather large portfolios. Table 4.2 shows a summary of the data and subportfolios that we consider for the univariate analysis.

Variable Name   Type          Values   Description
ac              integer       >= 0     Total claim amount
nc              integer       N_0      Number of claims
yr              numerical     [0, 1]   Duration (years at risk)
age gr          categorical   1 - 8    Age of driver encoded in 8 groups
nat gr          categorical   1 - 4    Nationality of driver encoded in 4 groups
price gr        categorical   1 - 8    Price of the car encoded in 8 groups
km gr           categorical   1 - 5    Kilometres yearly driven encoded in 5 groups

Table 4.2 – Response variables and grouped covariates used for univariate analysis.

Note that in general the estimator $\widehat{\lambda}_i = N^i/v^i$ does not coincide with the sample mean of the observed frequencies, i.e.
$$\frac{N^i}{v^i} = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i = \frac{1}{m_i} \sum_{j=1}^{m_i} \frac{m_i}{\sum_{j=1}^{m_i} v_j}\, N_j^i \;\neq\; \frac{1}{m_i} \sum_{j=1}^{m_i} \frac{N_j^i}{v_j}.$$

Due to outliers in the observed claims frequencies $N_j^i/v_j$ caused by small durations (see Figure 4.3), the use of the sample mean for the expected claims frequency is questionable.

Lemma 4.4 (Estimator for expected claim frequency). For $j = 1, \ldots, m_i$, let $N_j^i$ denote the number of claims of policy $j$ in risk class $i$ with expected claim frequency $E(N_j^i/v_j) = \lambda_i$. Then
$$\widehat{\lambda}_i = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i \tag{4.4}$$


is an unbiased estimator for $\lambda_i$.

The proof follows by a direct calculation. Note that (4.4) balances small durations $v_j$ by first summing them up. This results in an estimator which is robust against outliers caused by small volumes $v_j$. Alternatively, if we assume that the policies within each risk group have independent Poisson-distributed claims counts $N_j^i$ with the same expected claims frequency, then the estimator (4.4) is also the maximum likelihood estimator.

Lemma 4.5. Let $N_1^i, \ldots, N_{m_i}^i$ be independent and $\text{Poi}(\lambda_i v_j)$-distributed for $\lambda_i > 0$. Then the maximum likelihood estimator for $\lambda_i$ is given by
$$\widehat{\lambda}_i = \frac{1}{\sum_{j=1}^{m_i} v_j} \sum_{j=1}^{m_i} N_j^i.$$

The proof of Lemma 4.5 follows from Example 2.3. Having introduced the concept of risk classes, we have not yet explained how these groups are formed. In practice, actuaries determine these risk groups exogenously, keeping in mind the trade-off between capturing the covariates' characteristics and the total volume of the group. Alternatively, we can use the tree construction algorithm to fit a univariate tree, which does not require an a priori specification of the risk groups. The resulting leaves can be interpreted as risk classes. For the univariate analysis below, we consider both the risk groups used by AXA and risk groups obtained by the use of the tree algorithm. More details on determining risk groups from regression trees can be found in Section 5.4. To get a better understanding of the data, we show for each tariff criterion and risk group the observed average claims frequency using the estimator (4.4). Additionally, we also provide error whiskers of plus/minus one standard error. The standard errors were derived assuming the claims counts $N_j^i$ within risk group $i$ are independent $\text{Poi}(\lambda_i v_j)$-distributed. It follows that

$$Var(\widehat{\lambda}_i) = \frac{1}{\big(\sum_j v_j\big)^2} \sum_j Var\big(N_j^i\big) = \frac{1}{\big(\sum_j v_j\big)^2} \sum_j \lambda_i v_j = \frac{\lambda_i}{\sum_j v_j},$$

where we used the independence in the first equality and the Poisson assumption in the second.

Therefore the one standard error whiskers are given by
$$\widehat{\lambda}_i \pm \sqrt{\frac{\widehat{\lambda}_i}{\sum_j v_j}}.$$
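The estimator (4.4) and the corresponding whiskers are straightforward to compute per risk group. The following is a minimal sketch in R, assuming a data frame dat with the columns nc (number of claims) and yr (duration) of Table 4.2 and a grouping variable group encoding the risk class; the function name and the column group are illustrative only.

freq_by_group <- function(dat) {
  groups <- split(dat, dat$group)
  res <- t(sapply(groups, function(g) {
    vol    <- sum(g$yr)            # total volume of the risk class
    lambda <- sum(g$nc) / vol      # estimator (4.4)
    se     <- sqrt(lambda / vol)   # one standard error
    c(volume = vol, frequency = lambda, lower = lambda - se, upper = lambda + se)
  }))
  as.data.frame(res)
}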

Remark 4.6. Using the central limit theorem (or directly by (4.3)), we can quantify the probability that the predictor $\widehat{\lambda}_i$ lies in the one standard error confidence bounds (4.2),
$$P\Big(\widehat{\lambda}_i \in \big[\lambda_i - Vco(\widehat{\lambda}_i)\lambda_i,\; \lambda_i + Vco(\widehat{\lambda}_i)\lambda_i\big]\Big) \approx 70\%.$$


The proof can be found in Appendix A.5.

Tariff Criterion 1: Age of Driver

First we consider the tariff criterion age of driver. The risk groups chosen by AXA are shown in the left panel of Figure 4.5. The blue bars denote the average claims frequency within each risk group, whereas the grey bars denote the total volume of each group. Figure 4.5 shows that there are large differences in the claims frequency with respect to age of driver. This indicates that the covariate plays an important role in a statistical analysis of the expected claims frequency. This result is not very surprising, since we expect younger drivers to be less experienced and therefore more likely to be involved in a car accident. We might expect a similar behavior for the older drivers, due to decreasing reaction speed or loss of good eyesight. The reason why we do not see this increase in Figure 4.5a is explained by the fact that the last risk group contains a rather large range of different age classes. This large age group contains roughly 15% of the whole portfolio. On the other hand, we might not be capturing the true dependency of the response on the covariate age of driver. We have repeated the above analysis with a refinement of the last age group. Figure 4.5b shows the expected U-shape for the modified groups. The third plot shows the results obtained when applying the tree algorithm for choosing the optimal risk groups. Although fewer groups are chosen, the tree captures the same characteristics. Overall the frequencies of the subportfolios vary between roughly 9% and 19% (based on the risk groups obtained by the tree predictor), supporting the relevance of age of driver when modeling the expected claims frequency of a policyholder.

Figure 4.5 – Bar plots showing the average claims frequency per age group (in blue) obtained from the Motor Insurance Collision Data Set. The left plot shows the age groups chosen by AXA, the middle plot the refined risk groups and the right plot the risk groups chosen by the tree predictor. The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.


Tariff Criterion 2: Nationality

The second tariff criterion we consider is nationality. In total there are around 160 different nationalities within the Motor Insurance Collision Data Set. However, among them are many exotic nationalities, e.g. Macao, that contain only very few observations, resulting in a total volume of less than one year at risk. Hence the need to create reasonable risk groups is even more pressing for these kinds of covariates. AXA has created 4 risk groups, shown in the left panel of Figure 4.6.

Figure 4.6 – Bar plots showing the average claims frequency per nationality group (in blue) obtained from the Motor Insurance Collision Data Set (left: nationality groups chosen by AXA, right: nationality groups chosen by the tree predictor). The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.

In the right panel of Figure 4.6 we see the risk groups chosen by the tree predictor. In both cases, a large group containing the low risk nationalities is chosen. Then there are two groups for medium risk nationalities and one group for the high risks. Note that for the nationality analysis, we have used a larger data set in order to have more observations for exotic countries. This is also the reason why the overall portfolio average in Figure 4.6 does not exactly coincide with the averages shown for the other tariff criteria. Overall the observed frequencies of the subportfolios vary between 11% and 20%, indicating that nationality plays an important role in predicting the claims frequency.

Tariff Criterion 3: Price of Car

The third tariff criterion is price of car. The risk groups chosen by AXA are shown in the left panel of Figure 4.7. We observe an approximately linear increase of the frequency in the covariate price of car. A similar structure is also captured by the corresponding univariate tree predictor. However, the tree adds an additional split for very expensive cars above 175'000 CHF. This suggests that drivers


whose car price exceeds a certain threshold start to drive more carefully. Another explanation might be that these are cars, e.g. vintage cars, which are only driven at the weekends, i.e. there is a correlation between price and another covariate such as kilometres yearly driven. Comparing the risk groups chosen by AXA and by the tree algorithm, AXA's risk groups are more equally sized. However, there is only a small difference in the average frequency between the groups 20-25 and 25-30. Looking at the tree predictor, we see that there exists only one risk group from 21-32, which is basically a combination of the two mentioned before. This suggests that we can reduce the number of risk groups whilst still capturing the underlying structure. Overall the observed claims frequencies of the different subportfolios vary between 8% and 17%, which is again a clear indication for an important covariate with respect to predicting the claims frequency.

Figure 4.7 – Bar plots showing the average claims frequency per price group (in blue) obtained from the Motor Insurance Collision Data Set (left: price groups chosen by AXA, right: price groups chosen by the tree predictor). The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.

Tariff Criterion 4: Kilometres Yearly Driven

The last tariff criterion we consider in the univariate analysis is kilometres yearly driven. The risk groups chosen by AXA are shown in the left panel of Figure 4.8. Very similar to the price covariate, we observe an increase of the average claims frequency in the amount of kilometres yearly driven. This is in line with the intuition that when driving larger distances per year, the exposure to possible accident events increases and therefore also the likelihood of a claim. The risk groups chosen by AXA are supported by the tree predictor. The only difference is that the tree does not split the risk group 6.5 to 14 into two separate groups as it is done by AXA. Overall the observed frequencies vary between roughly 8% and 13%. Although there seems to be a difference between the subportfolios,


the magnitude is less distinct than for the criteria previously considered. This may indicate that the criterion is not as important for the claims frequency analysis.

Figure 4.8 – Bar plots showing the average claims frequency per KM group (in blue) obtained from the Motor Insurance Collision Data Set (left: KM groups chosen by AXA, right: KM groups chosen by the tree predictor). The dotted horizontal lines show the overall portfolio averages. Additionally to the average frequency, the grey bars show the total volumes of the risk groups.

4.2.3 Multivariate Analysis

Ultimately, our goal is to have a multivariate model, taking dependencies between the covariates into account, e.g. between kilometres yearly driven and expensive cars. In Figure 4.9, the lower diagonal part shows a scatter plot of the continuous covariates age of driver, price of car and kilometres yearly driven. We observe a decrease in the kilometres yearly driven for cars with a price above 40'000 CHF. The upper diagonal shows the correlation coefficients between the covariates. The correlations do not show any sign of dependence between the covariates. On the diagonal, we see a histogram for each covariate. For age of driver, we observe a normally-distributed shape. The covariates price of car and kilometres yearly driven are plotted on a log-scale and reveal a heavily skewed shape with a few very expensive cars resp. drivers with a large number of kilometres driven per year.


Figure 4.9 – Scatter and correlation plot for the covariates age of driver, price of car and kilometres yearly driven. The figure is based on a random subsample of 10% of the whole data set.

4.3 Simulated Insurance Claims Count Data

Besides the Motor Insurance Collision Data Set, we will also use simulated data sets. This has the advantage that we can artificially manipulate certain characteristics in order to identify the key issues that are important when fitting a tree predictor to an insurance data set. The simulated data tries to recover the characteristics discovered in the previous section and depends on two covariates: age of driver and price of car. More precisely, we have chosen the following model for the frequencies:

$$\lambda(\text{age}, \text{price}) = \frac{8\lambda_0}{100}\left(5 + \frac{(\text{age} - 50)^2}{96} + \frac{5\,\text{price}}{36}\right), \tag{4.5}$$

where $\lambda_0$ denotes a scaling factor which controls the overall portfolio average. Figure 4.10 shows the model equation (4.5) for $\lambda_0 = 0.1$ and age $\in [18, 90]$ resp. price $\in [15, 85]$. The modeled frequencies


lie in $\lambda(\text{age}, \text{price}) \in [5.6\%, 26.7\%]$.
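For reference, model equation (4.5) can be written down directly in R; the following is a small sketch in which the function name lambda_model is ours and price is measured in 1'000 CHF, as in Figure 4.10.

lambda_model <- function(age, price, lambda0 = 0.1) {
  # model equation (4.5); price in 1'000 CHF
  8 * lambda0 / 100 * (5 + (age - 50)^2 / 96 + 5 * price / 36)
}

# range over the grid of Figure 4.10: roughly [0.056, 0.267]
range(outer(18:90, 15:85, lambda_model))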

Figure 4.10 – Illustration of claims frequencies obtained by model equation (4.5) with $\lambda_0 = 0.1$ (left: bivariate view, right: univariate views).

4.3.1 Negative-Binomial Distribution

In order to simulate from model equation (4.5), we need to specify the claims count distribution. The Poisson distribution has the special characteristic that its expected value equals the variance, i.e. $E(N) = Var(N)$. Unfortunately, this is usually not supported by real data such as the Motor Insurance Collision Data Set. Insurance claims data rather suggests that $E(N) < Var(N)$. In order to account for this issue, we introduce the negative-binomial distribution.

Definition 4.7 (Negative-Binomial Distribution). $N$ has a negative-binomial distribution, write $N \sim \text{NegBin}(\lambda v, \delta)$, with volume $v > 0$, expected claims frequency $\lambda > 0$ and dispersion parameter $\delta > 0$, if

1. $\Theta \sim \Gamma(\delta, \delta)$,

2. conditionally, given $\Theta$, $N \sim \text{Poi}(\Theta \lambda v)$.

The idea of Definition 4.7 is to attach a source of uncertainty to the claims frequency $\lambda$. This additional volatility leads to the desired property,
$$Var(N) = E\big(Var(N \mid \Theta)\big) + Var\big(E(N \mid \Theta)\big) = v\lambda + (v\lambda)^2/\delta > \lambda v = E(N).$$


For the simulation studies in the following chapters, we will use equation (4.5) to model negative-

binomially distributed claims data, where we choose the dispersion parameter as well as the overall

portfolio average $\lambda_0$ from the Motor Insurance Collision Data Set. The choice of parameters is further

discussed in Section 4.3.2.

Remark 4.8. Note there is an important difference between the Poisson-Gamma Model (Definition 3.11) and the Negative-Binomial Model. Assume that $N_1, \ldots, N_m$ are independent negative-binomial with $N_i \sim \text{NegBin}(\lambda(x_i)v_i, \delta)$, which means that each $N_i$ has its own independent latent factor $\Theta_i \sim \Gamma(\delta, \delta)$. For the Poisson-Gamma Model, on the other hand, the latent factor $\Lambda$ is identical for each $N_i$ and the $N_i$ are only conditionally independent.

In order to simulate claims counts from model equation (4.5), we have implemented a data generator,

which proceeds in the following steps:

Algorithm 4.1 Generating Data for Simulation

Inputs: Sample size $n$, dispersion parameter $\delta$, average overall portfolio frequency $\lambda_0$.

1. Generate independent covariate matrix $(\text{age}, \text{price}) \in \mathbb{R}^{2 \times n}$.

2. Calculate expected claims frequency $\lambda(\text{age}, \text{price})$ according to model equation (4.5).

for $i = 1, \ldots, n$ do
    Generate $\Theta \sim \Gamma(\delta, \delta)$.
    Generate $N_i \sim \text{Poi}(\lambda((\text{age}, \text{price})_i)\, \Theta)$, where $(\text{age}, \text{price})_i$ is the $i$-th row of $(\text{age}, \text{price})$.
end for

For simplicity, the durations $v_i$ are set to one for all $i$. Note that the covariates are also simulated; a minimal sketch of such a generator is given below.
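The following R sketch implements Algorithm 4.1; it is not the implementation used for the thesis. It assumes a function lambda_model implementing model equation (4.5) (such as the sketch following that equation above) and resamples the covariates from a data frame portfolio with columns age and price, standing in for the empirical distribution of Assumption 4.9.

simulate_claims <- function(n, delta, lambda0, portfolio) {
  idx   <- sample(nrow(portfolio), n, replace = TRUE)  # step 1: draw covariates
  age   <- portfolio$age[idx]
  price <- portfolio$price[idx]
  lam   <- lambda_model(age, price, lambda0)           # step 2: expected frequencies
  theta <- rgamma(n, shape = delta, rate = delta)      # latent mixing factors
  N     <- rpois(n, lambda = lam * theta)              # conditional Poisson draws
  data.frame(age = age, price = price, v = 1, N = N)
}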

For the simulation studies, we will use the following setting:

Assumption 4.9 (Simulation Study Data Set).

- Number of observations $n = 2'500'000$;
- Overall claims frequency $\lambda_0 = 0.1$;
- Covariate matrix (age, price) simulated with $(\text{age}, \text{price})_i \sim P$, where $P$ denotes the empirical distribution of the covariates age and price in the Motor Insurance Collision Data Set;
- Claims count distribution $N_i \sim \text{NegBin}(\lambda((\text{age}, \text{price})_i), \delta)$, where $\lambda(\cdot)$ is given by model equation (4.5) and $\delta = 1.5$, see Section 4.3.2.

4.3.2 Choice of Parameter

The simulated data set contains the two parameters $\lambda_0$ and $\delta$. Therefore we briefly discuss how we choose them for the simulated data according to Assumption 4.9.


Dispersion Parameter for Negative-Binomial Model:

In order to simulate negative-binomially distributed claims counts, we need to estimate the dispersion parameter $\delta$. Assume that $N_1, \ldots, N_n$ are independent with $N_i \sim \text{NegBin}(\lambda(x_i)v_i, \delta)$. We describe a maximum likelihood based approach to determine $\delta$, see Section 2.4. The derivative of the log-likelihood function for the claims counts $N_1, \ldots, N_n$ is given by
$$\frac{\partial}{\partial \delta} \log L(\delta) = \frac{\partial}{\partial \delta} \sum_{i=1}^{n} \left[ \log \binom{N_i + \delta - 1}{N_i} + \delta \log(1 - p_i) + N_i \log p_i \right] = 0, \tag{4.6}$$
where $p_i = \lambda(x_i)v_i/(\delta + \lambda(x_i)v_i)$. The proof of equation (4.6) is based on the Polya parametrization of the negative-binomial distribution and can be found in Chapter 2 of Wuthrich (2015). Unfortunately, to solve this equation, we need to know the true $\lambda(x_i)$ for $i = 1, \ldots, n$. However, in practice we do not know the true underlying model equation. In fact, this is rather our ultimate goal, which we want to predict. In order to derive an estimate for $\delta$ from the Motor Insurance Collision Data Set, we may choose $\lambda(x_i) \equiv \sum_{i=1}^{n} N_i / \sum_{i=1}^{n} v_i$ as the overall portfolio average. Then we can solve equation (4.6) numerically, resulting in an estimate of $\delta \approx 1.08$. However, this estimate is likely to be too small, since it does not reflect the structural differences. Therefore, we estimate $\delta$ again on the subset of policies with age $\in [40, 80]$ and price $\in [400000, 600000]$. This is the largest risk class chosen by the bivariate regression tree on the Motor Insurance Collision Data Set, see Section 5.5. This procedure takes the underlying structure into account such that the additional variability in the data only comes from the overdispersion parameter. This results in an estimate of $\delta \approx 1.5$. Using that
$$Vco\Big(\frac{N_i}{v_i}\Big) = \sqrt{\frac{1}{v_i\,\lambda(x_i)} + \frac{1}{\delta}},$$
as well as $\lambda(x_i) \in [5.6\%, 26.7\%]$, we get a coefficient of variation of between 2.1 and 4.3 for policies with a full duration.
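As an illustration, the numerical estimation of $\delta$ can also be carried out by maximizing the negative-binomial log-likelihood directly rather than solving (4.6); the following is only a sketch and not the procedure used for the estimates above. It assumes vectors N and v of claims counts and volumes and a fixed frequency lambda, and uses the fact that R's dnbinom with size = delta and mu = lambda * v corresponds to NegBin(lambda v, delta).

estimate_delta <- function(N, v, lambda) {
  negbin_loglik <- function(delta) {
    sum(dnbinom(N, size = delta, mu = lambda * v, log = TRUE))
  }
  # maximize the log-likelihood over a plausible range for delta
  optimize(negbin_loglik, interval = c(0.01, 100), maximum = TRUE)$maximum
}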

Overall Portfolio Average $\lambda_0$:

The scaling of model equation (4.5) is chosen such that when simulating from Assumption 4.9, the overall portfolio average
$$\frac{1}{\sum_{i=1}^{n} v_i} \sum_{i=1}^{n} N_i$$
is roughly equal to $\lambda_0$. For the Motor Insurance Collision Data Set we observe an overall portfolio frequency close to 10%, hence we choose $\lambda_0 = 0.1$.


Chapter 5

Data Analysis: Tree Model

In the present chapter, we fit the tree model introduced in Chapter 3 to insurance claims data. We

focus on the issue of predicting the expected claims frequency of a policyholder. First we use a

simulated data set and in a second step the Motor Insurance Collision Data Set. The data analysis

was conducted using the programming language R.

5.1 Fitting a Tree Model in R

The package rpart provides an efficient implementation of the recursive binary partitioning introduced

in Section 3.1.2. For growing a tree, we use the function rpart. For illustration, we first present a

detailed example of the functionality of the rpart package:

Step 1: Growing a tree

Given a response y = N/v and covariates x1, . . . , xp, we can fit a tree using the rpart function as

follows:

tree_anova <- rpart(y ~ x1 + ... + xp, data = data, method = "anova",
                    weights = weights,
                    control = rpart.control(minbucket = minbucket,
                                            usesurrogate = usesurrogate, cp = cp))

tree_poisson <- rpart(cbind(v, N) ~ x1 + ... + xp, data = data, method = "poisson",
                      parms = list(shrink = k),
                      control = rpart.control(minbucket = minbucket,
                                              usesurrogate = usesurrogate, cp = cp))

Comments:

The input method specifies the type of tree. Options are: anova for classical sum of squares

regression (see Section 3.1.2), poisson for Poisson regression trees (see Section 3.2.1) and class

for Gini-index based classification trees (see Section 3.2.3).


The optional input weights can be used in combination with the method anova to obtain a

weighted anova tree (see Section 3.2.2).

minbucket sets the minimum number of observations to be contained in a leaf.

cp denotes a factor, which specifies the minimum improvement of the overall fit such that a split

will be attempted. See Remark 3.9 and Step 2 below for more details. Choosing cp = 0 grows

a tree as large as possible, until no further split is possible. However choosing a reasonable

cp-value may save a large amount of computing time.

The input usesurrogate controls the use of surrogate variables (see Section 3.3.2).

For the method poisson, we can hand over a shrinkage parameter k, which denotes the coe�cient

of variation of the prior distribution in the Poisson-Gamma model, see Section 3.2.1 as well as

Section 5.3.1.

For more details we refer to Therneau et al. (2015).

Step 2: Extracting the optimal pruning parameter

Before we can proceed with the pruning process, we need to choose an optimal tuning parameter $\alpha$ (see Section 3.1.3). This is usually done by K-fold CV (see Section 2.3.1). Recall from Section 3.1.3 that there exists an increasing sequence $\alpha_k$ of optimal pruning parameters, each corresponding to an optimally pruned subtree $T(\alpha_k)$. Figure 5.1 shows the cross-validation errors of the optimally pruned subtrees $T(\alpha)$ as a function of the cost-complexity parameter cp, see Remark 3.8. Figure 5.1a shows the cross-validation error (2.7), whereas Figure 5.1b shows the cross-validation error normalized by $R(T(\infty))$, i.e. the error of the root tree, denoted by X-val Relative Error. This corresponds to the output of the built-in function plotcp. As we show in the next sections, choosing the optimal $\alpha$ resp. cp is a delicate issue. The most common selection criteria are:

Minimum Cross-Validation Error: Selecting the subtree with minimal cross-validation error,

see black vertical line in Figure 5.1a.

1-Standard Error (1se) Rule: In practice, one often encounters a U-shaped cross-validation

error plot with a flat valley around the minimum. Breiman et al. (1984) suggested to choose

the smallest subtree within one standard error of the subtree with minimal cross-validation

error. This results in a tree with only slightly larger error, but possibly significant reduction

in complexity of the model, see red vertical line in Figure 5.1a.

Elbow Criterion: Negassa et al. (2000) suggested to choose the subtree corresponding to the point beyond which the cross-validation error stops decreasing rapidly. In Figure 5.1a the elbow criterion coincides with the 1se rule.
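For illustration, the first two criteria can be read off the cptable returned by rpart (columns CP, xerror and xstd when cross-validation is enabled); the following small helper is a sketch of ours and not part of the rpart package.

select_cp <- function(tree) {
  tab    <- tree$cptable
  i_min  <- which.min(tab[, "xerror"])                 # minimum cross-validation error
  thresh <- tab[i_min, "xerror"] + tab[i_min, "xstd"]  # one standard error threshold
  i_1se  <- which(tab[, "xerror"] <= thresh)[1]        # smallest subtree within 1se
  c(cp_min = tab[i_min, "CP"], cp_1se = tab[i_1se, "CP"])
}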


Figure 5.1 – The figure shows the 10-fold cross-validation error (2.7) as a function of the cost-complexity parameter cp. The vertical lines denote possible choices of the cp parameter according to different selection criteria. The vertical error bars show the standard errors (2.8). On the left, we see the actual cross-validation error, whereas on the right we see the error normalized by the error of the root tree. The plot on the right-hand side can be obtained by the function plotcp provided within the rpart package.

Step 3: Pruning

Having chosen the optimal cost-complexity parameter cp resp. $\alpha$, we need to prune to the optimal subtree $T(\alpha)$, as described in Section 3.1.3. The rpart package provides the function prune for this

task. The following command returns the desired optimally pruned subtree.

tree_pruned <- prune(tree, cp = cp_opt)

5.2 Regression Tree for Claims Frequency

In order to analyze the performance of the tree method within the context of insurance claims data,

we proceed in the following steps:

1. Simulated Data Set: We start our analysis on a generated data set in Section 5.3. This

has the advantage that we know the true underlying model, i.e. when manipulating the

parameters, we can compare the fitted to the true values.

2. Univariate Trees: When applying the method to the Motor Insurance Collision Data Set,

we start with univariate regression trees in Section 5.4. This allows us to understand the splits


in more detail and we can use the results to back-test the risk classes introduced in Section

4.2.2.

3. Bivariate Trees: In Section 5.5, we slightly increase the complexity by using two covariates.

For two covariates, we can still understand the interactions and are able to visualize the results.

4. Multivariate Trees: Finally, in Section 5.6 we consider a fully multivariate model, taking

into account all covariates recorded.

Given the volume $v > 0$, we model the response (2.1) by
$$Y = \frac{N}{v} = f(X) + \varepsilon,$$
where $N$ denotes the claims count. Given the volume $v_i > 0$, we define the frequency of the $i$-th policy by
$$\frac{N_i}{v_i} = \frac{N}{v_i}\Big|_{X = x_i} = f(x_i) + \varepsilon_i.$$
Let $N_1, \ldots, N_n$ denote the claims counts for each policy, $v_1, \ldots, v_n$ their respective durations and expected claims frequencies for $i = 1, \ldots, n$ given by
$$\lambda(x_i) = E\Big(\frac{N_i}{v_i}\Big) = E\Big(\frac{N}{v_i}\,\Big|\, X = x_i\Big) = f(x_i).$$

In the context of regression trees, our predictor is of the form
$$\widehat{\lambda}(x_i) = \widehat{f}_n(x_i) = \sum_{m=1}^{M} \widehat{\lambda}_m\, 1_{\{x_i \in R_m\}},$$
where $R_m$, $m \geq 1$, denote the leaves of the resulting tree and $\widehat{\lambda}_m$ the constant in each leaf. In order to benchmark the results, we compare them to a standard GLM model with a Poisson assumption (see Ohlsson and Johansson (2010) and Chapter 7, Wuthrich (2015)).
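A minimal sketch of such a Poisson GLM benchmark is given below, assuming a data frame dat with the grouped covariates of Table 4.2 (the variable names are illustrative only) and the duration entering as an offset on the log scale.

glm_poisson <- glm(nc ~ age_gr + nat_gr + price_gr + km_gr + offset(log(yr)),
                   family = poisson(link = "log"), data = dat)

# predicted expected claims frequencies per year at risk
lambda_glm <- predict(glm_poisson, type = "response") / dat$yr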

5.3 Regression Tree for Simulated Claims Frequency

In a first step, we analyze the tree method on a simulated data set. This has the advantage that we can artificially manipulate certain characteristics in order to identify the key issues that are important when fitting a tree predictor to an insurance data set. The simulated data introduced in Section 4.3 tries to recover the characteristics of the Motor Insurance Collision Data Set and involves the two covariates age of driver and price of car. When fitting a regression tree based on the weighted sum of squares option, we use the weights $w_i = v_i$. This results in the model introduced in Example 3.17. When simulating from Assumption 4.9, all observations have a volume of one year at risk. Hence the MLE estimators of the methods Anova, Weighted Anova and Poisson coincide. For the Weighted


Anova and Anova also the minimization criteria coincide, therefore they result in the same tree.

Simulating from Assumption 4.9, the claims counts are conditionally Poisson-distributed, therefore

we choose the Poisson method to demonstrate the results.

5.3.1 Choice of Parameter for Poisson Tree Predictor

For the Poisson tree predictor we have two additional parameters, which need to be estimated from

the Motor Insurance Collision Data Set.

Minimum Leaf Size:

In Section 4.2.2, we have seen that for Poisson-distributed claims counts, we need a total of 4000

years at risk within a risk class to obtain su�ciently small confidence bounds, i.e. standard errors

smaller than ±0.5%. Choosing the minimum leaf size su�ciently large may also reduce the variability

of the predictor, see Section 3.4.2. Since each leaf of a tree predictor denotes a risk class, we set the

minimum leaf size parameter minbucket = 4000. Therefore the minimum volume in a leaf is equal

or slightly smaller than 4000 years at risk, if there are observations without a full duration.

Shrinkage Parameter for Poisson-Gamma Model:

The Poisson method contains the shrinkage parameter $k = Vco(\Lambda)$, which denotes the coefficient of variation in the Poisson-Gamma Model assumed in each leaf, see Section 3.2.1. Again, choosing $k$ from a real world data set is a delicate issue. If we assume that in each leaf $l$ the claims counts $N_1, \ldots, N_{m_l}$ are described by a Poisson-Gamma model, then we would choose a different $k$ for each leaf. However, we do not know the leaves in advance; instead we use the tree to determine them. Therefore the algorithm is designed to choose the same $k$ for all leaves. In order to obtain an estimate for $k$, we consider the largest risk group chosen by the bivariate regression tree (see Section 5.5) with age $\in [40, 80]$ and price $\in [400000, 600000]$. Let $N_1, \ldots, N_m$ be the claims counts of policies in this risk class. Only considering one leaf removes the structural differences between the leaves. Therefore it is reasonable to assume that all observations have the same expected frequency. Recall from Definition 3.11 that, conditionally given $\Lambda$, $N_1, \ldots, N_m$ are independent with $N_i \sim \text{Poi}(\Lambda v_i)$ and prior distribution $\Lambda \sim \Gamma(\gamma, c)$. The parameter $c$ is calibrated using $\gamma/c = E(\Lambda) = \lambda_0$, where $\lambda_0$ is estimated by the average frequency within the risk class, i.e.
$$\widehat{\lambda}_0 = \frac{1}{\sum_{i=1}^{m} v_i} \sum_{i=1}^{m} N_i.$$
To estimate the parameter $\gamma$, note that
$$E(N_i) = E\big(E(N_i \mid \Lambda)\big) = E(\Lambda v_i) = \lambda_0 v_i,$$


and since $Var(\Lambda) = \gamma/c^2 = \lambda_0^2/\gamma$ we have that
$$Var(N_i) = Var\big(E(N_i \mid \Lambda)\big) + E\big(Var(N_i \mid \Lambda)\big) = v_i^2 \frac{\lambda_0^2}{\gamma} + v_i \lambda_0.$$
Therefore the coefficient of variation for $N_i$ is given by
$$Vco(N_i) = \sqrt{\frac{1}{v_i \lambda_0} + \frac{1}{\gamma}}.$$
In order to estimate $\gamma$, we restrict the data to observations with a full duration, i.e. $v_i = 1$, and use the method of moments estimator for $Vco(N_i)$ given by
$$\widehat{Vco}(N_i) = \frac{\widehat{sd}(N_1, \ldots, N_m)}{\widehat{\mu}(N_1, \ldots, N_m)} = \sqrt{\frac{1}{\lambda_0} + \frac{1}{\gamma}},$$
where $\widehat{\mu}$ denotes the sample mean and $\widehat{sd}$ the sample standard deviation. Solving for $\gamma$ results in an estimate $\gamma \approx 1.37$ and therefore a shrinkage parameter of $k = 1/\sqrt{\gamma} \approx 0.87$.
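In R, this method of moments estimate can be obtained in a few lines; the sketch below assumes a vector N containing the claims counts of the full-duration policies in the chosen risk class.

lambda0 <- mean(N)                        # average frequency (v_i = 1)
vco_hat <- sd(N) / mean(N)                # empirical coefficient of variation
gamma   <- 1 / (vco_hat^2 - 1 / lambda0)  # solve Vco(N)^2 = 1/lambda0 + 1/gamma
k       <- 1 / sqrt(gamma)                # shrinkage parameter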

Note that we have set the minimum leaf size to 4000 observations. Assume for simplicity that $v_i = 1$ for all observations in a leaf with average frequency $\lambda_0 = 10\%$. By equation (3.10), the credibility weight $\alpha$ is given by
$$\alpha(k) = \frac{4000}{\frac{1}{0.1\,k^2} + 4000}.$$
Figure 5.2 shows the credibility weights as a function of the coefficient of variation. We observe that $\alpha(k)$ converges to 1 very quickly, meaning that we mainly believe in the observation based estimator. We conclude that the parameter is only of minor importance regarding the outcome of the predictor. It also justifies the use of the same $k$ for every leaf.

Figure 5.2 – Credibility weights $\alpha(k)$ as a function of the coefficient of variation.


5.3.2 Stratified Cross-Validation

Classically the optimal tree size is obtained by cross-validation, see Step 2 in Section 5.1. In this

section, we illustrate several issues associated with the use of cross-validation for insurance claims data

and propose stratified cross-validation as a possible solution.

Breiman et al. (1984) suggest to use the one standard error (1se) rule instead of the minimal

cross-validation error subtree for the choice of tree size. However Negassa et al. (2000) have explicitly

studied tree size selection criteria for Poisson regression trees and observed that the standard errors

of the cross-validation results are often very large, leading to a pessimistic interpretation of the one

standard error rule. They proposed the elbow criterion as a possible solution. In order to analyze the

situation for insurance claims data, we have generated a data set using Assumption 4.9. In Figure 5.3

the output of the standard plotcp function is shown. Note that we observe only a small improvement

of slightly more than 1% from the root tree to the minimum cross-validation error tree in terms of

cross-validation error. Additionally, we observe that the standard errors for the cross-validation folds

are fairly large. Similar to the findings of Negassa et al. (2000), this causes the 1se rule to choose a

small tree. In our case the 1se rule results in a tree with only 4 leaves in contrast to 56 leaves for

the minimum cross-validation error (Min CV) tree, indicated by the red resp. black vertical lines in

Figure 5.3. The elbow criterion on the other hand chooses a tree with 9 leaves. Summing up, there

Figure 5.3 – Cross-validation results obtained by the plotcp function.

are two main issues associated with the cross-validation results in Figure 5.3:

1. Why are the relative improvements to the mean model (i.e. root tree) only of the magnitude

of roughly 1%?

2. Why are the standard errors of the cross-validation errors so large?


The answer to both questions is related to the unbalancedness of the data. Figure 5.4 shows a

histogram of the observed frequencies of the generated data set. We see that the data is unbalanced

in the sense that the vast majority of observations are zero due to the low expected frequency.

Considering the mean squared error as a measure of goodness, by Proposition 2.2 we have the following decomposition
$$E\big((Y - \widehat{Y})^2\big) = E\Big(\text{Bias}\big(\widehat{f}(X, \mathcal{L})\big)^2\Big) + E\Big(\text{Var}\big(\widehat{f}(X, \mathcal{L})\big)\Big) + \text{Var}(\varepsilon),$$
which is estimated by the cross-validation error (see Section 2.3.1). Choosing the optimal model, we can only improve the first two terms. However, for unbalanced data the variability of the residuals $\varepsilon = Y - f(X)$ is dominant. For the data shown in Figure 5.4, we have calculated the residuals and used the sample variance to estimate the third term. We observe that the sample variance of the residuals is roughly 99% of the cross-validation error. Therefore it dominates the other two terms, which implies that we can only expect to see small differences relative to the root tree.

Figure 5.4 – Histogram of unbalanced data.

The second question could be ignored when using the elbow criterion. However, the subjectivity associated with the elbow approach is not very satisfactory. For unbalanced data, Breiman et al. (1984) suggested the use of stratified cross-validation for estimating the generalization error. Since, due to a small $\lambda_0$, the majority of policies have zero claims, when dividing the data randomly into K equally sized subsets it may occur that one of the folds contains a disproportionate amount of large frequencies. This sample will inflate the sum of squares resp. the deviance statistics, which eventually shows up in a large standard deviation. The idea of stratified sampling for the cross-validation folds is to first order the data according to the observed frequencies. Next, form bins containing the K smallest frequencies, then the K next smallest and so on. In the end, we draw from these bins without replacement, resulting in K partitions of the data which all have a similar distribution of the observed frequencies, see Algorithm 5.1.


Algorithm 5.1 Stratified sampling

Input: Learning data set $\mathcal{L} = \{(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)\}$.
Output: Stratified cross-validation folds $(\mathcal{B}_k)_{k=1,\ldots,K}$.

1. Order the data according to a stratification variable, for us the claims frequency $y_i = N_i/v_i$, resulting in a permutation $\pi(i)$ for $i = 1, \ldots, n$.

2. Define the index sets
$$B_k := \{i \in \{1, \ldots, n\} : (\pi(i) - k) \bmod K = 0\},$$
where mod denotes division with remainder.

3. Set
$$\mathcal{B}_k := \{(x_i, N_i, v_i) \in \mathcal{L} : i \in B_k\}.$$
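A minimal sketch of Algorithm 5.1 in R is given below (it is not the Appendix C.2 implementation); it assumes vectors N and v and returns a fold label in 1, ..., K for every observation.

stratified_folds <- function(N, v, K = 10) {
  y   <- N / v                    # stratification variable: observed frequency
  n   <- length(y)
  ord <- order(y)                 # observation indices sorted by frequency
  folds <- integer(n)
  # walk through the sorted data in bins of size K and hand out the fold
  # labels 1..K within each bin in random order (without replacement)
  for (start in seq(1, n, by = K)) {
    bin <- ord[start:min(start + K - 1, n)]
    folds[bin] <- sample(K)[seq_along(bin)]
  }
  folds
}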

Remark 5.1. A close inspection of the source code of the rpart-package reveals that stratified

sampling is actually already implemented. The code translated into R looks as follows:

data <- data[order(data$y), ]               # order data according to the response y
xgroups <- rep(1:K, length = nrow(data))    # calculate the K different folds for cross-validation

Unfortunately, when using the Poisson method, the response variable is an $(n \times 2)$-matrix, where the first column contains the volumes $v_i$ and the second the observed claims counts $N_i$. The implementation orders the data according to the first column, which does not lead to our understanding of stratified sampling.

Due to Remark 5.1, we have implemented the cross-validation procedure separately. The R code

can be found in the Appendix C.2. Figure 5.5 shows the results obtained for both classical and

stratified cross-validation. We see that stratification significantly reduces the variance, whereas the

cross-validation error itself remains fairly stable. The reduction in variance implies that the chosen

tree size becomes larger when using the 1se rule.

5.3.3 Optimal Tree Size

Since we use a simulated data set, we know the true underlying model equation (4.5) and can therefore

compute the exact estimation error of the obtained tree predictors. This enables us to compute the

optimal tree size and compare this to different selection criteria.


Figure 5.5 – Comparison of stratified and classical cross-validation (left: stratified 10-fold cross-validation, right: classical 10-fold cross-validation).

Measure of Performance

A priori it is not clear which measures of performance resp. measures of prediction error should be used. We will discuss different possibilities and highlight their advantages and drawbacks. Assume we are given a learning sample $\mathcal{L} := \{(x_1^l, N_1^l, v_1^l), \ldots, (x_m^l, N_m^l, v_m^l)\}$, where the claims counts $N_i^l$ are independent with expected claims frequency $\lambda(x_i^l)$, as well as an independent test data set $\mathcal{T} := \{(x_1, N_1, v_1), \ldots, (x_n, N_n, v_n)\}$, also with independent claims counts $N_i$ and expected claims frequency $\lambda(x_i)$. Based on the training set $\mathcal{L}$, we fit the predictor $\widehat{\lambda}(x_i)$ and then evaluate the performance on the test data set $\mathcal{T}$, using the following criteria:

Absolute Frequency Deviation: The absolute frequency deviation measure calculates the sum of absolute deviations of the predicted frequency from the true frequency,
$$\sum_{i=1}^{n} \big| \widehat{\lambda}(x_i) - \lambda(x_i) \big|. \tag{5.1}$$
The absolute error already penalizes small prediction errors. It is therefore the favorite choice when aiming to find the true underlying structure. Note that typically $|\widehat{\lambda}(x_i) - \lambda(x_i)| < 1$, hence a quadratic loss function would shrink the error towards zero, see Figure 5.6. However, due to the large volume of our portfolio, already a small deviation from the true $\lambda$ has a large impact. This explains the choice of the absolute rather than the squared error. Note there is a catch regarding the absolute frequency deviation: in practice, even if we have an independent test data set, we do not know the function $\lambda(x)$. Therefore its use is limited to simulated data.

Absolute Claims Deviation: The absolute claims deviation measure calculates the sum of


Figure 5.6 – Comparison of the absolute versus squared loss for small errors.

absolute deviation of the predicted claims count from the observed claims count,
$$\sum_{i=1}^{n} \big| \widehat{N}_i - N_i \big|, \tag{5.2}$$
where $\widehat{N}_i = \widehat{\lambda}(x_i)v_i$. The absolute claims deviation can also be used when the underlying model structure is unknown. Note that for the simulated data we set $v_i = 1$ for all $i$, therefore the absolute claims deviation is equal to
$$\sum_{i=1}^{n} \Big| \widehat{\lambda}(x_i) - \frac{N_i}{v_i} \Big|,$$
i.e. the absolute observed frequency deviation.

Out of Sample Deviance: For the out of sample deviance, we evaluate the Poisson deviance statistics on the test set $\mathcal{T}$,
$$\sum_{i=1}^{n} N_i \log\Big(\frac{N_i}{\widehat{\lambda}(x_i) v_i}\Big) - \big(N_i - \widehat{\lambda}(x_i) v_i\big).$$

The idea of using the deviance statistics is based on the maximum likelihood approach discussed in

Section 2.4.

Out of Sample Prediction Error: The out of sample prediction error compares the total number of predicted claims $\widehat{N} = \sum_{i=1}^{n} \widehat{\lambda}(x_i) v_i$ versus the total number of observed claims in the test data set $N := \sum_{i=1}^{n} N_i$,
$$\big| \widehat{N} - N \big| = \Big| \sum_{i=1}^{n} \big( \widehat{\lambda}(x_i) v_i - N_i \big) \Big|. \tag{5.3}$$


A small error means that we are able to predict the overall number of claims fairly well. Instead of considering the whole portfolio, we may also consider each risk class separately and use a weighted sum. Let $R_1, \ldots, R_M$ denote the regions obtained by the tree algorithm, then consider the prediction error
$$\sum_{m=1}^{M} w_m \Big| \sum_{\{i :\, x_i \in R_m\}} \big( \widehat{\lambda}(x_i) v_i - N_i \big) \Big|,$$
where $w_m$, $m \geq 1$, denote a priori specified weights. However, for the remainder we only consider the out of sample prediction error according to equation (5.3).
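For completeness, the four measures are easy to compute on a test set; the following sketch assumes vectors of predictions lambda_hat, true frequencies lambda_true (available for simulated data only), observed counts N and volumes v.

abs_freq_dev   <- function(lambda_hat, lambda_true) sum(abs(lambda_hat - lambda_true))  # (5.1)
abs_claims_dev <- function(lambda_hat, N, v) sum(abs(lambda_hat * v - N))               # (5.2)
oos_deviance   <- function(lambda_hat, N, v) {
  mu <- lambda_hat * v
  sum(ifelse(N > 0, N * log(N / mu), 0) - (N - mu))  # Poisson deviance statistics
}
oos_pred_error <- function(lambda_hat, N, v) abs(sum(lambda_hat * v - N))               # (5.3)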

The choice of measure of performance may depend on the aim of the analysis. For our purpose of

reconstructing the true underlying model based on previously observed data, the absolute frequency

deviation would be the most natural choice. Since it is not available in practice, our goal is to see

whether any of the other options provides similar results. In order to analyze the optimal tree size,

we have generated a learning sample L based on Assumption 4.9. In addition, we have simulated an

independent test data set with n = 100000000 observations. Figure 5.7 shows the above mentioned

measures of performance for each subtree. We observe that the out of sample deviance as well as

the absolute frequency deviation have a similar behavior and attain their minimum at the same

tree size. This is of great importance to us, since on real data we cannot compute the absolute

frequency deviation but we can calculate the out of sample deviance. The absolute claims deviation

is decreasing in the tree size, hence does not prevent overfitting. For the out of sample prediction

error, we do not observe a clear structure related to the tree size. We conclude that the optimal

tree size lies around 30 leaves. Comparing this result to Figure 5.5 reveals that even when using

stratified cross-validation the 1se rule seems to be too pessimistic. Figure 5.8 shows the predicted

Figure 5.7 – Comparison of optimal tree size according to different measures of performance. The abbreviation 'OoS' stands for 'Out of Sample'.


values of the 1se tree versus the optimally sized tree (according to the minimum absolute frequency

deviation) as well as the true underlying structure corresponding to Figure 5.7. Figure 5.8a shows

the predictor when the price is fixed at 30’000 CHF, whereas for Figure 5.8b we fixed the age of

driver at 50 years. We see that both fits have difficulties capturing the continuous shape of the true underlying model. This is however not very surprising, due to the definition of the tree

predictor as a piecewise constant function. Apart from that, both predictors capture the underlying

structure very well. In order to further justify these findings, we performed a simulation study, where

Figure 5.8 – Fitted values for the tree predictors according to the 1se rule and the optimal tree size compared to the true underlying structure (left: price fixed at 30'000 CHF, right: age fixed at 50 years).

we simulated 100 independent training data sets $\mathcal{L}_k$ for $k = 1, \ldots, 100$ using Assumption 4.9. For each learning sample, we fit a sequence of tree predictors. Using stratified cross-validation, we then compute the tree size according to the 1se rule, the minimum cross-validation error tree size as well as a third criterion, called Mid-Point rule, where we choose the subtree which is the mid-point between the one standard error rule and the minimal cross-validation error. The Mid-Point tree can

be obtained by:

ind_1se <- which(min(cv_err) + cv_se[which.min(cv_err)] > cv_err)[1]  # smallest subtree within one standard error
ind_mp  <- ceiling((ind_1se + which.min(cv_err)) / 2)                 # mid-point between the 1se and the min CV tree

In addition to the learning samples, we have simulated a test sample T with n = 100000000 obser-

vations. Each subtree is validated on the test data set T by computing the out of sample deviance

statistics as well as the absolute frequency deviation. From that we calculate the tree size according

to the minimum out of sample deviance resp. minimum absolute frequency deviation. The results

of the simulation study are shown in Figure 5.9. We see that the chosen tree size for both the out


of sample deviance and the absolute frequency deviation lies around 30 leaves. We also see that the

one standard error rule is highly over-pessimistic. Figure 5.9a suggests the minimal cross-validation error rule as an optimal selection criterion in order to obtain a tree size which is roughly equal to the tree with minimal absolute frequency deviation. However, in the right panel we see that there is a large variability associated with the minimum cross-validation error rule. The proposed Mid-Point rule

gives an alternative to the 1se rule, which provides less pessimistic tree sizes but is still conservative

enough to avoid overfitting. Alternatively to the cross-validation based criteria, the out of sample

deviance selects a tree size similar to the absolute frequency deviation.

(a) Chosen tree size for different measures (b) Difference to optimal tree size

Figure 5.9 – Simulation study results for optimal tree size.

5.3.4 Tree Size Selection via Statistical Significance Pruning

In the last section, we have seen that cross-validation in combination with unbalanced data raises

some issues. In this section, we present an alternative approach to tree pruning based on statistical

significance tests. The idea is to fit a large tree first. Then we perform statistical tests on the leaves

to decide whether or not the separation into two groups is significant. If not, we prune the leaf away. We continue this process until no further leaves are pruned off, see Algorithm 5.2. This approach

circumvents the problems imposed by cross-validation. The R-Code for the pruning procedure can

be found in the Appendix C.3. Note that this approach does not guarantee the final tree to be an

optimally pruned subtree according to Definition 3.6.
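The following R lines sketch the test performed for a single node with two terminal leaves; they only illustrate the idea and are not the implementation of Appendix C.3. The vectors freq_left and freq_right are assumed to contain the observed claims frequencies N_i/v_i of the bootstrap observations falling into the left resp. right leaf.

# two-sided Wilcoxon test comparing the observed frequencies in the two child leaves
# (freq_left, freq_right: assumed numeric vectors of observed frequencies)
test <- wilcox.test(freq_left, freq_right, alternative = "two.sided")

# prune (merge the two leaves) if the separation is not significant at level alpha
alpha <- 0.05
merge_leaves <- (test$p.value >= alpha)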

In order to demonstrate the approach, we have simulated a learning data set L using Assumption 4.9 and grown a large tree until no further split was possible. For the statistical significance pruning, we used the Wilcoxon test at a significance level of α = 5%. We stopped the pruning process when no further leaf was pruned off for 10 successive bootstrap samples of L. Figure 5.10 shows the


Algorithm 5.2 Statistical Significance Pruning
Grow a large tree T_max based on a learning set L.
Define N_prune to be the number of leaves which are pruned off in each iteration. Set N_prune = 1.
while N_prune > 0 do
   Draw a bootstrap sample L* from L.
   Use T_max to predict the values of the test set L*.
   for all nodes in T_max with two terminal leaves, leaf_L and leaf_R, do
      Perform a statistical test of the null hypothesis
         H_0: leaf_L = leaf_R   against   H_A: leaf_L ≠ leaf_R.
      If H_0 is not rejected at significance level α, merge the two leaves into one leaf.
   end for
   Update N_prune.
   Set T_max to be the new tree obtained after the pruning.
end while

predicted values for the subtrees obtained by statistical significance pruning (Stat), one standard

error rule (1se) as well as the unpruned tree (Full) for fixed price resp. fixed age. Overall we see

that Algorithm 5.2 removed all signs of overfitting. For the simulated learning sample L, the tree

obtained by significance pruning contains 18 leaves (similar to the Mid-Point rule, see Figure 5.9a).

On the other hand, the 1se rule tree has 10 leaves.

(a) Price fixed at 30'000 CHF (b) Age fixed at 50 years
Figure 5.10 – Predicted values of the subtree obtained by statistical significance pruning (Stat) in comparison to other methods as well as the true underlying structure.


The proposed procedure provides an alternative to the cross-validation-based cost-complexity pruning. Note that the significance level α was set arbitrarily to 5%; it should rather be seen as a tuning parameter. Overall the method provided slightly larger trees compared to the 1se rule; however, further research on this method is needed.

5.3.5 Variability of Data

Due to the definition of the tree predictor as a piecewise constant function, we would like to understand how closely we can approximate the true underlying function, which does not need to be a step function itself. In this section, we argue that the coefficient of variation of the data is an important factor when analyzing the extent to which the tree predictor is able to recover the underlying structure. Therefore we discuss the variability of the data, measured by the coefficient of variation, and its relation to portfolio characteristics such as the claims frequency as well as the volume. In a second step, we then analyze the effects upon the tree predictor.

Poisson Claims Counts:

In a first step, we investigate the situation for Poisson-distributed claims counts. For each policyholder i = 1, ..., n, denote by $\lambda(x_i)$ its expected claims frequency depending on the covariates $x_i \in \mathbb{R}^p$. Assume that $N_i \sim \mathrm{Poi}(\lambda(x_i) v_i)$ for volumes $v_i > 0$ and i = 1, ..., n. Then the coefficient of variation of the observed frequencies is given by
\[
\mathrm{Vco}\left(\frac{N_i}{v_i}\right) = \frac{1}{\sqrt{\lambda(x_i)\, v_i}}. \tag{5.4}
\]
Therefore, if $\lambda(x_i) v_i \to 0$, the coefficient of variation of the claims frequency becomes infinitely large. This means that the one standard error confidence bounds, which are given by
\[
CI = \left( E\!\left(\tfrac{N_i}{v_i}\right) - \mathrm{Vco}\!\left(\tfrac{N_i}{v_i}\right) E\!\left(\tfrac{N_i}{v_i}\right),\; E\!\left(\tfrac{N_i}{v_i}\right) + \mathrm{Vco}\!\left(\tfrac{N_i}{v_i}\right) E\!\left(\tfrac{N_i}{v_i}\right) \right),
\]

become infinitely large as well. As discussed in Section 4.2.2, actuaries solve this issue by introducing risk classes of homogeneous subportfolios. Let $N^i_1, \dots, N^i_{m_i}$ denote the claims counts of the policies within risk class i. Assume that $N^i_j \sim \mathrm{Poi}(\lambda_i v_j)$ are independent for $j = 1, \dots, m_i$ and for all i. Then the total number of claims of risk class i is given by
\[
N^i := \sum_{j=1}^{m_i} N^i_j \sim \mathrm{Poi}\Big(\lambda_i \sum_{j=1}^{m_i} v_j\Big), \tag{5.5}
\]


and denote by $v^i := \sum_{j=1}^{m_i} v_j$ the volume of risk class i. The coefficient of variation of the expected frequency of risk class i is given by
\[
\mathrm{Vco}\left(\frac{N^i}{v^i}\right) = \frac{1}{\sqrt{\lambda_i \sum_{j=1}^{m_i} v_j}}.
\]
In this light, the increased volume of the subportfolio results in a reduction of the coefficient of variation. We have the asymptotic relation
\[
\sum_{j=1}^{m_i} v_j \to \infty \quad \Longrightarrow \quad \mathrm{Vco}\left(\frac{N^i}{v^i}\right) \to 0, \tag{5.6}
\]
and hence also the one standard error confidence bounds tend to zero.

Remark 5.2. Note that there exists a duality between increasing the portfolio size and the claims frequency. For example, if we consider risks on a 10-year basis, then the portfolio size decreases by the factor 1/10. Therefore the product $\lambda(x_i) v_i$ in equation (5.4) remains constant when varying the time horizon on which claims frequencies are measured, i.e. increasing $\lambda(x)$ is equivalent to considering a larger portfolio. J

Negative-Binomial Claims Counts:

As introduced in Section 4.3.1, we use negative-binomially distributed claims counts to account for overdispersion in the data. Assume that $N_i \sim \mathrm{NegBin}(\lambda(x_i) v_i, \gamma)$ for i = 1, ..., n. We can again compute the coefficient of variation, which leads to
\[
\mathrm{Vco}\left(\frac{N_i}{v_i}\right) = \sqrt{\frac{1}{\lambda(x_i) v_i} + \frac{1}{\gamma}} > \frac{1}{\sqrt{\gamma}}.
\]
Hence we see that there is a lower bound for the coefficient of variation of an individual policy imposed by the overdispersion parameter. Similar to the Poisson case, we can form risk classes. Let $N^i_1, \dots, N^i_{m_i}$ denote the claims counts of the policies within risk class i. Assume that $N^i_j \sim \mathrm{NegBin}(\lambda_i v_j, \gamma)$ are independent for $j = 1, \dots, m_i$ and for all i. For negative-binomial claims counts, the analogue to relation (5.5) does not hold true, i.e.
\[
N^i := \sum_{j=1}^{m_i} N^i_j \nsim \mathrm{NegBin}(\lambda_i v^i, \gamma),
\]
where $v^i := \sum_{j=1}^{m_i} v_j$ denotes the volume of risk class i. However, Lemma 5.3 provides the coefficient of variation, including an asymptotic relation similar to (5.6).

Lemma 5.3. Let $N^i_1, \dots, N^i_{m_i}$ denote the claims counts of policies in risk class i, where $N^i_j \sim \mathrm{NegBin}(\lambda_i v_j, \gamma)$ and independent for $j = 1, \dots, m_i$. Assume that the volumes $v_j$ are bounded from below and above, i.e. there exist $\varepsilon > 0$ and $M < \infty$ such that $0 < \varepsilon \le v_j \le M < \infty$. Denote by $N^i = \sum_{j=1}^{m_i} N^i_j$ the total claims count of risk class i with total volume $v^i = \sum_{j=1}^{m_i} v_j$. Then we have the relation
\[
\mathrm{Vco}\left(\frac{N^i}{v^i}\right) = \sqrt{\frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\gamma}\,\frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2}} \;\to\; 0 \qquad \text{if } m_i \to \infty.
\]

The proof of Lemma 5.3 can be found in Appendix A.6. Although we showed that the variability

tends to zero, the speed of convergence might be significantly slower than for the Poisson case.

However, the next example shows that for our purpose, the differences to the Poisson model are only

marginal.

Example 5.4. To illustrate the impact of the dispersion parameter $\gamma$, consider the following example. Assuming all policies in the portfolio have a full year at risk, $v_j = 1$, the coefficient of variation is then given by
\[
\mathrm{Vco}\left(\frac{N^i}{v^i}\right) = \sqrt{\frac{1}{\lambda_i m_i} + \frac{1}{\gamma}\,\frac{1}{m_i}}.
\]
Let $\gamma_1 = 1.5$, as estimated for the Motor Insurance Collision Data Set, and $\gamma_2 = 0.01$, representing a data set with stronger overdispersion. Assume that the claims frequency in risk class i is $\lambda_i = 0.1$. Figure 5.11 shows the standard errors
\[
SE_{\mathrm{NegBin}} = E(N^i/v^i)\,\mathrm{Vco}(N^i/v^i) = \sqrt{\frac{\lambda_i}{m_i}\left(1 + \frac{\lambda_i}{\gamma}\right)},
\qquad
SE_{\mathrm{Poi}} = E(N^i/v^i)\,\mathrm{Vco}(N^i/v^i) = \sqrt{\frac{\lambda_i}{m_i}},
\]
for the negative-binomial and the Poisson case as a function of $m_i$, the number of policies in risk class i. This example illustrates that even though overdispersion adds extra volatility to the data, by forming subportfolios we can significantly reduce the variability. Furthermore, the differences between the Poisson and the negative-binomial model with a dispersion parameter similar to the Motor Insurance Collision Data Set are only marginal, suggesting that we could also directly work with Poisson-distributed claims counts.


Figure 5.11 – Convergence of the standard errors for Poisson and negative-binomial claims counts (standard error versus number of policies in risk class i, for Negative-Binomial with γ = 1.5, Negative-Binomial with γ = 0.01 and Poisson).
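The standard-error curves of Figure 5.11 can be reproduced with a few lines of R. This is a minimal sketch using the parameter values of Example 5.4; the exact plotting code of the thesis may differ.

# standard error of the observed frequency in a risk class with m policies
# (lambda = expected claims frequency, gamma = dispersion parameter)
m      <- seq(10, 1000, by = 10)
lambda <- 0.1
se_poisson <- sqrt(lambda / m)
se_negbin  <- function(gamma) sqrt(lambda / m * (1 + lambda / gamma))

plot(m, se_negbin(0.01), type = "l", col = "red",
     xlab = "Number of Policies in Risk Class i", ylab = "Standard Error")
lines(m, se_negbin(1.5), col = "blue")
lines(m, se_poisson, col = "black")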

Implications of the Data Variability on Tree Predictors

We have seen that the variability is controlled by the volume of the portfolio: increasing the volume leads to a decrease in variability. Next we illustrate the effect of the data variability upon the tree algorithm. To understand why the variability of the data is important to the tree algorithm, consider the situation with only one covariate (p = 1), e.g. the price of the car. Assume that we have recorded data for 4 different covariate values x1, x2, x3 and x4. For each value, we have 100 observations. Due to the variability within the data, the measurements will scatter around the true values λ(x_i). An illustrative example is shown in Figure 5.12a. The colored points on the y-axis in Figure 5.12b denote the observed values for the different covariate values. The orange lines show the chosen split-points, the red step function denotes an example of a tree predictor and the black line the true underlying model. The aim of the tree algorithm is to group the observations in order to recover the underlying structure. We see that, depending on the variability of the observations, the tree algorithm may be able to distinguish two groups (e.g. blue and red). On the other hand, the larger the variability, the less the tree is able to recover small differences (e.g. red and green).


(a) Variability of observations (b) Recovered structures
Figure 5.12 – Illustration of the impact of the data variability on the tree predictor. The black line indicates the true structure, the brown line shows the tree predictor. The colored lines on the y-axis denote the observed frequencies.

5.3.6 Variable Selection

In this section we demonstrate the ability of the tree algorithm to perform variable selection whilst growing the tree. For that purpose, we have simulated a learning data set using Assumption 4.9. Recall that the covariates are sampled from the Motor Insurance Collision Data Set. In addition to age and price, we have also sampled the policynumber. We then modified the policynumber such that we only keep the last 2 digits; otherwise the policynumber would be correlated with the age covariate, because policynumbers are assigned in ascending order. Since the covariate policynumber is not included in the model equation (4.5), it should not be used as a splitting variable. In order to verify this, we have calculated the variable importance for both the unpruned as well as the minimum CV tree. Figure 5.13 shows that the minimum CV tree does not use the covariate policynumber in any split, neither primary nor surrogate. This finding supports the fact that the tree algorithm is capable of performing variable selection. However, note that this must also be seen in light of Figure 5.12: it may be that a covariate indeed has a systematic impact on the response, but the variability of the data is too large for the tree algorithm to recover this structure.

5.3.7 Stability

A major issue when using tree predictors is their variability when the training data is changed slightly. In order to analyze this problem, we have simulated 10 learning data sets L_k, k = 1, ..., 10, from the frequency model (4.5) based on Assumption 4.9. For each data set, we have calculated the tree


(a) Minimum CV Tree (21 leaves) (b) Unpruned Tree (104 leaves)
Figure 5.13 – Variable importance plots for the tree predictors including the non-informative covariate policynumber (pol). Subfigure (a) shows the minimum cross-validation error tree, subfigure (b) the unpruned tree.

predictor using the Mid-Point rule, denoted by $\lambda_{MP}(x_1, x_2, \mathcal{L}_k)$, as well as a GLM model with Poisson assumption, denoted by $\lambda_{GLM}(x_1, x_2, \mathcal{L}_k)$. For each method, we estimate the coefficient of variation at $(x_1, x_2)$ by
\[
\widehat{\mathrm{Vco}}\big(\lambda_{MP}(x_1, x_2)\big) = \frac{\widehat{\mathrm{sd}}\big(\lambda_{MP}(x_1, x_2, \mathcal{L}_1), \dots, \lambda_{MP}(x_1, x_2, \mathcal{L}_{10})\big)}{\hat{\mu}\big(\lambda_{MP}(x_1, x_2, \mathcal{L}_1), \dots, \lambda_{MP}(x_1, x_2, \mathcal{L}_{10})\big)},
\]
where $\widehat{\mathrm{sd}}$ denotes the sample standard deviation and $\hat{\mu}$ the sample mean. Analogously we proceed for $\lambda_{GLM}$. In Figure 5.14 the results are visualized using a heat map. We observe that the GLM is

(a) Mid-Point Rule Tree (b) GLM
Figure 5.14 – Stability comparison in terms of coefficient of variation of the tree predictor and a GLM.

more stable, i.e. less affected by changes in the training data, and therefore provides a much more stable model. These findings are in line with the considerations made in Section 3.4.2.
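The stability measure above is straightforward to compute once the refits are available. The following R sketch shows the idea for a single grid point; simulate_learning_sample() and fit_midpoint_tree() are hypothetical helpers standing in for the simulation and fitting steps described earlier.

# coefficient of variation of the fitted frequency at one grid point (age, price),
# estimated over K = 10 refits on independently simulated learning samples
K <- 10
preds <- sapply(seq_len(K), function(k) {
  Lk  <- simulate_learning_sample()            # hypothetical helper
  fit <- fit_midpoint_tree(Lk)                 # hypothetical helper
  predict(fit, newdata = data.frame(age = 40, price = 30))
})
cv_hat <- sd(preds) / mean(preds)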


5.4 Univariate Regression Tree for Real Data Claims Frequency

In a first step, we fit univariate regression trees to the Motor Insurance Collision Data Set. There

are two main reasons for starting with only one covariate. First of all, the restriction to one covariate

enables us to understand the splits in detail. Secondly, the results can be used to verify the risk

groups shown in Section 4.2.2. The data used for the analysis was preprocessed according to Data

Set A, see Section 4.2.1. We present the detailed analysis for the covariate age of driver. For the

other covariates, we use these findings and present the results directly. For the Poisson method we

use a shrinkage parameter k = 0.87 as derived in Section 5.3.1. Since we can plot the predicted

values, we base the choice of the optimal tree size on visual inspection, assessing the relevance

of the chosen splits as well as the underlying volumes. The analysis shows that there is not one

general rule that selects the optimal tree size. In practice, several choices should be tested.

Tariff Criterion 1: Age of Driver

We start by fitting a separate tree for each method: Anova, Weighted Anova and Poisson. Figure 5.15 shows the cross-validation results obtained from the plotcp function. The weights for weighted anova were chosen to be w_i = v_i, resulting in the model introduced in Example 3.17.
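For orientation, the three univariate fits can be set up with rpart along the following lines. This is a sketch under assumed variable names (N = claims count, v = years at risk, age = age of driver, data frame dat); the exact calls and control settings used for the thesis may differ.

library(rpart)

# anova on the observed frequencies, ignoring the volumes
fit_anova  <- rpart(N / v ~ age, data = dat, method = "anova",
                    control = rpart.control(cp = 1e-6, xval = 10))
# weighted anova with case weights w_i = v_i
fit_wanova <- rpart(N / v ~ age, data = dat, method = "anova", weights = v,
                    control = rpart.control(cp = 1e-6, xval = 10))
# Poisson deviance with shrinkage parameter k = 0.87 (Section 5.3.1)
fit_pois   <- rpart(cbind(v, N) ~ age, data = dat, method = "poisson",
                    parms = list(shrink = 0.87),
                    control = rpart.control(cp = 1e-6, xval = 10))

plotcp(fit_pois)   # relative cross-validation error versus cp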

(a) Method: Anova (b) Method: Weighted Anova (c) Method: Poisson
Figure 5.15 – Relative cross-validation error results including one standard error whiskers for the methods Anova, Weighted Anova and Poisson.

There are two important observations from Figure 5.15:

1. The results for the methods anova and weighted anova are heavily distorted by the large

standard errors. Figure 5.16 shows the same relative cross-validation plot with the standard

error whiskers being removed. The absolute improvements for the methods weighted anova

and Poisson are rather small relative to the root tree. This is similar to the observations in

Section 5.3.2 and can be explained by the unbalanced nature of the data. Overall the weighted


anova and the Poisson method share a similar shape, compare Figures 5.15c and 5.16b. For

the anova method, we observe an increasing cross-validation error curve, meaning that the tree

is not able to recover any structure from the data. This can be explained by the fact that the

sum of squares measure does not take the underlying volume into account. Based on these

findings, we will not further elaborate on the anova method.

(a) Method: Anova (b) Method: Weighted Anova
Figure 5.16 – Relative cross-validation error results without one standard error whiskers for the methods Anova and Weighted Anova.

2. We have seen in Section 5.3.3 that for unbalanced data the variance of the cross-validation results can be reduced using stratified cross-validation. Due to Remark 5.1, we know that the cross-validation results for the methods anova and weighted anova are already obtained using stratification. However, for the Poisson method, the rpart package does not apply the stratification scheme correctly. In Figure 5.17 the stratified cross-validation errors for the Poisson method, including the one standard error whiskers, are shown. The colored vertical lines represent different choices of tree size.

Figure 5.17 – Stratified cross-validation error results for the univariate Poisson regression tree predicting the expected claims frequency using the covariate age of driver (tree size choices: 1se, Elbow, Mid-Point, Min CV).


We conclude that the cross-validation results for the methods weighted anova and Poisson have a similar structure. However, using the weighted squared error as a loss function results in large standard errors (even when using stratified sampling), making the use of the one standard error rule questionable. Nevertheless, we could use the elbow or minimum cross-validation error rule for the weighted anova method. Figure 5.18 shows the fitted values for different subtrees according to the tree size selection criteria discussed in Section 5.3.3. We see that the trees are able to capture the characteristics observed in the univariate analysis in Section 4.2.2. Also note that the fitted trees for the two methods are very similar. In Figure 5.19 the elbow criterion trees for both methods are shown. We see that the only difference is the split at age < 78, for which the weighted anova method chooses to split at age < 80 instead. Based on these observations, we conclude that the methods

(a) Method: Poisson (1se, Elbow, Mid-Point, Min CV) (b) Method: Weighted Anova (Elbow, Min CV)
Figure 5.18 – Fitted values for different choices of tree size.

weighted anova and Poisson result in very similar predictions. Since the deviance measure provides better cross-validation results in terms of variability, we will focus on the Poisson method for the remainder.

Remark 5.5. Using the Poisson deviance statistic during cross-validation for the weighted anova method does not provide an alternative. In general it can happen that λ = 0 within some leaves; in that case, the deviance statistic is not well-defined. J

The predictions obtained by the minimum cross-validation error tree show clear signs of overfitting the data for ages between 40 and 55. The one standard error tree does not seem to capture the whole underlying structure, which is in line with the results obtained on simulated data in Section 5.3.3. Visual inspection suggests that the trees derived from the Mid-Point rule or the elbow criterion provide the best fits. Finally, Figure 5.20 compares the fitted values of the Poisson trees based on the elbow criterion and the Mid-Point rule with respect to the underlying volumes.


(a) Method: Poisson (splits at age 22, 24, 34, 50, 72 and 78) (b) Method: Weighted Anova (splits at age 22, 24, 34, 50, 72 and 80)
Figure 5.19 – Trees based on the elbow criterion for the methods weighted anova and Poisson.

(a) Tree Size: Elbow criterion (age groups 18-22, 23-24, 25-34, 35-50, 51-71, 72-77, 78-99) (b) Tree Size: Mid-Point rule (age groups 18-22, 23-24, 25-26, 27-34, 35-50, 51-71, 72-77, 78-81, >81)
Figure 5.20 – Risk groups with underlying volumes chosen by the tree algorithm.

Tariff Criterion 2: Nationality

The second tariff criterion we analyze using univariate regression trees is the covariate nationality. Due to the reasons explained in the analysis of the covariate age of driver, we always use the Poisson


method in the sequel. In total there are 160 different nationalities within the Motor Insurance

Collision Data Set. However, there are also many exotic nationalities that contain only very few

observations. Therefore instead of one accounting year, we will use the accumulated data from 2010

up to 2014 for the analysis. Nevertheless, there still remain nations such as Macao, where we have a

total volume of less than one year at risk. We merge these nationalities with their neighboring country,

in this case China. In areas where we do not have any country with a large enough risk exposure, we

form geographic groups. These are Northern Africa, Sub-Sahara Africa, Central America and Middle

East. We proceed with the merging until each nationality has a total risk exposure of at least 100

years at risk over the period of 5 years. We then use the tree algorithm to determine the risk groups.

Figure 5.21 shows the cross-validation results with possible choices of the tree size. Figure 5.22 shows

Figure 5.21 – Stratified cross-validation results for the univariate regression tree predicting the expected claims frequency using the covariate nationality (tree size choices: 1se, Elbow, Mid-Point, Min CV).

the fitted values of the tree predictor based on the elbow criterion as well as the Mid-Point rule. We see that the tree with four leaves has sufficient volume in each leaf, hence there is no need to prune the tree further towards the 1se tree. For the Mid-Point rule tree, nationality groups 5 to 7 have a rather small volume, suggesting that an even larger tree would only be reasonable if the first risk group is further split. However, the large volume of the first risk group is mainly explained by the fact that it contains the Swiss drivers. Comparing the results to the risk groups chosen by AXA shows that the tree predictor is able to group categorical variables with many levels and small underlying volumes into risk classes with larger volumes. This offers a large potential for applications of the tree algorithm. Insurance companies are regularly faced with the problem of forming risk groups of categorical variables without an a priori ordering between the levels. Instead of tedious manual work, the tree algorithm provides an alternative way to obtain a risk-based grouping.


(a) Tree Size: Elbow criterion (4 nationality groups) (b) Tree Size: Mid-Point rule (7 nationality groups)
Figure 5.22 – Risk groups with underlying volumes chosen by the tree algorithm.

Tariff Criterion 3: Price of Car

The next tariff criterion we analyze is the covariate price of car. Figure 5.23 provides the cross-validation results as well as the fitted values corresponding to the pruned subtrees. The different tree size selection criteria are indicated by the colored vertical lines.

(a) Cross-validation error (b) Fitted values for different tree sizes
Figure 5.23 – Stratified cross-validation results as well as fitted values for the univariate regression tree predicting the expected claims frequency using the covariate price of car.


The minimum CV tree shows signs of overfitting at 22'000 CHF as well as at 63'000 CHF. An interesting feature of both the Mid-Point and the minimum CV tree is the additional split for very expensive cars above 175'000 CHF. This may suggest that drivers of a car whose price exceeds a certain threshold start to drive more carefully. Another explanation might be that these are cars, e.g. vintage cars, which are only driven at weekends, i.e. there may be a correlation between price and another covariate such as kilometres yearly driven for these expensive cars. The elbow criterion tree is a subtree of the Mid-Point tree with one leaf less. However, since the Mid-Point tree does not show any sign of overfitting, we did not add the corresponding fitted values to Figure 5.23b. Finally, Figure 5.24 illustrates the chosen risk classes with underlying volumes for the 1se and Mid-Point rule trees. Both trees have reasonably large risk classes, suggesting that the Mid-Point rule is an appropriate choice for the tree size in this case.

(a) Tree Size: 1se rule (price groups, in 1'000 CHF: <39, 40-62, >62) (b) Tree Size: Mid-Point rule (price groups, in 1'000 CHF: <18, 18-21, 21-32, 33-39, 40-61, 62-81, 82-175, >175)
Figure 5.24 – Risk groups with underlying volumes chosen by the tree algorithm.

Tariff Criterion 4: Kilometres Yearly Driven

Finally, we consider the criterion kilometres yearly driven. Figure 5.25 shows both the cross-validation results as well as the fitted tree predictors for different choices of tree size. Note that the elbow criterion coincides with the Mid-Point rule. The minimum CV tree shows clear signs of overfitting. On the other hand, the 1se tree only uses one split. This can be interpreted as follows: instead of using the numerical covariate kilometres yearly driven, we could use a categorical covariate frequent driver. According to the 1se rule, a frequent driver would drive more than 18'000 km per year. On the other hand, the elbow criterion suggests a step-wise increase of the claims frequency in the amount of kilometres yearly driven. Based on the findings in Section 5.3.3 of the 1se rule being consistently


over-pessimistic, we suggest the elbow criterion as the optimal choice in this case. Figure 5.26 shows

the fitted values with respect to the underlying volumes.

(a) Cross-validation errors (b) Fitted values for different tree sizes
Figure 5.25 – Stratified cross-validation results as well as fitted values for the univariate regression tree predicting the expected claims frequency using the covariate kilometres yearly driven.

(a) Tree Size: 1se (KM groups, in 1'000 km: 0-18, >18) (b) Tree Size: Elbow criterion (KM groups, in 1'000 km: <6.5, 6.5-14, 14-18, >18)
Figure 5.26 – Risk groups with underlying volumes chosen by the tree algorithm.


5.5 Bivariate Regression Tree for Real Data Claims Frequency

In this section, we analyze the fits of the tree algorithm obtained from the Motor Insurance Collision Data Set using the covariates age of driver and price of car. Both have been shown to be important in the univariate analysis in Section 4.2.2. The restriction to two covariates enables us to understand the splits and interactions in detail. For the analysis we have prepared the data of accounting year 2010 according to Data Set A described in Section 4.2.1. Having obtained a reasonable fit, we then compare the predictions with the GLM model used by AXA on the out of sample data of 2011 to 2014.

5.5.1 Tree Size Selection

Figure 5.27 shows the stratified cross-validation results of the fitted Poisson regression trees. Possible tree sizes vary between 12 leaves when using the 1se rule and 28 for the minimum CV error rule. Applying the elbow criterion seems to be rather delicate, since there is no clear point beyond which the cross-validation error stops decreasing. One would probably choose a tree size between the 1se and the Mid-Point rule.

Figure 5.27 – Cross-validation results for the bivariate regression tree (tree size choices: 1se, Mid-Point, Min CV).

Figure 5.28 shows the fits obtained for different choices of tree size. In addition to the cross-validation error based methods for choosing the tree size, we also show the tree predictor resulting from statistical significance pruning, see Section 5.3.4. The Wilcoxon test with significance level α = 5% was used. The resulting tree has a total of 18 leaves (again similar to the Mid-Point rule). The structure captured by all methods is close to what we expect from the simulated data in Section 4.3.1. However, instead of visual inspection, we will use out of sample testing to validate the goodness of fit in Section 5.5.3.


(a) Tree Size: 1se (b) Tree Size: Mid-Point Rule (c) Tree Size: Minimum CV (d) Tree Size: Significance Pruning
Figure 5.28 – Fitted values for age ∈ [18, 90] and price ∈ [15'000, 85'000] of the bivariate tree predictors for different choices of tree size.

From Figure 5.28 we observe that the trees include multiple interactions. For example, when considering the 1se tree, we see that for cheap cars below 20'000 CHF, the age covariate is only used for young drivers below 36 years.

5.5.2 Comparison: Tree and GLM

The standard statistical models used by the majority of insurance companies for the pricing of non-life insurance products are generalized linear models (GLM). Therefore, we are interested in comparing the tree predictors to a GLM model. Usually the GLM models are applied in a categorical way, where the risk classes need to be determined a priori. The performance of the GLM, however, strongly depends on choosing reasonable groups: very small groups may heavily distort the predictions. Furthermore, variable selection needs to be done in advance or in a step-wise manner. Inclusion of irrelevant covariates also negatively affects the performance of a GLM. However, since both age and price are highly significant for predicting the claims frequency, this is not an issue in this section. The fitted GLM used for comparison with the tree predictors is based on the risk classes used by AXA. In total, these are 8 risk classes for the covariate age and also 8 risk classes for the covariate price. Graphically, a GLM predictor can also be seen as a piecewise constant function fitting a constant in each region specified by the risk classes.
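In R, such a categorical GLM with Poisson assumption could be set up roughly as follows. This is only a sketch under assumed names (age_class and price_class are precomputed risk classes, N the claims count, v the years at risk, dat the data frame); AXA's actual model contains further covariates and is not reproduced here.

# Poisson GLM on precomputed risk classes, with the years at risk as offset
fit_glm <- glm(N ~ age_class + price_class + offset(log(v)),
               family = poisson(link = "log"), data = dat)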

In order to understand the differences between the tree predictor and the GLM model, we fit a bivariate GLM model using AXA's risk groups and calculate the difference between the fitted values of the GLM and the tree predictors. Denote by $\lambda_{tree}$ the tree predictor and by $\lambda_{GLM}$ the GLM fit. Figure 5.29 shows the difference
\[
\lambda_{tree}(\text{age}, \text{price}) - \lambda_{GLM}(\text{age}, \text{price}), \tag{5.7}
\]
as well as the relative difference
\[
\frac{\lambda_{tree}(\text{age}, \text{price}) - \lambda_{GLM}(\text{age}, \text{price})}{\lambda_{GLM}(\text{age}, \text{price})}, \tag{5.8}
\]

for age ∈ [18, 90] and price ∈ [15'000, 85'000]. Within the blue areas the tree predictor fits larger values than the GLM, and in the red areas vice versa. It reveals that for large parts the tree and the GLM model provide very similar fits. The tree is more concerned about older drivers above 80 years. On the other hand, for young drivers with expensive cars we observe larger frequencies within the GLM model. This is because the GLM does not include interaction terms: it models factors for the price classes and the age classes separately. In the univariate analysis we have seen that both young age and an expensive car indicate a high frequency. Therefore, in the GLM these drivers are considered extremely high risk.
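The difference surfaces (5.7) and (5.8) of Figure 5.29 can be computed on a grid of covariate values. The sketch below assumes fitted objects fit_tree (rpart, Poisson method) and fit_glm (Poisson GLM with offset) as above; the helper to_risk_classes, the variable names and the grid resolution are illustrative only.

# grid of age and price values (price in 1'000 CHF)
grid <- expand.grid(age = 18:90, price = seq(15, 85, by = 1))

# map the grid to the GLM risk classes (to_risk_classes is a hypothetical helper)
grid_glm   <- to_risk_classes(grid)
grid_glm$v <- 1                                # one year at risk, so the offset log(v) is zero

lambda_tree <- predict(fit_tree, newdata = grid)
lambda_glm  <- predict(fit_glm, newdata = grid_glm, type = "response")

diff_abs <- lambda_tree - lambda_glm           # difference (5.7)
diff_rel <- diff_abs / lambda_glm              # relative difference (5.8)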


(a) Difference (5.7) (b) Relative difference (5.8)
Figure 5.29 – Comparison of predicted claims frequencies: Tree vs GLM.

5.5.3 Out of Sample Performance

When comparing the methods with respect to the cross-validation error, we need to be careful, since we used cross-validation for model selection in the pruning process. In order to avoid this issue, we compare the different methods out of sample. All methods were fit to the data of 2010, which was preprocessed according to Data Set A. In addition to the tree models, we fit a GLM model with a Poisson assumption and the risk groups used by AXA. On top, we also compare the methods to the Null model, which predicts the overall portfolio average for all observations. Based on the findings of Section 5.3.3, we calculate the mean out of sample deviance for each method in order to evaluate the performance. The mean sample deviance is given by
\[
\frac{1}{\sum_{i=1}^{n} v_i}\, 2 \sum_{i=1}^{n} \left( N_i \log\!\left(\frac{N_i}{\lambda(x_i) v_i}\right) - \big(N_i - \lambda(x_i) v_i\big) \right), \tag{5.9}
\]
where $N_1, \dots, N_n$ denote the claims counts in the out of sample test sets. We consider the following models (a small R sketch of the deviance computation is given after the list):

1se: Tree predictor using 1se rule.

Mid-Point: Tree predictor using Mid-Point rule.

Min CV: Tree predictor using the minimum CV error rule.

Stat: Tree predictor based on statistical significance pruning.

GLM: Generalized linear model with Poisson assumption.


Null: Model predicting the overall portfolio average for all policies.
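A minimal R implementation of the mean out of sample deviance (5.9) might look as follows; N, lambda_hat and v are assumed to be numeric vectors holding the observed claims counts, the predicted frequencies and the years at risk of the test observations.

# mean out of sample Poisson deviance (5.9)
mean_oos_deviance <- function(N, lambda_hat, v) {
  mu  <- lambda_hat * v
  # convention: N * log(N / mu) = 0 whenever N = 0
  dev <- 2 * (ifelse(N > 0, N * log(N / mu), 0) - (N - mu))
  sum(dev) / sum(v)
}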

For the out of sample testing we only consider observations with a duration equal to one. The results are shown in Table 5.1. Similar to the observation in Section 5.3.3, the relative differences

Mean Out of Sample Deviance
Year        2011       2012       2013       2014
1se         0.7231008  0.6912631  0.7767323  0.7815325
Mid-Point   0.7227214  0.6911108  0.7764441  0.7812819
Min CV      0.7226967  0.6910413  0.7764235  0.7811854
Stat        0.7228365  0.6911267  0.7765412  0.7813254
GLM         0.7227862  0.6911124  0.7763417  0.7812630
Null        0.7271414  0.6956903  0.7811100  0.7860495

Table 5.1 – Mean out of sample deviance (5.9) for accounting years 2011 to 2014. All methods were fit based on the data of accounting year 2010. The best method for each year is highlighted in bold face.

between the models are relatively small. However, all models outperform the Null model. In order to visualize the differences between the models, Figure 5.30 shows the improvements in mean sample deviance compared to the Null model. We see that the 1se tree consistently performs worst. This

Figure 5.30 – Improvements of the mean out of sample deviance versus the Null model for the bivariate claims frequency analysis (models: 1se, Mid-Point, Min CV, Stat, GLM).

is in line with our observations of the simulation study in Section 5.3.3. Except for accounting year 2013, the minimum CV tree has a smaller mean sample deviance than the GLM model. However, based on the variability between different years, we conclude that both methods (minimum CV tree and GLM) have a similar performance.


5.6 Multivariate Regression Tree for Real Data Claims Frequency

In a next step, we add further covariates to the model. We first consider all variables that are also used by AXA in their GLM model, i.e. a total of 11 covariates (see Table 5.2). The data is preprocessed according to both Data Set A and Data Set B described in Section 4.2.1. Note that due to the larger number of covariates, Data Set A may still contain a significant number of outliers due to small durations. Hence we fit two tree predictors:

1. Tree based on Data Set A, denoted by 'Tree (A)',

2. Tree based on Data Set B, denoted by 'Tree (B)'.

Comparing the cross-validation results, shown in Figure 5.31, reveals that using Data Set B results in larger trees. Possible tree sizes range from 40 leaves (Tree (A), 1se) up to 75 leaves (Tree (B), Min CV). Figure 5.32 shows the variable importance plots for the minimum CV trees indicated by the

(a) Data Set A (b) Data Set B
Figure 5.31 – Stratified cross-validation error results for multivariate tree predictors based on Data Set A and B.

black vertical lines in Figure 5.31. It shows a very similar structure for both trees. The interpretation of the variable importance should be done with caution. As discussed in Section 3.4.2, trees may look rather different when the learning sample is changed slightly. For a more robust measure of variable importance, we should consider the average over many trees, as done in bagging, see Section 6.3.2. Nevertheless, the results are in line with the significance levels that the covariates show in the GLM model analysis.
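For reference, the variable importance values underlying plots such as Figures 5.32 and 5.34 can be read directly off a fitted rpart object (here denoted fit_tree); the rescaling to a maximum of 100 is an assumption about how the plots were produced.

# variable importance of a fitted rpart tree (surrogate splits are included),
# rescaled so that the most important covariate has value 100
vi <- fit_tree$variable.importance
round(100 * vi / max(vi))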


(a) Data Set A (covariates in decreasing importance: Price, Deduct., Age, Bonus, Nat., Leasing, Car Age, Weight, Region, Car Class, Km) (b) Data Set B (Price, Bonus, Deduct., Age, Car Age, Nat., Weight, Leasing, Region, Car Class, Km)
Figure 5.32 – Variable importance for the minimum CV tree predictors on Data Set A and B.

In addition to the tree models above, we fit a further tree predictor, denoted by 'Tree Plus', containing the additional variables Office, Horse Power, Policy Age and Sex. Table 5.2 summarizes the covariates included in the different models. In Figure 5.33 we see that the cross-validation results for the Tree

Variables
Model:   GLM           Tree          Tree Plus
         Age*          Age           Age
         Price*        Price         Price
         Nat*          Nat*          Nat*
         Km*           Km            Km
         Weight*       Weight        Weight
         Bonus         Bonus         Bonus
         Leasing       Leasing       Leasing
         Car Class     Car Class     Car Class
         Deductible    Deductible    Deductible
         Region        Region        Region
         Car Age*      Car Age       Car Age
                                     Office
                                     Horse Power
                                     Policy Age
                                     Sex

Table 5.2 – The star (*) indicates that the variable was grouped in advance. The grouping was chosen to optimize the performance of the GLM, except for Nationality, where we used the grouping of AXA for both GLM and Tree predictors.

Plus predictor are very similar to those of the previous models. However, there are several changes with respect to the variable importance, shown in Figure 5.34. As before, the variable importance measure must be interpreted with caution, due to the tree instability but also due to correlations among the covariates. Some of the additional covariates, such as Horse Power and Price or Policy Age and Age, are highly


Figure 5.33 – Stratified cross-validation error results for the multivariate tree predictor containing additional covariates (tree size choices: 1se, Mid-Point, Min CV).

(a) Tree (A) Plus (b) Tree (A) Plus without Horse Power
Figure 5.34 – Variable importance and cross-validation results for the Tree (A) Plus predictor.

correlated (see Appendix, Figure B.1), meaning they measure a similar effect. For example, the importance of the covariate Horse Power is now shared with the covariate Price. Although the variable importance measure of Definition 3.21 takes surrogate splits, i.e. correlations, into account, removing either covariate may increase the importance of the remaining one. This is supported by Figure 5.34b, which shows the variable importance of the predictor Tree (A) Plus where the covariate Horse Power was excluded. Overall this results in an increase of the variable importance of the remaining covariate Price. Therefore, further research is needed to analyse to what extent the variable importance according to Definition 3.21 is influenced by correlated covariates. However, since the importance of Policy Age is significantly larger than that of Age, Figure 5.34a may suggest using Policy Age rather than Age in the GLM model. We conclude that the variable importance measure can also be used for variable selection. In particular, if the number of covariates becomes much larger, the tree algorithm and the associated variable importance may help to determine which variables


to include in a GLM model.

5.6.1 Out of Sample Performance

Analogously to the bivariate case in Section 5.5.3, we evaluate the performance of the above predictors out of sample. The validation data set contains all observations of accounting years 2011 to 2014 according to Data Set A with a full duration of one year at risk. The mean out of sample deviance (5.9) for each accounting year and method is shown in Table 5.3. Figure 5.35 shows the improvements

Mean Out of Sample Deviance
Year              2011       2012       2013       2014
Min CV (A)        0.7196545  0.6863301  0.7702547  0.7737576
1se (A)           0.7199627  0.6870283  0.7708778  0.7742580
Mid-Point (A)     0.7198493  0.6866617  0.7706358  0.7740121
Min CV Plus (A)   0.7200256  0.6868270  0.7706587  0.7740368
Min CV (B)        0.7196465  0.6864650  0.7701494  0.7738239
1se (B)           0.7205028  0.6873175  0.7707433  0.7744600
Mid-Point (B)     0.7199539  0.6867904  0.7703110  0.7738522
GLM (A)           0.7172632  0.6843318  0.7688065  0.7715159
Null (A)          0.7290833  0.6957688  0.7808736  0.7842622

Table 5.3 – Mean out of sample deviance for numerous multivariate models. The capital letters (A) and (B) indicate the data set which was used for fitting the method. The best method for each year is highlighted in bold face.

in mean sample deviance compared to the Null model. The GLM model performs consistently best for all years. For the tree predictors we again observe that the 1se rule is not able to fit a tree large enough to capture all relevant features and therefore performs consistently worst. The tree predictor Min CV Plus (A), derived from the model with additional covariates, performs slightly worse than the corresponding fit with fewer variables. Unfortunately, in contrast to the bivariate fit, the tree predictors are not able to compete with the GLM model. A possible explanation may be a large increase in variance of the tree predictor when including more explanatory variables. This is supported by the observation that the predictor Min CV Plus (A) performs worse than Min CV (A) and Min CV (B).

5.7 Summary

In the present chapter we applied the tree algorithm both to simulated data and to the Motor Insurance Collision Data Set of AXA Winterthur. In Section 5.3.2, we have observed that, due to the unbalanced nature of our data, the improvements of the optimal tree predictor over the Null model are relatively small. However, using stratified cross-validation, we managed to significantly decrease the uncertainty estimates of the cross-validation error.


Figure 5.35 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis (models: Min CV (A), 1se (A), Mid-Point (A), Min CV (B), 1se (B), Mid-Point (B), Min CV (Plus), GLM).

This is crucial since the most common selection rule for the tree size, the 1se rule, takes both the cross-validation error and its uncertainty estimate into account. Using simulated data, we were able to compute the optimal tree size with respect to the absolute frequency deviation (5.1). The analysis showed, even when using stratified sampling, that the one standard error rule is consistently over-pessimistic regarding the tree size. The simulation suggested that on average the minimum cross-validation error tree provides the best fit. However, note that a large variability in the chosen tree size is associated with the Min CV rule, see Figure 5.9. Our simulation also showed that the out of sample deviance provides very similar performance results as the absolute frequency deviation. This is of great importance since the absolute frequency deviation is not available if the true underlying model is not known. Analyzing the Motor Insurance Collision Data Set, we have observed that the Anova method is not able to correctly handle observations without a full year at risk. On the other hand, the methods Poisson and Weighted Anova meaningfully include the underlying volumes and also provide very similar fits. Since the Poisson method can be used in combination with the Poisson deviance, we conclude that for our data the Poisson method is the preferable choice. Using the tree algorithm to fit univariate regression trees, we were able to form subportfolios with similar risk characteristics, see Section 5.4. This has many potential applications in practice, in particular when faced with unordered covariates for which no structure is known a priori. In Section 5.6, we fit a multivariate regression tree for the claims frequency. However, using the out of sample deviance as a measure of performance, we conclude that a multivariate tree predictor is not able to compete with a classical GLM model in terms of predictive capacity.


Chapter 6

Data Analysis: Bagging

In this chapter, we apply the tree ensemble method bagging to insurance claims data. We use the

same structure as in the previous chapter: In the beginning, we analyze the algorithm using simulated

data and in a second step we apply the method to the Motor Insurance Collision Data Set.

6.1 Fitting a Bagged Tree Predictor in R

A bagged tree predictor can be obtained straightforward from the rpart package. For that purpose,

we have written a wrapper-function, which fits B regression trees to a bootstrapped data set and

returns a list storing all B predictors. The function can be used analogously to the rpart function

with two additional parameters maxnodes, for the individual tree size and ntree for the number of

bagged tree predictors to be fit. The corresponding R-Code can be found in Appendix C.4. We can

fit a bagged tree predictor using the function bagging as follows:

bagged <- bagging(cbind(v, N) ~ x1+...+xp, data = data,
                  method = "poisson", parms = list(shrink = 3.5),
                  control = rpart.control(minbucket = 4000, cp = 0.0002),
                  maxnodes = 8, ntree = 50)

Remark 6.1. Note that the cp parameter must be chosen small enough such that the unpruned tree has at least maxnodes leaves.
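Since the wrapper returns the individual trees rather than a combined object, predictions of the bagged predictor have to be aggregated over the B trees. The following is a minimal sketch of such an aggregation; the helper name predict_bagging and the simple averaging are our own assumptions and are not part of the rpart package or the thesis code:

predict_bagging <- function(trees, newdata){
  # for an rpart tree fitted with method = "poisson", predict() returns
  # the estimated claims frequency per observation
  preds <- sapply(trees, function(tree) predict(tree, newdata = newdata))
  # bagged predictor: average of the B individual tree predictions
  rowMeans(preds)
}

# e.g. predicted frequencies of the bagged predictor fitted above:
lambda_hat <- predict_bagging(bagged, newdata = data)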

In order to fit a bagged tree predictor, we need to determine two additional parameters: maxnodes

and ntree. Regarding the number of bootstrapped tree predictors (ntree), larger values provide a

better approximation of the bootstrap expectation of equation (3.16). Therefore the choice of ntree

is only limited by computing time. The size of the individual trees on the other hand is more directly

linked to the quality of the fit. From Section 3.4.1, we know that the benefit of bagging results purely


from a variance reduction. Since the bagged tree predictor has the same bias as an individual tree,

we can use our understanding of the regression tree methodology (see Chapter 3) for an optimal

choice of maxnodes. The choice of both parameters in the context of insurance claims count data is

discussed in greater detail in the next section.

6.2 Bagging for Simulated Claims Frequency

Analogously to Section 5.3, we start the analysis on a simulated data set. We use the model introduced in Section 4.3 with the parameters given in Assumption 4.9.

6.2.1 Tree Size of Individual Trees

First, we investigate the impact of the tree size upon the bagged tree predictor. For that purpose,

we have generated a learning sample L using Assumption 4.9. The first predictor was fit using

maxnodes = 5, the second using maxnodes = 50 and the third using maxnodes = 100. For all cases

we have used ntree = 100. Figure 6.1 summarizes the results and indicates that maxnodes = 5 is too

small. The bagged predictor is not capable of capturing the whole shape of the underlying structure.

For maxnodes = 50, the resulting fit is much closer to the true function. When increasing maxnodes

to 100, we see that the predictor becomes more wiggly, which we can interpret as a sign of overfitting. The reason for this behavior is that improvements through bagging are purely based on a variance reduction. As shown in Section 3.4.1, the bias of the bagged predictor equals the bias of a single tree. Considering the mean squared error as a measure of goodness, we should choose maxnodes such that the bias of an individual tree is minimized in order to get an optimal bagged tree predictor. Recall from Proposition 2.2 that we have the following decomposition,

\[
\mathbb{E}\Big[(Y - \hat{Y})^2\Big] \;=\; \mathbb{E}\Big[\mathrm{Bias}\big(\hat{f}(X,\mathcal{L})\,\big|\,X\big)^2\Big] \;+\; \mathbb{E}\Big[\mathrm{Var}\big(\hat{f}(X,\mathcal{L})\,\big|\,X\big)\Big] \;+\; \mathrm{Var}(\varepsilon). \tag{6.1}
\]

In Chapter 3, we have seen that a tree predictor with roughly 30 leaves minimizes the generalization error. Due to the bias-variance trade-off and the fact that the generalization error of the individual tree itself splits into bias and variance according to equation (6.1), we expect the optimal maxnodes parameter to be slightly larger than 30. The reason for choosing a tree slightly larger than the minimum cross-validation error tree is the bias-variance trade-off in the number of leaves: when growing a larger tree, we expect the bias to decrease and the variance to increase, and the additional variance is subsequently reduced by bagging.


[Figure: fitted claims frequency (0.05 to 0.20) as a function of the age of driver (20 to 90) for three bagged tree predictors: (a) maxnodes = 5, ntree = 100; (b) maxnodes = 50, ntree = 100; (c) maxnodes = 100, ntree = 100.]

Figure 6.1 – Simulation Results: Bagged tree predictors for different values of maxnodes.


6.2.2 Number of Trees

Theoretically, we can choose ntree as large as we want. However, limitations are imposed by computing time. In order to understand the dependency on the number of bagged trees, we repeated the analysis of the previous section for ntree ∈ {10, 100, 200}. The fitted predictors are shown in Figure 6.2. The figure shows that for our data set ntree = 100 seems to be reasonably large.

[Figure: fitted claims frequency as a function of the age of driver for ntree ∈ {10, 100, 200}, shown for maxnodes = 5 (left panel) and maxnodes = 50 (right panel).]

Figure 6.2 – Impact of parameter ntree on the bagged tree predictor.

6.2.3 Out of Sample Performance

In this section, we compare the predictive power of the bagged tree predictor to the single tree predictor using the out of sample deviance statistics introduced in Section 5.3.3 (a sketch of this deviance computation is given after the list below). We have generated 10 independent training data sets L_k for k = 1, . . . , 10 using Assumption 4.9 and an independent validation data set with n = 100000000 observations. Based on the training data sets, we have fit four predictors each:

1. Bagged Tree: A bagged tree predictor using maxnodes = 50 and ntree = 100.

2. Min CV Tree: A pruned tree predictor according to the minimum CV error rule using stratified

cross-validation.

3. 1se Tree: A pruned tree predictor according to the 1se rule using stratified cross-validation.

4. GLM Model: A GLM model with Poisson assumption, based on AXA's risk classes.
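As announced above, the following is a minimal sketch of how the mean out of sample Poisson deviance can be evaluated in R; the helper name mean_oos_deviance and the column names nc (claim counts), jr (durations) and the vector lambda_hat of predicted frequencies are our own assumptions and not part of the thesis code:

mean_oos_deviance <- function(nc, jr, lambda_hat){
  mu <- lambda_hat * jr                            # expected number of claims
  # unit Poisson deviance; the convention 0 * log(0) = 0 is used for nc = 0
  dev_i <- 2 * (ifelse(nc > 0, nc * log(nc / mu), 0) - (nc - mu))
  mean(dev_i)
}

# e.g. for the bagged tree predictor on a validation data set valid:
# mean_oos_deviance(valid$nc, valid$jr, predict_bagging(bagged, valid))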

The resulting deviances are shown in Figure 6.3. The simulation study shows that bagging not only improves the accuracy of the tree predictor, but also reduces the variability of the out of sample deviance statistics. Note that bagging results in a deviance reduction even though in general we could


only prove this for the mean squared error, see Section 3.4.1. Additionally, we observe that, whereas the GLM outperforms the classical regression trees, it is in turn outperformed by the bagged tree predictor.

[Figure: boxplots of the out of sample deviance (axis range roughly 469'600 to 470'800) for Bagging, Min CV, 1se and GLM.]

Figure 6.3 – Performance of the bagged tree predictor in comparison to other models using the out of sample deviance.

6.3 Bagging for Real Data Claims Frequency

In this section, we fit the bagged tree predictor to the Motor Insurance Collision Data Set. We start

with a bivariate predictor and then generalize to higher dimensions.

6.3.1 Bivariate Bagged Tree Predictor

To begin with, we fit a bagged tree predictor using the covariates age of driver and price of car, analogously to Section 5.5. Based on the findings of Section 5.5.3, we choose maxnodes = 50 and, based on Section 6.2.2, we set ntree = 100 (a sketch of the corresponding call is given after this paragraph). Figure 6.4b shows the resulting bagged tree predictor and Figure 6.4a an individual tree used for aggregation. We see that bagging reduces the discrete nature of a regression tree. Table 6.1 compares the mean out of sample deviance of the bagged tree predictor with the optimal tree predictor (see Section 5.5.3) as well as the GLM and the Null model introduced in Section 5.5. We conclude that bagging has significantly improved the tree predictor, making it consistently better than the GLM model.


[Figure: predicted claims frequency (colour scale from 0.05 to 0.25) over age of driver (20 to 90) and price of car, for (a) an individual tree predictor and (b) the bagged tree predictor.]

Figure 6.4 – Visualization of the bivariate bagged tree predictor.

Mean Out of Sample Deviance

Year 2011 2012 2013 2014

Bagging 0.7241693 0.6907991 0.7757639 0.7786711

Tree 0.7246993 0.6911062 0.7763071 0.7792221

GLM 0.7247417 0.6911917 0.7760849 0.7791561

Null 0.7290833 0.6957688 0.7808736 0.7842622

Table 6.1 – Out of sample performance of the bivariate bagged tree predictor versus other methods. The best method for each year is highlighted in bold face.

6.3.2 Multivariate Bagged Tree Predictor

Next we generalize the predictor using additional covariates. First, we fit a multivariate bagged tree predictor using the same covariates as included in AXA's GLM model. According to Section 5.6, we expect the optimal tree size of an individual tree to be roughly between 60 and 80 leaves. Based on the findings in Section 6.2, we set the parameter ntree = 100 and compare different choices of the tuning parameter maxnodes ∈ {60, 70, 80, 90, 100}. The performance was evaluated out of sample using the mean out of sample deviance (5.9). The results are summarized in Table 6.2. The different predictors are named according to the chosen maxnodes parameter. Analogously to Section 5.6, we also show the results when adding further covariates to the predictor (see Table 5.2). The corresponding out of sample performances for various choices of parameters are also shown in Table 6.2 and denoted by the suffix (l). In contrast to the single tree predictor, see Section 5.6,


adding additional covariates results in an improvement of the predictive capacity. The predictors

Mean Out of Sample Deviance

Year        2011        2012        2013        2014
m60         0.7177994   0.6849366   0.7683892   0.7720549
m70         0.7175936   0.6847973   0.7682821   0.7719311
m80         0.7174927   0.6847499   0.7681714   0.7718043
m90         0.7173910   0.6846422   0.7680176   0.7716766
m100        0.7143560   0.6823380   0.7720946   0.7759560
m80 (l)     0.7174013   0.6843935   0.7678068   0.7715305
m90 (l)     0.7212701   0.6776055   0.7749489   0.7787537
m100 (l)    0.7213253   0.6770380   0.7750003   0.7787436

Table 6.2 – Performance of the bagged tree predictor for different choices of tuning parameters. The suffix (l) marks the models with additional covariates according to the Tree Plus model shown in Table 5.2. The best method for each year is highlighted in bold face.

with the best out of sample performance are highlighted in bold face. We see that for both models maxnodes ∈ [80, 90] seems to be a reasonable choice of the tuning parameter. This is in line with our expectation based on the analysis of the multivariate tree predictor in Section 5.6 and the discussion in Section 6.2.1.

6.3.3 Out of Sample Performance

So far we have fit both bivariate and multivariate bagged tree predictors and discussed the choice

of the tuning parameter maxnodes. Next, we compare their performance out of sample to the

multivariate tree predictor (see Section 5.6), the GLM model as well as the Null model. Again, we

use the mean out of sample deviance as a measure of performance. Table 6.3 shows the corresponding

values for each method and accounting year. In Figure 6.5 we show the improvements with respect

Mean Out of Sample Deviance

Year                 2011        2012        2013        2014
Bagged (m90)         0.7173910   0.6846422   0.7680176   0.7716766
Bagged (m80, l)      0.7174013   0.6843935   0.7678068   0.7715305
Bagged (bivariate)   0.7241693   0.6907991   0.7757639   0.7786711
Tree                 0.7196545   0.6863301   0.7702547   0.7737576
GLM                  0.7172632   0.6843318   0.7688065   0.7715159
Null                 0.7290833   0.6957688   0.7808736   0.7842622

Table 6.3 – Out of sample performance of the multivariate frequency models. The best method for each year is highlighted in bold face.

to the mean out of sample deviance of the Null model. We conclude that the multivariate bagged

tree predictor is able to compete in terms of predictive capacity with the GLM model. In contrast to


[Figure: improvement versus the Null model (approximately 0.009 to 0.013) per accounting year 2011–2014 for Bagging (maxnodes = 90), Bagging (maxnodes = 80, large), Tree (Min CV) and GLM.]

Figure 6.5 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis.

Section 5.6, the inclusion of additional covariates increased the predictive capacity. This shows the

ability of the bagged tree predictor to cope with many (possibly correlated) covariates. In situations

where the structure of the data is not known a priori, this is a large benefit over the GLM model. Note

that we have used out of sample testing to calibrate the parameter maxnodes. Hence the comparison

against the GLM is slightly biased in favour of the bagged tree predictor.

6.3.4 Visualization

Bagging is a black box model, i.e. there is no direct way to fully describe the fitted predictor in a

closed form. However, for an insurance company it is essential not only to achieve high predictive capacity but also to understand the influence of individual covariates on the predicted values. In order to understand the rationale behind the predictions, we proceed with visualizing the model by

applying the methods introduced in Section 3.4.4.

Variable Importance:

First we compare the variable importance of the bagged tree predictors Bagging (m90) and Bagging

(m80, l). Figure 6.6 shows the importance plot for both predictors. Comparing the results to the

variable importance plots of the multivariate tree predictors shown in Figures 5.32 and 5.34 we

observe a similar picture. We again note that some of the additionally included covariates appear

to have a very prominent impact. Similar to the discussion for multivariate trees in Section 5.6,

caution is required when interpreting the variable importance. Correlations among the covariates

can disturb the picture. Figure B.1 in the Appendix shows the presence of high correlations between

the original and additionally added covariates. For example Horse Power and Price are highly


[Figure: variable importance plots (vertical axis 0 to 100). (a) Bagging (m90), in decreasing order: Price, Age, Weight, Deduct., Region, Nat., Bonus, Car Age, Leasing, Car Class, Km. (b) Bagging (m80, l), in decreasing order: Policy Age, Price, Weight, Office, Age, Deduct., Bonus, Horse Power, Car Age, Nat., Car Class, Region, Leasing, Km, Sex.]

Figure 6.6 – Variable importance plots for the bagged tree predictors Bagging (m90) and Bagging (m80, l).

correlated; removing Horse Power may significantly increase the importance of Price. Note that, since we have seen that the tree predictor is able to filter informative from non-informative covariates (see Section 5.3.6), the variable importance plot of a bagged tree predictor provides a useful tool for variable selection if a large number of explanatory variables is available.
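A possible way to obtain such an importance plot for a bagged predictor is sketched below; the helper name bagged_varimp, the averaging over the individual trees and the rescaling to a maximum of 100 are our own assumptions, building on the varImp() function listed in Appendix C.1:

bagged_varimp <- function(trees){
  # variable importance of each individual tree, as returned by varImp()
  imp_list <- lapply(trees, function(tree){
    imp <- varImp(tree)
    setNames(imp$Overall, rownames(imp))
  })
  vars <- sort(unique(unlist(lapply(imp_list, names))))
  imp_mat <- sapply(imp_list, function(x) x[vars])   # variables x trees
  imp_mat[is.na(imp_mat)] <- 0                       # variable not used in a tree
  rownames(imp_mat) <- vars
  avg <- rowMeans(imp_mat)
  sort(100 * avg / max(avg), decreasing = TRUE)      # rescale to a maximum of 100
}

# e.g. barplot(bagged_varimp(bagged), las = 2)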

Individual Conditional Expectation Plots:

Next we present the partial dependence and individual conditional expectation (ICE) plots (see

Section 3.4.4) for the bagged tree predictor Bagging (m90). Figure 6.7 shows the numerical covariates

and Figure 6.8 the categorical covariates. Overall we observe very similar characteristics as expected

from the univariate analysis in Section 4.2.2, suggesting that the bagged tree predictor is able to

capture all relevant features of the Motor Insurance Collision Data Set. The results are also in line

with the variable importance plots shown in Figure 6.6. For example, the covariate kilometres yearly driven does not show much differentiation over the whole range of values, which supports the finding of Figure 6.6 that it has only a marginal influence.

Also note that the majority of lines for all covariates are parallel. This indicates that there are no interactions in the sense that the influence of a particular covariate changes depending on the values of other covariates, see Section 3.4.4 for more details. For better visibility, Figures 6.7 and 6.8 show the individual conditional expectation curves for a random subsample (n = 10000) of the data. For the categorical covariates, we have connected the predictions to maintain the interpretation of parallel curves.
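A minimal sketch of how such ICE and partial dependence curves can be computed for the bagged tree predictor is given below; the helper name ice_curves and the reuse of the predict_bagging sketch from Section 6.1 are our own assumptions:

ice_curves <- function(trees, data, covariate, grid, n_sub = 10000){
  # random subsample of the data, as used for Figures 6.7 and 6.8
  sub <- data[sample.int(nrow(data), min(n_sub, nrow(data))), ]
  curves <- sapply(grid, function(g){
    tmp <- sub
    tmp[[covariate]] <- g                  # set the covariate to the grid value
    predict_bagging(trees, tmp)            # predicted frequency per observation
  })                                       # matrix: observations x grid points
  list(grid = grid, ice = curves, pdp = colMeans(curves))
}

# e.g. ICE and partial dependence curves for the age of driver:
# res <- ice_curves(bagged, data, covariate = "age", grid = seq(20, 90, by = 5))
# matplot(res$grid, t(res$ice), type = "l", col = "grey")
# lines(res$grid, res$pdp)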


Figure 6.7 – Individual conditional expectation plots for numerical covariates.

Figure 6.8 – Individual conditional expectation plots for categorical covariates.


6.4 Summary

In the present chapter, we applied the bagged tree predictor to insurance claims data. In Section 6.2, we observed that maxnodes is the relevant parameter in terms of predictive capacity. We conclude that a reasonable choice can be obtained when using a value slightly larger than the size of the minimum cross-validation error tree. For the parameter ntree, the larger the better; however, in Section 6.2.2 we have seen that for our data ntree = 100 is already sufficiently large to obtain a good approximation of the bootstrap expectation. Although we could not prove that bagging positively affects the predictive capacity in terms of the Poisson deviance statistics, we observed an improvement of the bagged tree predictor compared to an individual tree predictor (see Figure 6.3 and Figure 6.5). When applying the method to the Motor Insurance Collision Data Set, we observed that, in contrast to the individual tree predictor, including additional variables does not negatively affect the predictive capacity. However, caution is required when interpreting the variable importance plots. In terms of out of sample performance, the bagged tree predictor was able to compete with the GLM model (see Figure 6.5). We conclude that bagging provides a competitive alternative to the well-known GLM model. In particular, it can also be used when little or nothing is known about the structure of the data. It is also able to deal with many covariates, including non-informative or correlated variables. In order to visualize the model, we have analyzed the ICE plots. The observed characteristics are in line with our observations from the univariate analysis.


Chapter 7

Data Analysis: Random Forest

In this chapter, we apply the random forest algorithm introduced in Section 3.4.3 to insurance claims data. In the beginning, we analyze the algorithm using simulated data and in a second step we apply the method to the Motor Insurance Collision Data Set.

7.1 Fitting a Random Forest Predictor in R

We use the R package randomForest to fit the random forest predictor. Note that the implementation

uses the mean squared error as the minimization criterion. The syntax to fit a random forest is very

similar to bagging. Let y = N/v be the response and x1, . . . , xp the covariates, then a random forest

can be fit by:

rf <- randomForest(y ~ x1+...+xp, data = data, mtry = floor(p/3), ntree = 100,

maxnodes = 30)

The additional parameter mtry specifies the number of randomly selected covariates among which the optimal split is searched for, see Algorithm 3.3.

Unfortunately, there currently exists no implementation of the random forest algorithm in R that is based on the weighted mean squared error or the Poisson deviance statistics. This is suboptimal for an application to insurance claims data, which contains outliers due to small durations. A further drawback of the randomForest implementation is that it does not support surrogate splits and therefore cannot handle missing data.

7.2 Random Forest for Simulated Claims Frequency

In order to analyze the performance of the random forest algorithm, we perform a simulation study, where we simulate 10 training data sets L_k for k = 1, . . . , 10 based on Assumption 4.9. Additionally,


we generate a test data set with n = 100000000 observations. Next we fit the random forest predictor

to each training data set and validate the performance out of sample on the test data set. Note that

when simulating from Assumption 4.9, the data only contains two covariates. On the other hand,

random forest is designed for high-dimensional problems and hence may not unfold its full potential in this simulation study. In particular, this setting is not suitable for analyzing the tuning parameter mtry. However, we can use it to test different choices of maxnodes ∈ {5, 10, 20, 30, 50, 70, 90, 110}. We set mtry = 1 and, based on Section 6.2.2, we use ntree = 100. Figure 7.1 shows the performance of the

random forest predictors and in addition the GLM model as well as the bagged tree predictor. We

observe that the optimal choice of maxnodes is again obtained at 50. This is again explained by

the fact that random forest is based on a variance reduction whilst only marginally changing the

bias; for more details see Section 6.2.1. Although the performance of the random forest predictor is slightly worse than that of the bagged tree predictor, this may change when considering a problem of higher

dimensionality as explained above. Compared to the GLM model, the random forest predictor

performs consistently better. Also note that the slightly poorer performance of the random forest

[Figure: boxplots of the out of sample deviance (axis range roughly 469'500 to 471'500) for the random forest predictors rf (m = 110), rf (m = 90), rf (m = 70), rf (m = 50), rf (m = 30), rf (m = 20), rf (m = 10), rf (m = 5), as well as Bagging and GLM.]

Figure 7.1 – Performance of the random forest for different choices of maxnodes (abbreviated by m), the optimal bagged tree predictor as well as the GLM model on a simulated claims frequency data set.

predictor compared to the bagged tree predictor may originate from the fact that the random forest

implementation is based on minimizing the mean squared error, whereas for our problem the Poisson

deviance statistics seems more appropriate. This issue becomes much more severe when the data

contains a large number of observations without a full duration.

7.3 Random Forest for Real Data Claims Frequency

Next we apply the random forest algorithm to the Motor Insurance Collision Data Set. We include

the covariates used in AXA’s GLM model. The training data was preprocessed according to both

Data Set A and B, see Section 4.2.1. Since the random forest algorithm does not take the volume


explicitly into account, we expect the performance to be negatively affected by small durations. Therefore we perform the analysis on both data sets. We set the number of random forest trees to ntree = 100 and the size of a single tree to maxnodes = 90, based on the findings in Section 6.3.2 as well as Section 7.2. For the tuning parameter mtry, we fit two random forest predictors with mtry ∈ {3, 8}, where 3 corresponds to the rule of thumb mtry = ⌊p/3⌋. The performance is evaluated out of sample using the mean out of sample deviance. Table 7.1 shows the obtained results including the Null model for comparison. We see that the random forest predictors are not even able to compete with the Null model. In order to understand the reason for the bad performance, we have shown

Mean Out of Sample Deviance

Year                          2011        2012        2013        2014
randomForest (A, mtry = 3)    0.8093080   0.7900853   0.8774162   0.8656559
randomForest (A, mtry = 8)    0.8270472   0.8099446   0.9041067   0.8947512
randomForest (B, mtry = 3)    0.7889306   0.7663895   0.8546950   0.8434086
randomForest (B, mtry = 8)    0.8124648   0.7953953   0.8939658   0.8755047
Null                          0.7290833   0.6957688   0.7808736   0.7842622

Table 7.1 – Out of sample performance of various random forest predictors based on different training data sets and choice of mtry parameters. For comparison the performance of the Null model is also shown.

the ICE plots of the covariates age and price in Figure 7.2 based on the random forest predictor

randomForest(B, mtry = 3). It reveals that the predicted values are heavily distorted.

[Figure: partial dependence / ICE curves of the random forest predictor randomForest(B, mtry = 3) for the covariates price (0 to 250'000) and age (20 to 100); the vertical axis (partial dependence) ranges up to about 80.]

Figure 7.2 – ICE plot for the random forest predictor randomForest(B, mtry=3) showing heavily distorted predictions.


7.3.1 Restriction to Full Durations

Based on the findings of Figure 7.2, we suspect that the random forest algorithm is not able to deal

with frequency outliers due to small durations. Since preparing the data according to Data Set B

did not solve the issue, we fit the random forest predictor again, this time only using observations with a full year at risk. To assess the quality of the predictor, the mean out of sample deviance was

evaluated. Table 7.2 shows the results for the random forest predictors as well as the Null model. We

Mean Out of Sample Deviance

Year                                  2011        2012        2013        2014
randomForest (B, yr = 1, mtry = 3)    0.7184949   0.6834849   0.7704677   0.7745797
randomForest (B, yr = 1, mtry = 8)    0.7188954   0.6829554   0.7706336   0.7693152
Null                                  0.7290833   0.6957688   0.7808736   0.7842622

Table 7.2 – Out of sample performance for random forest predictors based on the data set restricted to full durations only. The best method for each year is highlighted in bold face.

observe that when removing the small durations, the random forest predictor drastically improves.

Note that so far we have always fixed maxnodes = 90. However, one may conjecture that the optimal choice of maxnodes also depends on the parameter mtry. In order to better understand the impact of the tuning parameters, we have performed a grid search for the optimal parameters, varying mtry ∈ {2, 4, . . . , 10} and maxnodes ∈ {70, 80, 90, 100, 110, 120}. For the analysis, we fit the random forest predictor on the data of 2010 (restricted to full durations) and evaluated the mean out of sample deviance on the data of 2011. The results are shown in Figure 7.3, where the performance is visualized graphically from best (white) to worst (dark blue).

[Figure: heat map over the grid maxnodes ∈ {70, . . . , 120} (horizontal axis) and mtry ∈ {2, . . . , 10} (vertical axis), colour scale from 0.7179 to 0.7191.]

Figure 7.3 – Improvement of the random forest predictors versus the Null model.

This suggests using the parameters mtry = 6 and maxnodes = 120. We clearly see that mtry = 2 is too small; however, from Figure 7.3 we do not see a large difference in the performance when varying mtry ∈ {4, 6, 8, 10}. Note that


our problem is still rather low-dimensional with only 11 covariates. The benefit due to the choice of

mtry may increase when considering a larger set of explanatory variables.
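A sketch of the grid search described above is given below; the data frame names train2010 and valid2011 (both restricted to full durations), the response column freq = nc/yr (with train2010 assumed to contain only freq and the eleven covariates) and the reuse of the mean_oos_deviance sketch from Section 6.2.3 are our own assumptions and serve only as an illustration of the procedure:

library(randomForest)

# grid of tuning parameters as used in the analysis above
grid <- expand.grid(mtry = seq(2, 10, by = 2),
                    maxnodes = seq(70, 120, by = 10))
grid$oos_dev <- NA
for(i in seq_len(nrow(grid))){
  rf <- randomForest(freq ~ ., data = train2010,     # freq = nc / yr
                     mtry = grid$mtry[i], maxnodes = grid$maxnodes[i],
                     ntree = 100)
  lambda_hat <- predict(rf, newdata = valid2011)
  grid$oos_dev[i] <- mean_oos_deviance(valid2011$nc, valid2011$yr, lambda_hat)
}
grid[which.min(grid$oos_dev), ]   # best combination of mtry and maxnodes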

In order to compare the random forest method to the other predictors, Table 7.3 shows the out

of sample performance for the optimal random forest predictor, the optimal bagged tree predictor

(see Section 6.3.3), GLM and Null model. In Figure 7.4 the improvements versus the Null model are

shown. We conclude that the random forest predictor has a similar predictive capacity; however, there seems to be a larger variability between the years.

Mean Out of Sample Deviance

Year                             2011        2012        2013        2014
rf (maxnodes = 120, mtry = 8)    0.7179291   0.6829232   0.7698701   0.7738687
Bagging                          0.7174013   0.6843935   0.7678068   0.7715305
GLM                              0.7172632   0.6843318   0.7688065   0.7715159
Null                             0.7290833   0.6957688   0.7808736   0.7842622

Table 7.3 – Out of sample performance for random forest predictors based on the data set restricted to full durations only. The best method for each year is highlighted in bold face.

[Figure: improvement versus the Null model (approximately 0.0105 to 0.0125) per accounting year 2011–2014 for randomForest (maxnodes = 120, mtry = 6), Bagging (maxnodes = 80, large) and GLM.]

Figure 7.4 – Improvements of the mean out of sample deviance versus the Null model for the multivariate claims frequency analysis.

7.3.2 Visualization

In the previous section, we have fit a random forest predictor to the data restricted to observations

with a full duration. The obtained predictor is able to compete with the bagged tree predictor as

well as the GLM model in terms of mean out of sample deviance. However, both for the reasons explained in Section 6.3.4 and in order to understand the variability observed in Figure 7.4, we


show the partial dependence and individual conditional expectation plots for the random forest

predictor. Figure 7.5 shows the numerical covariates and Figure 7.6 the categorical covariates for the

predictor rf(maxnodes = 120, mtry = 8). Looking at the estimated partial dependence function, we

can conclude that the random forest predictor is also able to capture the characteristics of the Motor

Insurance Collision Data Set. However the individual conditional expectation curves reveal that for

same covariate levels the random forest predictor fits unusual high frequencies, e.g. for young drivers

the frequencies go up to 60%. It is likely that these large values lead to the variability observed in

Figure 7.4. Hence restricting the data to full durations does not lead to a random forest predictor

which can compete with the bagged tree predictor.

Figure 7.5 – Individual conditional expectation plots for numerical covariates.

7.4 Summary

In the present chapter, we applied the random forest algorithm to insurance claims count data. Using

simulated data, we showed that the tuning parameter maxnodes for the random forest algorithm

can be chosen similar to the maxnodes parameter of the bagged tree predictor. For choosing the

parameter mtry, we used a grid search procedure to calibrate the model. We only observed small

differences when varying mtry on our data set. With respect to modeling insurance claims data,

the implementation does unfortunately not provide a meaningful way to include the underlying

volumes of the observations into the model. The random forest implementation in the R package

randomForest is based on minimizing the mean squared error without an option to modify the


Figure 7.6 – Individual conditional expectation plots for categorical covariates.

optimization criterion. For that reason, when applying the method to a real data set, such as the

Motor Insurance Collision Data Set, the predicted values are highly distorted by the large frequencies

due to small durations. A possible solution to this issue is to restrict the data to observations with a

full duration of one year at risk only. Although the predictive capacity measured by the out of sample

deviance was in the range of a GLM model and the bagged tree predictor, the ICE plots revealed

that the method predicts unusually large values for certain covariate values. Note that the benefits

of random forest may only fully unfold if a larger number of explanatory variables is considered.

Our multivariate analysis only contained 11 covariates for which the improvements of using random

forest were not sufficiently large to overcome the drawback of using the mean squared error instead

of the Poisson deviance statistics for growing an individual random forest tree.


Chapter 8

Summary

In this master thesis we have analyzed tree-based algorithms to predict the expected claims fre-

quency of a policyholder of a non-life insurance product. We have looked at three di↵erent methods:

regression tree, bagging and random forest. From the basics of non-life insurance pricing, we observe

that is it crucial to take the underlying volume into account. Therefore, we introduced the Poisson

deviance statistics as a measure of goodness. Conceptually, the tree algorithm can be easily adap-

ted to other performance measures. Fortunately, there already exists an e�cient implementation

within the rpart package for growing regression trees based on the deviance statistics derived from

the Poisson distribution. This extension showed to be a reasonable approach to the problem of

modeling claims frequencies. In order to deal with the degenerate case, the rpart implementation

assumes a Poisson-Gamma model. However the calibration of both the prior mean and the shrinkage

parameter is not optimal from a credibility point of view. Furthermore, additional flexibility of the

implementation would be desirable. In general the package does not provide detailed documentation

and hence makes it less attractive to many potential users. Sections 3.2.1 and 5.1 provide a detailed

introduction to Poisson regression trees with respect to the rpart implementation. In the empirical

analysis, univariate regression trees provided an analytical tool to form subportfolios with similar

risk characteristics. However multivariate regression trees did not provide competitive alternatives

to GLM models in terms of out of sample performance. In order to improve the performance, we

have analyzed the variance reduction technique bagging. Although no decomposition of the Pois-

son deviance statistics into bias and variance is known, we have observed a significant performance

improvement both on simulated and real world data. Overall the bagged tree predictor provides

a competitive alternative to a GLM model with the advantage that the underlying structure does

not need to be provided in advance. In order to further improve the bagged tree predictor, we

then evaluated the performance of the random forest algorithm, which is based on a further variance

reduction by reducing the correlation between the individual trees within the ensemble. However, so

far there does not exist an implementation using the Poisson deviance statistics rather than the mean

squared error for growing the individual trees. As a consequence, we were not able to meaningfully


include the volume measure into the model and therefore random forest did not provide a competitive alternative. Since bagging and random forest are black box models, we considered several model

visualization techniques. The results supported the fact that the bagged tree predictor is able to

recover all characteristics of an insurance claims data set.

8.1 Future Work

This thesis put a strong focus on the empirical analysis of tree-based methods in the context of insurance claims data, based on the theoretical understanding described in the literature. The structure of the data led to a modification of the classical algorithm to the Poisson deviance statistics rather

than the mean squared error. However, the theoretical results on performance improvement due to

bagging and random forest are crucially based on the use of the mean squared error. Nonetheless,

when using the Poisson deviance statistics we have observed a significant improvement. Further

research may be able to establish a theoretical foundation for our findings. From an implementation

perspective, further improvements of the bagged tree predictor may be possible when implementing

the random forest algorithm based on the Poisson deviance statistics as a measure of performance

for growing the individual trees.

Regarding the regression tree method, the Poisson option assumes a Poisson-Gamma model to

avoid the problem of degeneracy. However, further research is needed on how the Poisson-Gamma

model needs to be calibrated in a tree model for insurance claims data.

Furthermore, we have seen that the black box procedure bagging was able to compete with a

GLM model in terms of predictive capacity. Therefore, one may also consider different methods such as neural networks or support vector machines for further research. However, our work suggests that

being able to include the duration is a crucial component of a reasonable frequency model.

Finally, as described by equation (4.1), there are two components to the pricing of a non-life insurance product: the expected claims frequency as well as the expected claim severity. This work only focused on the frequency aspect. However, we expect that large parts of the results are also adaptable to the severity problem. A measure of performance may be derived from the Gamma distribution, which is also commonly used within GLM models for the claim severity.
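As a sketch of such a starting point (this formula is not derived in this thesis), the unscaled deviance statistics of the Gamma distribution for strictly positive observations $y_i$ and fitted means $\mu_i$ takes the form
\[
D(\boldsymbol{y}, \boldsymbol{\mu}) \;=\; 2 \sum_{i=1}^{n} \left( \frac{y_i - \mu_i}{\mu_i} \;-\; \log\frac{y_i}{\mu_i} \right),
\]
which could replace the Poisson deviance statistics as the splitting criterion when growing severity trees.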


Appendices


Appendix A

Proofs

A.1 Proof of Equation (3.2)

Proof. We need to find $c_m$ such that
\[
c_m \;=\; \underset{c_m \in \mathbb{R}}{\operatorname{argmin}} \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m)^2 .
\]
Differentiation with respect to $c_m$ and setting the derivative equal to zero gives
\[
\frac{\partial}{\partial c_m} \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m)^2 \;=\; (-2) \sum_{\{i:\, x_i \in R_m\}} (y_i - c_m) \;=\; 0 .
\]
Solving for $c_m$ ends the proof.

A.2 Proof of Proposition 3.7

Proof. Let $T$ be a tree having $n$ leaves with root $T_0$ and primary branches $T_L$ and $T_R$ (see Figure A.1). We use the notation $T = T_0 + T_L + T_R$. Assuming existence of $T(\alpha)$, uniqueness follows by choosing the smallest one within a finite set of possible subtrees. The proof of existence follows by induction. The assertion is clear for $|T| = 1$. Suppose it is true for all trees having fewer than $n \ge 2$ leaves. Note that
\[
C_\alpha\big(T(\alpha)\big) \;=\; \min\Big\{\, C_\alpha(T_0),\; C_\alpha\big(T_L(\alpha)\big) + C_\alpha\big(T_R(\alpha)\big) \,\Big\},
\]
where $T_L(\alpha)$ and $T_R(\alpha)$ are the smallest optimal subtrees of $T_L$ resp. $T_R$. If $C_\alpha(T_0) \le C_\alpha\big(T_L(\alpha)\big) + C_\alpha\big(T_R(\alpha)\big)$, then $T(\alpha) = T_0$; otherwise $T(\alpha) = T_0 + T_L(\alpha) + T_R(\alpha)$. Since $|T_L| < n$ and $|T_R| < n$, by hypothesis $T_L(\alpha)$ and $T_R(\alpha)$ exist and are unique.

[Figure: the root $T_0$ with primary branches $T_L$ and $T_R$.]

Figure A.1 – Illustration of the primary branches $T_L$ and $T_R$ of $T$.

It remains to show that for $\alpha_1 \le \alpha_2$ we have $T(\alpha_2) \subset T(\alpha_1)$. We again show this by induction. Since for larger $\alpha$ we prune towards smaller trees, it follows that if $|T(\alpha_1)| = 1$ then also $|T(\alpha_2)| = 1$. Again assume that the statement holds true for all trees having fewer than $n \ge 2$ leaves. Using the same argument as above, either $T(\alpha_2) = T_0$, in which case we are done, or $T(\alpha_2) = T_0 + T_L(\alpha_2) + T_R(\alpha_2)$. Since both $T_L$ and $T_R$ have fewer than $n$ leaves, it holds by hypothesis that $T_L(\alpha_2) \subset T_L(\alpha_1)$ and $T_R(\alpha_2) \subset T_R(\alpha_1)$. This implies $T(\alpha_2) \subset T(\alpha_1)$.

A.3 Proof of Equation (3.7)

Proof. In order to calculate the deviance statistics for $\boldsymbol{y} = (N_1, \ldots, N_n)$, with independent $N_i = N|_{X = x_i} \sim \mathrm{Poi}\big(\lambda(x_i) v_i\big)$, we need to express the log-likelihood as a function of the mean vector $\boldsymbol{\mu} = \big(\mathbb{E}(N_1), \ldots, \mathbb{E}(N_n)\big) = \big(\lambda(x_1) v_1, \ldots, \lambda(x_n) v_n\big)$. It follows that
\begin{align*}
\ell(\boldsymbol{\mu}) &= \sum_{i=1}^{n} \log\left( \frac{(\lambda(x_i) v_i)^{N_i}}{N_i!} \exp\big(-\lambda(x_i) v_i\big) \right) \\
&= \sum_{i=1}^{n} N_i \log\big(\lambda(x_i) v_i\big) - \log(N_i!) - \lambda(x_i) v_i .
\end{align*}
Therefore the deviance statistics is given by
\begin{align*}
D(\boldsymbol{y}, \boldsymbol{\mu}) &= 2\big(\ell(\boldsymbol{y}) - \ell(\boldsymbol{\mu})\big) \\
&= 2 \sum_{i=1}^{n} \Big( N_i \log(N_i) - \log(N_i!) - N_i - N_i \log\big(\lambda(x_i) v_i\big) + \log(N_i!) + \lambda(x_i) v_i \Big) \\
&= 2 \sum_{i=1}^{n} N_i \log\left( \frac{N_i}{\lambda(x_i) v_i} \right) - \big(N_i - \lambda(x_i) v_i\big) .
\end{align*}

A.4 Proof of Equation (3.12)

Proof. We need to find $c_m$ such that
\[
c_m \;=\; \underset{c_m \in \mathbb{R}}{\operatorname{argmin}} \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m)^2 .
\]
Differentiation with respect to $c_m$ and setting the derivative equal to zero gives
\[
\frac{\partial}{\partial c_m} \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m)^2 \;=\; (-2) \sum_{\{i:\, x_i \in R_m\}} w_i (y_i - c_m) \;=\; 0 .
\]
Solving for $c_m$ ends the proof.

A.5 Proof of Remark 4.6

Proof. Recall that $\overline{N}^i = \sum_{j=1}^{m_i} N^i_j$ and $v^i = \sum_{j=1}^{m_i} v_j$. Then it follows that
\begin{align*}
&\mathbb{P}\left( \frac{\overline{N}^i}{v^i} \in \left[ \lambda_i - \lambda_i\, \mathrm{Vco}\!\left(\frac{\overline{N}^i}{v^i}\right),\; \lambda_i + \lambda_i\, \mathrm{Vco}\!\left(\frac{\overline{N}^i}{v^i}\right) \right] \right) \\
&= \mathbb{P}\left( \frac{\overline{N}^i}{v^i} \le \lambda_i + \lambda_i\, \mathrm{Vco}\!\left(\frac{\overline{N}^i}{v^i}\right) \right) - \mathbb{P}\left( \frac{\overline{N}^i}{v^i} \le \lambda_i - \lambda_i\, \mathrm{Vco}\!\left(\frac{\overline{N}^i}{v^i}\right) \right) \\
&= \mathbb{P}\left( \frac{\overline{N}^i - v^i \lambda_i}{v^i \lambda_i\, \mathrm{Vco}\big(\overline{N}^i/v^i\big)} \le 1 \right) - \mathbb{P}\left( \frac{\overline{N}^i - v^i \lambda_i}{v^i \lambda_i\, \mathrm{Vco}\big(\overline{N}^i/v^i\big)} \le -1 \right) \\
&= \mathbb{P}\left( \frac{\sum_{j=1}^{m_i} N^i_j - \mathbb{E}\big(\sum_{j=1}^{m_i} N^i_j\big)}{\sqrt{\mathrm{Var}\big(\sum_{j=1}^{m_i} N^i_j\big)}} \le 1 \right) - \mathbb{P}\left( \frac{\sum_{j=1}^{m_i} N^i_j - \mathbb{E}\big(\sum_{j=1}^{m_i} N^i_j\big)}{\sqrt{\mathrm{Var}\big(\sum_{j=1}^{m_i} N^i_j\big)}} \le -1 \right) \\
&\approx \Phi(1) - \Phi(-1) \approx 70\%,
\end{align*}
where we used the central limit theorem in the fourth step.


A.6 Proof of Lemma 5.3

Proof. Recall that for $N^i_j \sim \mathrm{NegBin}(\lambda_i v_j, \delta)$ we have $\mathbb{E}(N^i_j) = \lambda_i v_j$ and $\mathrm{Var}(N^i_j) = \lambda_i v_j \big(1 + \lambda_i v_j/\delta\big)$. Then the expression for the coefficient of variation in Lemma 5.3 follows by a direct calculation,
\begin{align*}
\mathrm{Vco}\!\left(\frac{\overline{N}^i}{v^i}\right)
&= \sqrt{\mathrm{Var}\!\left( \frac{\sum_{j=1}^{m_i} N^i_j}{\sum_{j=1}^{m_i} v_j} \right)} \Bigg/\; \mathbb{E}\!\left( \frac{\sum_{j=1}^{m_i} N^i_j}{\sum_{j=1}^{m_i} v_j} \right) \\
&= \sqrt{\sum_{j=1}^{m_i} \mathrm{Var}\big(N^i_j\big)} \Bigg/\; \sum_{j=1}^{m_i} \mathbb{E}\big(N^i_j\big) \\
&= \sqrt{ \frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\delta}\, \frac{\sum_{j=1}^{m_i} (\lambda_i v_j)^2}{\big(\lambda_i \sum_{j=1}^{m_i} v_j\big)^2} } \\
&= \sqrt{ \frac{1}{\lambda_i \sum_{j=1}^{m_i} v_j} + \frac{1}{\delta}\, \frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2} }, \tag{A.1}
\end{align*}
where we used the independence of the claims counts in the second equality. Next we show that
\[
\frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2} \;\longrightarrow\; 0 \qquad \text{for } m_i \to \infty .
\]
Since there exist $\varepsilon > 0$ and $M < \infty$ such that $0 < \varepsilon \le v_j \le M < \infty$, we have that
\[
0 \;\le\; \frac{\sum_{j=1}^{m_i} v_j^2}{\big(\sum_{j=1}^{m_i} v_j\big)^2} \;\le\; \frac{\sum_{j=1}^{m_i} M^2}{\big(\sum_{j=1}^{m_i} \varepsilon\big)^2} \;=\; \frac{m_i M^2}{m_i^2 \varepsilon^2} \;\longrightarrow\; 0 \qquad \text{for } m_i \to \infty .
\]
Together with $\sum_{j=1}^{m_i} v_j \to \infty$ as $m_i \to \infty$, we get that equation (A.1) converges to $0$.


Appendix B

Correlation Plot for Additional

Covariates

[Figure: correlation matrix plot for the covariates Horse Power, Price, Weight, Age and Policy Age, with colour scale from -1 to 1. The displayed pairwise correlations are 0.89 (Horse Power, Price), 0.63 (Horse Power, Weight), 0.59 (Price, Weight) and 0.58 (Age, Policy Age); all remaining pairs lie between -0.09 and -0.02.]

Figure B.1 – Correlations between original and additionally included covariates.


Appendix C

R Code

C.1 Variable Importance

# Variable importance of a single rpart tree: sum of the goodness-of-split
# improvements over all primary and surrogate splits per covariate.
varImp <- function(object, ...){
  tmp <- rownames(object$splits)
  allVars <- colnames(attributes(object$terms)$factors)
  if(is.null(tmp)){
    out <- NULL
    zeros <- data.frame(x = rep(0, length(allVars)),
                        Variable = allVars)
    out <- rbind(out, zeros)
  }else{
    rownames(object$splits) <- 1:nrow(object$splits)
    splits <- data.frame(object$splits)
    splits$var <- tmp
    splits$type <- ""
    frame <- as.data.frame(object$frame)
    index <- 0
    for(i in 1:nrow(frame)){
      if(frame$var[i] != "<leaf>"){
        index <- index + 1
        splits$type[index] <- "primary"
        if(frame$ncompete[i] > 0){
          for(j in 1:frame$ncompete[i]){
            index <- index + 1
            splits$type[index] <- "competing"
          }
        }
        if(frame$nsurrogate[i] > 0){
          for(j in 1:frame$nsurrogate[i]){
            index <- index + 1
            splits$type[index] <- "surrogate"
          }
        }
      }
    }
    splits$var <- factor(as.character(splits$var))
    splits <- subset(splits, type != "competing")
    out <- aggregate(splits$improve,
                     list(Variable = splits$var),
                     sum,
                     na.rm = TRUE)
    allVars <- colnames(attributes(object$terms)$factors)
    if(!all(allVars %in% out$Variable)){
      missingVars <- allVars[!(allVars %in% out$Variable)]
      zeros <- data.frame(x = rep(0, length(missingVars)),
                          Variable = missingVars)
      out <- rbind(out, zeros)
    }
  }
  out2 <- data.frame(Overall = out$x)
  or <- order(out2$Overall, decreasing = T)
  out3 <- data.frame(Overall = out2[or, ])
  rownames(out3) <- out$Variable[or]
  out3
}


C.2 Stratified Cross-Validation

data <- data[order(data$freq),]  # Order the data first. Then we do not need
                                 # to switch between different orderings.

tree <- rpart(cbind(jr, nc) ~ age + price, data = data, method = "poisson",
              parms = list(shrink = 3),
              control = rpart.control(minbucket = 4000, cp = 0.00005))

cv_customized <- function(K = 10, tree, data){
  stopifnot(is.unsorted(data$freq) == F)  # check if data is sorted
  set.seed(2)
  xgroup <- rep(1:K, length = nrow(data))
  xfit <- xpred.rpart(tree, xgroup)
  n_subtrees <- dim(tree$cptable)[1]
  std <- numeric(n_subtrees)
  err <- numeric(n_subtrees)
  err_group <- numeric(K)
  for(i in 1:n_subtrees){
    for(k in 1:K){
      ind_group <- which(xgroup == k)
      err_group[k] <- 2*sum(log((data[ind_group, "nc"]/
                                 (xfit[ind_group,i]*data[ind_group, "jr"]))^
                                data[ind_group, "nc"])-
                            (data[ind_group, "nc"]-xfit[ind_group,i]*
                             data[ind_group, "jr"]))
    }
    std[i] <- sd(err_group)/sqrt(K)
    err[i] <- mean(err_group)
  }
  return(list("err" = err, "std" = std))
}


C.3 Significance Test Pruning

test.prune <- function(tree, data.test){
  pred.test <- predict(tree, newdata = data.test)
  pred.class <- factor(pred.test)
  ind_leaves <- which(tree$frame$var == '<leaf>')
  leaf_nr <- as.numeric(rownames(tree$frame)[ind_leaves])

  # Identifying relevant leaves for pruning:
  left <- which(leaf_nr %% 2 == 0)
  leaf_nr_left <- leaf_nr[left]
  leaf_nr_left <- leaf_nr_left[which(leaf_nr[left+1]-leaf_nr[left] == 1)]
  leaf_nr_right <- leaf_nr_left+1
  bool_prune <- numeric(length(leaf_nr_right))
  for(i in 1:length(leaf_nr_left)){
    yval_left <- tree$frame$yval[which(rownames(tree$frame) ==
                                       as.character(leaf_nr_left[i]))]
    yval_right <- tree$frame$yval[which(rownames(tree$frame) ==
                                        as.character(leaf_nr_right[i]))]
    ind_left <- which(pred.test == yval_left)
    ind_right <- which(pred.test == yval_right)
    test <- wilcox.test(data.test$nc[ind_left],
                        data.test$nc[ind_right],
                        alternative = 'two.sided',
                        paired = FALSE, exact = T)
    bool_prune[i] <- test$p.value > 0.05
  }

  nodes_prune <- rownames(tree$frame)[which(rownames(tree$frame) %in%
                                            leaf_nr_left[which(bool_prune
                                                               == 1)])-1]
  tree_prune <- tree
  if(length(nodes_prune) >= 1){
    for(k in 1:length(nodes_prune)){
      tree_prune <- snip.rpart(tree_prune,
                               toss = as.numeric(nodes_prune[k]))
    }
  }
  tree <- tree_prune
  cat('Number of pruned leaves: ', sum(bool_prune), '\n')
  return(list('tree' = tree, 'n_prune' = sum(bool_prune)))
}

# Minimal Example for the use of the test.prune function,
# based on drawing bootstrap samples from the training data set.
tree <- rpart(cbind(jr, nc) ~ age + price, data = data, method = "poisson",
              parms = list(shrink = 3.5),
              control = rpart.control(minbucket = 1000, cp = 0.00001))
set.seed(222)
tree.stat <- tree
B <- 0
while(B < 10){
  data.test <- data[sample.int(n = nrow(data), size = nrow(data),
                               replace = T),]
  prune <- test.prune(tree.stat, data.test = data.test)
  tree.stat <- prune[[1]]
  if(prune[[2]] == 0) B <- B+1
  if(prune[[2]] > 0) B <- 0
}


C.4 Bagging

bagging <- function(formula, data, maxnodes, ntree, ...){
  require(rpart)
  splits <- maxnodes-1
  n <- nrow(data)
  trees <- list()
  for(b in 1:ntree){
    cat('Number of bagged Tree: ', b, '\n')
    b.ind <- sample.int(n, size = n, replace = T)
    data.b <- data[b.ind,]
    tmp <- rpart(formula, data = data.b, ...)
    cp <- tmp$cptable[tail(which(tmp$cptable[, "nsplit"] <= splits),
                           n=1), "CP"]
    cat('Number of Leaves: ',
        tmp$cptable[tail(which(tmp$cptable[, "nsplit"] <= splits),
                         n=1), "nsplit"]+1, '\n')
    trees[[b]] <- prune(tmp, cp = cp)
  }
  return(trees)
}


Appendix D

Variable Description

Variable Name   Type          Description
ac              Numerical     Claim amount
age             Numerical     Age of driver
beg             Date          Start date of record
bonus           Categorical   Bonus protection
car age         Numerical     Years since first licensing of car
car class       Categorical   Car classification
deductible      Numerical     Deductible of policy
end             Date          End date of record
kg              Numerical     Weight of car
km              Numerical     Kilometres yearly driven
leasing         Categorical   0: not leasing, 1: leasing
nat             Categorical   Nationality
nc              Numerical     Number of claims
office          Categorical   Regional office to which policy is assigned
plz             Categorical   Zip code of place of residence of driver
pol age         Numerical     Age of policy
pol             Categorical   Identifier for every policyholder at AXA
price           Numerical     Price of car
ps              Numerical     Horse power of car
region          Categorical   Region classification used by AXA
sex             Categorical   Sex of driver
yr              Numerical     Duration (years at risk)

Table D.1 – List of variables used in this thesis.


Bibliography

R. A. Berk. Statistical Learning from a Regression Perspective. Springer Series in Statistics. Springer

New York, 2008.

P. Bühlmann and M. Mächler. Lecture Notes on Computational Statistics. Seminar für Statistik, ETH Zürich, 2014. URL https://stat.ethz.ch/education/semesters/ss2015/CompStat/sk.pdf.

L. Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996.

L. Breiman. Random forests. Machine Learning, 45(1):5–32, 2001.

L. Breiman, J. H. Friedman, R. A. Olshen, and C. J. Stone. Classification and Regression Trees.

Wadsworth, 1984.

H. Bühlmann and A. Gisler. A Course in Credibility Theory and its Applications. Universitext.

Springer Berlin Heidelberg, 2005.

B. Efron. Bootstrap methods: Another look at the jackknife. Ann. Statist., 7(1):1–26, 1979.

A. Goldstein, A. Kapelner, J. Bleich, and E. Pitkin. Peeking inside the black box: Visualizing

statistical learning with plots of individual conditional expectation. 2014. URL http://arxiv.org/abs/1309.6392.

T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning. Springer Series in

Statistics. Springer New York, 2001.

G. James, D. Witten, T. Hastie, and R. Tibshirani. An Introduction to Statistical Learning: with

Applications in R. Springer Texts in Statistics. Springer New York, 2014.

R. Kohavi. A study of cross-validation and bootstrap for accuracy estimation and model selection.

Proceedings of the Fourteenth International Joint Conference on Artificial Intelligence, pages 1137–

1143, 1995.

M. LeBlanc and J. Crowley. Relative risk trees for censored survival data. Biometrics, 48(2):411–425,

1991.


A. Negassa, A. Ciampi, M. Abrahamowicz, S. Shapiro, and J.-F. Boivin. Tree-structured prognostic

classification for censored survival data: validation of computationally inexpensive model selection

criteria. Journal of Statistical Computation and Simulation, 67(4):289–317, 2000.

E. Ohlsson and B. Johansson. Non-Life Insurance Pricing with Generalized Linear Models. EAA

Series. Springer Berlin Heidelberg, 2010.

C. Shalizi. Lecture notes on data mining. 2009. URL http://www.stat.cmu.edu/~cshalizi/350.

T. Therneau, B. Atkinson, and B. Ripley. rpart: Recursive Partitioning and Regression Trees, 2015.

URL http://CRAN.R-project.org/package=rpart. R package version 4.1-9.

M. V. Wüthrich. Non-Life Insurance: Mathematics & Statistics. SSRN, 2015. URL http://ssrn.com/abstract=2319328.
