Modelling the expected loss of bodily injury claims using
gradient boosting
The following is an extract from a report on modelling the expected loss of bodily
injury claims using gradient boosting.
Summary
➢ Modelling the expected loss:
o Frequency model (classification)*
o Severity model (regression)
➢ The expected loss is given by multiplying the frequency and severity under the
assumption that they are independent**.
➢ Not so fast:
o The frequency model needs to take time into account
o Calculate a hazard rate
➢ Emphasis is on predictive accuracy
* Classification was used due to the nature of the policy and data – there could only be a claim (1) or no claim (0)
over the period examined. It was therefore a case of modelling the probability of a claim at the individual
policyholder level over the time period.
**There are important technical considerations here which are beyond the scope of this initial project. Various
approaches can and need to be considered when modelling the aggregate expected loss, and these approaches
can affect the modelling of the frequency and severity models themselves – for example, scenario analysis
requires specification of the functional form of the frequency and severity models. One suggested approach is to
use a Poisson frequency and Gamma-distributed severity for individual claims, so that the expected loss follows a
Tweedie compound Poisson distribution. See (Yang, Qian, Zou, 2014).
by Gregg Barrett
Overview of the modelling effort for the bodily injury claims data
A critical challenge in insurance is setting the premium for the policyholder. In a competitive market
an insurer needs to accurately price the expected loss of the policyholder. Failing to do so places the
insurer at risk of adverse selection. Adverse selection is where the insurer loses profitable policies
and retains loss-incurring policies, resulting in economic loss. In personal car insurance, for example,
this could occur if the insurer charged the same premium for old and young drivers. If the expected
loss for old drivers were significantly lower than that of the young drivers, the old drivers could be
expected to switch to a competitor, leaving the insurer with a portfolio of under-priced young drivers
and incurring an economic loss.
In this project we attempt to accurately predict the expected loss for the policyholder in respect of a
bodily injury claim. Doing so requires breaking the process down into two distinct components: claim
frequency and claim severity. For convenience and simplicity, we have chosen to model the frequency
and severity separately.
Other inputs into the premium setting (rating process) such as administrative costs, loadings, cost of
capital etc. have been omitted as we are only concerned with modelling the expected loss.
In modelling the claim frequency, a classification model will be used to estimate the probability of a
bodily injury claim given a set of features that cover mostly vehicle characteristics. The actual claim
frequency for the dataset used in this project is around 1%. In modelling the claim severity, a
regression model will be used to estimate the expected claim amount, again using a set of features
that cover mostly vehicle characteristics.
To ensure that the estimated performance of the model, as measured on the test sample, is an
accurate approximation of the expected performance on future ‘‘unseen’’ cases, the inception dates
of the policies in the test set are later than those of the policies used to train the model.
The dataset, which covers the period from 2005 through 2007, was therefore split into three groups:
2005 through 2006 – Training set
2005 through 2006 – Validation set
2007 – Test set
An adjustment to the output of the claim frequency model is necessary in order to derive a probability
on an annual basis, because the claim frequency is calculated over a period of two years
(2005 through 2006). For this project we assumed an exponential hazard function and adjusted the
claim frequency as follows:
P(T) = 1 - exp(-λT) where:
P(T) = the annual probability of a claim
T = 1/2
λ = the probability of a claim predicted by the claim frequency model
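A minimal sketch of this adjustment in Python (the function name is illustrative, not from the report):

```python
import math

def annualise_claim_probability(p_two_year, T=0.5):
    """Annualise the two-year claim probability from the frequency model,
    assuming a constant (exponential) hazard: P(T) = 1 - exp(-lambda * T),
    with lambda taken as the model's predicted two-year probability."""
    return 1.0 - math.exp(-p_two_year * T)

# A predicted two-year claim probability of 2% becomes roughly 1% annually:
print(round(annualise_claim_probability(0.02), 6))  # → 0.00995
```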
In this project model validation is measured by the degree of predictive accuracy, and this objective is
emphasized over producing interpretable models. The lack of interpretability of most algorithmic
models appears to be one reason their application to insurance pricing problems has been very
limited so far. (Guelman, 2011)
In modelling the claim frequency, a ROC (Receiver Operating Characteristic) curve will be used to
assess model performance, measuring the AUC (Area Under the Curve). In modelling the claim severity,
the RMSE (Root Mean Squared Error) will be used to assess model performance.
The test data was not used for model selection purposes, but purely to assess the generalization error
of the final chosen model. Assessing this error is broken down into three components:
1) Assessing the performance of the classification model on the test data using the AUC score.
2) Assessing the performance of the regression model on the test data using the RMSE.
3) Assessing the performance in predicting the expected loss by comparing the predicted expected
loss for the 2007 portfolio of policyholders against the realised loss for the 2007 portfolio of
policyholders.
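Both test-set metrics can be computed directly. An illustrative Python sketch (not the project's actual evaluation code; this AUC uses the rank-sum formulation and ignores tie handling):

```python
import numpy as np

def auc_score(y_true, y_score):
    """AUC via the Mann-Whitney rank-sum identity: the probability that a
    randomly chosen claim (1) is scored above a randomly chosen non-claim (0)."""
    y_true, y_score = np.asarray(y_true), np.asarray(y_score)
    ranks = np.empty(len(y_score))
    ranks[np.argsort(y_score)] = np.arange(1, len(y_score) + 1)
    n_pos = y_true.sum()
    n_neg = len(y_true) - n_pos
    return (ranks[y_true == 1].sum() - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)

def rmse(y_true, y_pred):
    """Root mean squared error for the severity model."""
    return float(np.sqrt(np.mean((np.asarray(y_true) - np.asarray(y_pred)) ** 2)))
```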
Gradient Boosting, often referred to simply as “boosting”, was selected as the modelling approach for
this project. Boosting is a general approach that can be applied to many statistical learning methods
for regression or classification. Boosting is a supervised, non-parametric machine learning approach.
(Geurts, Irrthum, Wehenkel, 2009). Supervised learning refers to the subset of machine learning
methods that derive models in the form of input-output relationships. More precisely, the goal of
supervised learning is to identify a mapping from some input variables to some output variables on
the sole basis of a given sample of joint observations of the values of these variables. Non-parametric
means that we do not make explicit assumptions about the functional form of f; the intent is to find
a function f such that Y ≈ f(X) for any observation (X, Y).
With boosting methods, optimisation is carried out in function space. That is, we parameterise the
function estimate f in the additive functional form:
f(x) = f_0(x) + Σ_{i=1}^{M} f_i(x)
In this representation:
f_0 is the initial guess
M is the number of iterations
(f_i), i = 1, …, M, are the function increments, also referred to as the “boosts”
It is useful to distinguish between the parameterisation of the “base-learner” functions, from which
the overall ensemble function estimate f(x) is built, and the choice of the “loss function” that the
ensemble is fitted to minimise. Boosted models can be implemented with different base-learner
functions. Common base-learners include linear models, smooth models, decision trees, and custom
base-learner functions. Several classes of base-learner can be combined in one boosted model,
meaning that the same functional formula can include both smooth additive components and decision
tree components at the same time. (Natekin, Knoll, 2013)
Loss functions can be classified according to the type of outcome variable, Y. For regression problems,
Gaussian (minimizing squared error), Laplace (minimizing absolute error), and Huber losses are
candidates, while Bernoulli or AdaBoost losses are candidates for classification. There are also loss
functions for survival models and count data. This flexibility makes boosting highly customizable to
any particular data-driven task. It introduces a lot of freedom into the model design, making the
choice of the most appropriate loss function a matter of trial and error. (Natekin, Knoll, 2013)
To provide an intuitive explanation, we will use an example of boosting in the context of decision
trees, as used in this project. Unlike fitting a single large decision tree to the training data, which
amounts to fitting the data hard and potentially overfitting, the boosting approach learns slowly.
Given an initial model (decision tree), we fit a decision tree (the base-learner) to the residuals from
the initial model. That is, we fit a tree using the current residuals rather than the outcome Y. We
then add this new decision tree into the fitted function in order to update the residuals. The process
is conducted sequentially so that at each iteration, a new weak base-learner model is trained with
respect to the error of the whole ensemble learnt so far. With such an approach the model structure
is learned from the data rather than predetermined, thereby avoiding an explicit model specification
and incorporating complex, higher-order interactions to reduce potential modelling bias. (Yang, Qian,
Zou, 2014)
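The residual-fitting loop described above can be sketched compactly. The following Python toy (one synthetic feature, stump base-learners) illustrates the mechanics only; the project itself used R boosting packages:

```python
import numpy as np

def fit_stump(x, r):
    """Find the single split on x that best fits the residuals r in
    least-squares terms; returns (threshold, left_mean, right_mean)."""
    best = None
    for t in np.unique(x)[:-1]:
        left, right = r[x <= t], r[x > t]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, t, left.mean(), right.mean())
    return best[1:]

def boost(x, y, n_trees=200, shrinkage=0.1):
    """Least-squares boosting: each stump is fitted to the current
    residuals and added to the ensemble, scaled by the learning rate."""
    pred = np.full(len(y), y.mean())        # f_0: the initial guess
    stumps = []
    for _ in range(n_trees):
        r = y - pred                        # residuals of the ensemble so far
        t, lm, rm = fit_stump(x, r)
        pred = pred + shrinkage * np.where(x <= t, lm, rm)  # slow update
        stumps.append((t, lm, rm))
    return pred, stumps

rng = np.random.default_rng(0)
x = rng.uniform(0, 1, 200)
y = np.sin(4 * x) + rng.normal(0, 0.1, 200)
pred, stumps = boost(x, y)  # training error shrinks gradually per iteration
```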
The first choice in building the model involves selecting an appropriate loss function. Squared-error
loss was selected to define prediction error for the severity model, and Bernoulli deviance was
selected for the frequency model. There are three tuning parameters that need to be set:
- Shrinkage (the learning rate)
- Number of trees (the number of iterations)
- Depth (interaction depth)
The shrinkage parameter sets the learning rate of the base-learner models. In general, statistical
learning approaches that learn slowly tend to perform well. In boosting, the construction of each tree
depends on the trees that have already been grown. Typical values are 0.01 or 0.001, and the right
choice can depend on the problem. (Ridgeway, 2012)
It is important to know that smaller values of shrinkage (almost) always give improved predictive
performance. However, there are computational costs, both storage and CPU time, associated with
setting shrinkage to be low. The model with shrinkage = 0.001 will likely require ten times as many
trees as the model with shrinkage = 0.01, increasing storage and computation time by a factor of 10.
It is generally the case that for small shrinkage parameters, 0.001 for example, there is a fairly long
plateau in which predictive performance is at its best. A recommended rule of thumb is to set
shrinkage as small as possible while still being able to fit the model in a reasonable amount of time
and storage. (Ridgeway, 2012)
Boosting can overfit if the number of trees is too large, although this overfitting tends to occur slowly
if at all. (James, Witten, Hastie, Tibshirani, 2013). Cross-validation and information criteria can be
used to select the number of trees. It is worth stressing that the optimal number of trees and the
shrinkage (learning rate) depend on each other, although slower learning rates do not simply scale
the number of optimal trees. That is, shrinkage = 0.1 with an optimal number of trees of 100 does not
necessarily imply that when shrinkage = 0.01 the optimal number of trees is 1,000. (Ridgeway, 2012)
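The interplay between shrinkage and the stopping iteration can be illustrated with a small, self-contained Python toy that tracks validation RMSE at every boosting iteration and reads off the minimising tree count; it stands in for the cross-validation facilities of the R packages used in the project:

```python
import numpy as np

def stump(x, t, lm, rm):
    """A single-split prediction: left mean at or below t, right mean above."""
    return np.where(x <= t, lm, rm)

def boost_with_validation(x_tr, y_tr, x_va, y_va, n_trees=300, shrinkage=0.05):
    """Least-squares stump boosting that records validation RMSE at every
    iteration, so the best stopping point can be chosen afterwards."""
    pred_tr = np.full(len(y_tr), y_tr.mean())
    pred_va = np.full(len(y_va), y_tr.mean())
    val_rmse = []
    for _ in range(n_trees):
        r = y_tr - pred_tr
        # choose the split threshold minimising residual sum of squares
        sse_t = []
        for t in np.unique(x_tr)[:-1]:
            left, right = r[x_tr <= t], r[x_tr > t]
            sse_t.append((((left - left.mean()) ** 2).sum()
                          + ((right - right.mean()) ** 2).sum(), t))
        t = min(sse_t)[1]
        lm, rm = r[x_tr <= t].mean(), r[x_tr > t].mean()
        pred_tr = pred_tr + shrinkage * stump(x_tr, t, lm, rm)
        pred_va = pred_va + shrinkage * stump(x_va, t, lm, rm)
        val_rmse.append(float(np.sqrt(np.mean((y_va - pred_va) ** 2))))
    return int(np.argmin(val_rmse)) + 1, val_rmse

rng = np.random.default_rng(1)
x = rng.uniform(0, 1, 300)
y = np.sin(4 * x) + rng.normal(0, 0.2, 300)
best_m, curve = boost_with_validation(x[:200], y[:200], x[200:], y[200:])
```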
Depth sets the number of splits in each tree, which controls the complexity of the boosted ensemble.
When depth = 1, each tree is a stump consisting of a single split. In this case the boosted ensemble
fits an additive model, since each term involves only a single variable. More generally, depth is the
interaction depth and controls the interaction order of the boosted model, since d splits can involve
at most d variables. (James, Witten, Hastie, Tibshirani, 2013)
A strength of tree-based methods is that single-depth trees are readily understandable and
interpretable. In addition, decision trees have the ability to select or rank attributes according to
their relevance for predicting the output, a feature shared with almost no other non-parametric
method. (Geurts, Irrthum, Wehenkel, 2009). From the point of view of their statistical properties,
tree-based methods are non-parametric universal approximators, meaning that, with sufficient
complexity, a tree can represent any continuous function with arbitrarily high precision. When used
with numerical attributes, they are invariant with respect to monotone transformations of the input
attributes. (Geurts, Irrthum, Wehenkel, 2009)
Importantly, boosted decision trees require very little data pre-processing, which can easily be one of
the most time-consuming activities in a project of this nature. Because boosted decision trees handle
predictor and response variables of any type without the need for transformation, and are insensitive
to outliers and missing values, they are a natural choice not only for this project but for insurance in
general, where there are frequently a large number of categorical and numerical predictors,
non-linearities, complex interactions, and missing values that all need to be modelled. Lastly, the
techniques used in this project can be applied independently of the limitations imposed by any
specific legislation.
Potential Improvements
Below are several suggestions for improving the initial model.
Specification
A careful specification of the loss function leads to the estimation of any desired characteristic of the
conditional distribution of the response. This, coupled with the large number of base-learners,
guarantees a rich set of models that can be addressed by boosting. (Hofner, Mayr, Robinzonovz,
Schmid, 2014)
AUC loss function for the classification model
For the classification model AUC can be tested as a loss function to optimize the area under the ROC
curve.
Huber loss function for the regression model
The Huber loss function can be used as a robust alternative to the L2 (least squares error) loss:
ρ(y, f) = (1/2)(y − f)², if |y − f| ≤ δ
ρ(y, f) = δ(|y − f| − δ/2), if |y − f| > δ
where:
ρ is the loss function
δ is the parameter that limits the outliers which are subject to absolute error loss
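A direct Python transcription of the standard piecewise Huber form (illustrative only):

```python
import numpy as np

def huber_loss(y, f, delta=1.0):
    """Huber loss: quadratic for residuals within delta, linear beyond it,
    which caps the influence of outliers on the fit."""
    r = np.abs(np.asarray(y, dtype=float) - np.asarray(f, dtype=float))
    return np.where(r <= delta, 0.5 * r ** 2, delta * (r - 0.5 * delta))
```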
Quantile loss function for the regression model:
Another alternative for settings with a continuous response is modelling conditional quantiles
through quantile regression (Koenker 2005). The main advantage of quantile regression (beyond its
robustness towards outliers) is that it does not rely on any distributional assumptions about the
response or the error terms. (Hofner, Mayr, Robinzonovz, Schmid, 2014)
Laplace loss function for the regression model:
The Laplace loss function is the function of choice if we are interested in the median of the
conditional distribution. It implements a distribution-free median regression approach, especially
useful for long-tailed error distributions.
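The link between the two alternatives above can be made concrete: the quantile (“pinball”) loss at τ = 0.5 is, up to a factor of one half, exactly the Laplace absolute-error loss. An illustrative implementation:

```python
import numpy as np

def pinball_loss(y, f, tau):
    """Quantile (pinball) loss: weights under- and over-prediction
    asymmetrically; tau = 0.5 gives half the absolute error (median regression)."""
    r = np.asarray(y, dtype=float) - np.asarray(f, dtype=float)
    return np.where(r >= 0, tau * r, (tau - 1.0) * r)
```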
The loss function allows flexible specification of the link between the response and the covariates.
(Figures omitted: the left-hand figure illustrates the L2 loss, the right-hand figure the L1 (least
absolute deviation) loss.)
All of the above listed loss functions can be implemented within the “mboost” package.
(Table omitted: an overview of the currently implemented families in mboost.)
Optimal number of iterations using AIC:
To maximise predictive power and to prevent overfitting, it is important that the optimal stopping
iteration is carefully chosen. Various possibilities to determine the stopping iteration exist. AIC was
considered; however, this is usually not recommended, as AIC-based stopping tends to overshoot the
optimal stopping iteration dramatically. (Hofner, Mayr, Robinzonovz, Schmid, 2014)
Package xgboost
The package “xgboost” was also tested during this project. Its benefit over the gbm and mboost
packages is that it is purportedly faster. It should be noted that xgboost requires the data to be in the
form of numeric vectors, and thus necessitates some additional data preparation. It was also found to
be somewhat more challenging to implement than the gbm and mboost packages.
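Because xgboost accepts only numeric input, categorical predictors must first be expanded into indicator (one-hot) columns. A minimal sketch of that preparation step (pure Python/numpy; the feature name and values are invented for illustration):

```python
import numpy as np

def one_hot(column):
    """Expand a categorical column into 0/1 indicator columns,
    one per level, in sorted level order."""
    levels = sorted(set(column))
    matrix = np.array([[1.0 if v == lvl else 0.0 for lvl in levels] for v in column])
    return levels, matrix

# e.g. a hypothetical "vehicle_fuel" feature
levels, X = one_hot(["petrol", "diesel", "petrol", "electric"])
print(levels)  # → ['diesel', 'electric', 'petrol']
print(X[0])    # → [0. 0. 1.]
```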
References
Geurts, P., Irrthum, A., Wehenkel, L. (2009). Supervised learning with decision tree-based methods in computational and systems biology. [pdf]. Retrieved from http://www.montefiore.ulg.ac.be/~geurts/Papers/geurts09-molecularbiosystems.pdf
Guelman, L. (2011). Gradient boosting trees for auto insurance loss cost modeling and prediction. [pdf]. Retrieved from http://www.sciencedirect.com/science/article/pii/S0957417411013674
Hofner, B., Mayr, A., Robinzonovz, N., Schmid, M. (2014). Model-based boosting in R: a hands-on tutorial using the R package mboost. [pdf]. Retrieved from https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf
Hofner, B., Mayr, A., Robinzonovz, N., Schmid, M. (2014). An overview on the currently implemented families in mboost. [table]. In: Model-based boosting in R: a hands-on tutorial using the R package mboost. Retrieved from https://cran.r-project.org/web/packages/mboost/vignettes/mboost_tutorial.pdf
James, G., Witten, D., Hastie, T., Tibshirani, R. (2013). Introduction to statistical learning with applications in R. [ebook]. Retrieved from http://www-bcf.usc.edu/~gareth/ISL/getbook.html
Natekin, A., Knoll, A. (2013). Gradient boosting machines tutorial. [pdf]. Retrieved from http://www.ncbi.nlm.nih.gov/pmc/articles/PMC3885826/
Ridgeway, G. (2012). Generalized boosted models: a guide to the gbm package. [pdf]. Retrieved from https://cran.r-project.org/web/packages/gbm/gbm.pdf
Yang, Y., Qian, W., Zou, H. (2014). A boosted nonparametric Tweedie model for insurance premium. [pdf]. Retrieved from https://people.rit.edu/wxqsma/papers/paper4