TRANSCRIPT
From Generalized Linear Models to Neural Networks, and Back
Mario V. Wuthrich
RiskLab, ETH Zurich
April 22, 2020
One World Actuarial Research Seminar
References
• From generalized linear models to neural networks, and back. SSRN Manuscript 3491790, March 2020.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3491790
B This topic originates from the seminal paper:
• Generalized linear models. Nelder, J.A., Wedderburn, R.W.M. (1972). Journal of the Royal Statistical Society, Series A (General) 135/3, 370-384.
B For more (historical) references: see our SSRN Manuscript.
The modeling cycle
(1) data collection, data cleaning and data pre-processing (≥ 80% of total time)
(2) selection of model class (data or algorithmic modeling culture, Breiman 2001)
(3) choice of objective function
(4) ’solving’ a (non-convex) optimization problem
(5) model validation
(6) possibly go back to (1)
B ’solving’ involves:
• choice of algorithm
• choice of stopping criterion, step size, etc.
• choice of seed (starting value)
loss function (view 2)
Car insurance frequency example
> str(freMTPL2freq)   # source: R package CASdatasets
'data.frame': 678013 obs. of 12 variables:
 $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
 $ ClaimNb   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
 $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ VehGas    : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
 $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 20 20 12 ...
[Figures: observed frequencies per regional group, per driver's age group (ages 18-88), and per car brand group (B1-B14); observed frequencies range between 0.00 and 0.35.]
Generalized linear models (GLMs)
• Determine from data D = {(Y1, x1), . . . , (Yn, xn)} an unknown regression function
  x ↦ µ(x) = E[Y].
• Selection of model class: Poisson GLM with canonical (log-)link:
  x ↦ µ_β^GLM(x) = exp〈β, x〉 = exp(β0 + Σ_j βj xj).
• Estimate the regression parameter β by maximum likelihood, i.e. find β^MLE by minimizing the corresponding deviance loss (objective function)
  β ↦ L_D(β).
Example: car insurance Poisson frequencies
After pre-processing the covariates x:
  Model                      # param.   in-sample loss (in 10⁻²)   out-of-sample loss (in 10⁻²)
  homogeneous (µ ≡ const.)       1             32.935                      33.861
  GLM (Poisson)                 48             31.257                      32.149

Note for low-frequency examples of, say, 5%: in the true model we have L_D ≈ 30.3 · 10⁻².
• This convex optimization problem has a unique optimal solution.
• The solution satisfies the balance property (under the canonical link choice)
  Σ_{i=1}^n Yi = Σ_{i=1}^n exp〈β^MLE, xi〉.
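The fit and the balance property can be sketched in a few lines. This is a minimal illustration on simulated toy data (not the freMTPL2freq example): the Poisson MLE under the canonical log-link is computed by Newton-Raphson, and at the optimum the fitted frequencies sum exactly to the observed claim counts.

```python
# Toy sketch: Poisson GLM with canonical log-link, fitted by Newton-Raphson.
# At the MLE the score X'(Y - mu) vanishes; with an intercept column this
# implies the balance property sum(Y) = sum(mu).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([-2.0, 0.3, -0.2])
Y = rng.poisson(np.exp(X @ beta_true))                      # simulated claim counts

beta = np.zeros(3)
for _ in range(50):                        # Newton-Raphson for the Poisson MLE
    mu = np.exp(X @ beta)
    grad = X.T @ (Y - mu)                  # score of the Poisson log-likelihood
    hess = X.T @ (X * mu[:, None])         # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

mu_hat = np.exp(X @ beta)
print(abs(Y.sum() - mu_hat.sum()))  # ~0: balance property holds at the MLE
```

Minimizing the Poisson deviance loss and maximizing the Poisson log-likelihood give the same β^MLE, so the same balance property holds for the deviance-loss formulation used on the slides.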
From GLMs to neural networks
• Example of a GLM (with log-link ⇒ exponential output activation):
  x ↦ µ_β^GLM(x) = exp〈β, x〉.
• Choose a network of depth d ∈ N with network parameter θ = (θ_{1:d}, θ_{d+1}):
  x ↦ µ_θ^NN(x) = exp〈θ_{d+1}, z〉,
  with neural network function (covariate pre-processing x ↦ z)
  x ↦ z = z_{θ_{1:d}}^{(d:1)}(x) = (z^{(d)} ◦ · · · ◦ z^{(1)})(x).
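A minimal numpy sketch of this composition (the tanh hidden activation is an assumption; the slides do not fix one): the network first maps the covariates x to a learned representation z and then applies the same exponential output as the GLM. With zero hidden layers the function collapses to the GLM itself.

```python
# Sketch of mu_theta^NN(x) = exp<theta_{d+1}, (z^(d) o ... o z^(1))(x)>,
# with assumed tanh hidden layers and an exponential output activation.
import numpy as np

rng = np.random.default_rng(1)

def nn_frequency(x, hidden_layers, theta_out):
    """Compose the layers z^(1), ..., z^(d), then apply the exponential output."""
    z = x
    for W, b in hidden_layers:
        z = np.tanh(z @ W + b)                        # one hidden layer z^(m)
    return np.exp(theta_out[0] + z @ theta_out[1:])   # exp<theta_{d+1}, z>

x = rng.normal(size=(4, 5))                 # 4 policies, 5 pre-processed covariates
layers = [(rng.normal(size=(5, 8)), np.zeros(8)),
          (rng.normal(size=(8, 8)), np.zeros(8))]     # depth d = 2
theta_out = rng.normal(size=9)              # intercept + 8 output weights
mu = nn_frequency(x, layers, theta_out)     # positive expected frequencies

# Depth 0 recovers the GLM: the covariates are passed through unchanged.
beta = rng.normal(size=6)
mu_glm = nn_frequency(x, [], beta)
```

This makes the slide's point concrete: the GLM is the special case of the network in which the covariate pre-processing x ↦ z is the identity.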
Neural network with embeddings
• Network of depth d ∈ N with network parameter θ:
  x ↦ µ_θ^NN(x) = exp〈θ_{d+1}, z〉 = exp〈θ_{d+1}, (z^{(d)} ◦ · · · ◦ z^{(1)})(x)〉.
[Figure: network architecture with input nodes Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrEmb, VehGas, Density and RegionEmb (the VehBrand and Region factor levels entering via embedding layers), hidden layers, and output node ClaimNb.]
• Gradient descent method (GDM) provides θ w.r.t. deviance loss θ ↦ L_D(θ).
• Exercise early stopping of GDM because MLE over-fits (in-sample).
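The embedding layers mentioned above replace one-hot dummy coding of a categorical covariate by a learned low-dimensional vector per level. A minimal sketch (the integer codes and matrix values are illustrative, not fitted):

```python
# Illustrative embedding layer: each level of a categorical covariate
# (e.g. the 22 Region levels) maps to a trainable 2-dim vector.
import numpy as np

rng = np.random.default_rng(2)
n_levels, emb_dim = 22, 2                  # Region has 22 levels; 2-dim embedding
E = rng.normal(size=(n_levels, emb_dim))   # embedding matrix, learned during GDM

region_idx = np.array([0, 17, 17, 2, 14])  # integer-coded Region of 5 policies
z_region = E[region_idx]                   # table lookup replaces 22 dummy columns

print(z_region.shape)  # (5, 2)
```

Policies in the same region share the same embedding vector, and similar levels can end up close together in the embedding space, which is what the 2-dimensional embedding plots later in the talk display.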
Remarks on the neural network approach
+ Use embedding layers for categorical variables.
+ Typically, the neural network outperforms the GLM approach in terms of out-of-sample prediction accuracy.
− Resulting prices are not unique, but depend on seeds.
− The neural network does not build on improving the GLM.
− The neural network fails to have the balance property.
Combined Actuarial Neural Network: part I
• Choose regression function with parameter (β, θ)
x 7→ µCANN(β,θ) (x) = exp
{〈β,x〉+
⟨θd+1,
(z(d) ◦ · · · ◦ z(1)
)(x)⟩}.
[Figure: network architecture as before, augmented by a GLM skip connection from the input nodes directly to the output node ClaimNb.]
• GDM provides (β, θ) w.r.t. deviance loss (β, θ) ↦ L_D(β, θ).
Combined Actuarial Neural Network: part II
• Choose regression function with parameter (β, θ)
  µ_{(β,θ)}^CANN(x) = exp{〈β, x〉 + 〈θ_{d+1}, (z^{(d)} ◦ · · · ◦ z^{(1)})(x)〉}.
[Figure: network architecture as before, augmented by a GLM skip connection from the input nodes directly to the output node ClaimNb.]
• GDM provides (β, θ) w.r.t. deviance loss (β, θ) ↦ L_D(β, θ).
• Initialize the gradient descent algorithm with β^MLE and θ_{d+1} = 0!
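This initialization guarantees that at epoch 0 the CANN reproduces the GLM exactly, so gradient descent can only improve on the GLM (in-sample). A toy sketch, with illustrative (not fitted) weights:

```python
# Sketch of the CANN initialization: set the skip-connection weights to
# beta^MLE and the output weights theta_{d+1} = 0, so that before any
# gradient descent step the CANN equals the GLM.
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 100, 4, 8
x = rng.normal(size=(n, p))                # pre-processed covariates
beta_mle = rng.normal(size=p)              # stands in for the fitted GLM

W, b = rng.normal(size=(p, q)), np.zeros(q)
z = np.tanh(x @ W + b)                     # network representation z(x)
theta_out = np.zeros(q)                    # initialization theta_{d+1} = 0

mu_cann = np.exp(x @ beta_mle + z @ theta_out)  # CANN at epoch 0
mu_glm = np.exp(x @ beta_mle)                   # GLM prediction
```

Since z @ theta_out vanishes identically at initialization, mu_cann and mu_glm coincide, and the network part only learns structure the GLM has missed.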
Combined Actuarial Neural Network
[Figure: out-of-sample loss (in 10⁻², range 31.80 to 32.10) over 1000 gradient descent epochs, two panels "DrivAge − VehBrand" and "VehAge − VehGas". Possible GDM results of the CANN approach.]
CANN example: car insurance frequencies
  Model                      # param.    in-sample loss (in 10⁻²)   out-of-sample loss (in 10⁻²)
  homogeneous (µ ≡ const.)       1              32.935                      33.861
  GLM (Poisson)                 48              31.257                      32.149
  CANN (2-dim. embeddings)     792 (+48)        30.476                      31.566

Note for low-frequency examples of, say, 5%: in the true model we have L_D ≈ 30.3 · 10⁻².
[Figures: the learned 2-dimensional embeddings of VehBrand (levels B1-B14) and Region (levels R11-R94), each factor level plotted as a point in the 2-dimensional embedding space.]
Failure of balance property
[Figure: "balance property over 50 SGD calibrations", box plot of average frequencies (range roughly 0.096 to 0.108).]
• Box plot of 50 gradient descent calibrations
• Cyan line: balance property
• Magenta line: average of 50 gradient descent calibrations
• Balance property fails to hold.
Regularization step for the balance property
• Apply an additional GLM step on the learned representation
  x ↦ z = z_{θ_{1:d}}^{(d:1)}(x) = (z^{(d)} ◦ · · · ◦ z^{(1)})(x),
  keeping the offset 〈β^MLE, x〉 and the learned representation z fixed, ...
• ... that is, calculate the MLE θ_{d+1}^MLE of θ_{d+1} from the regression function
  z = z(x) ↦ exp{〈β^MLE, x〉 + 〈θ_{d+1}, z〉}.
• Regularization step is important, in particular, when there is a class imbalance!
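This extra step can be sketched as follows: treat 〈β^MLE, x〉 as a fixed offset, treat the learned features z as a fixed design matrix, and re-estimate θ_{d+1} by Poisson MLE. A toy illustration (simulated data; the tanh features stand in for the learned representation, and an intercept column is included so that the canonical-link balance property applies):

```python
# Sketch of the additional GLM step: Poisson MLE of theta_{d+1} with the GLM
# part <beta^MLE, x> as a fixed offset and the representation z held fixed.
# With a canonical log-link (and an intercept), this restores the balance
# property that plain gradient descent training fails to deliver.
import numpy as np

rng = np.random.default_rng(4)
n, q = 2000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_mle = np.array([-2.0, 0.2, -0.1])        # stands in for the fitted GLM
offset = X @ beta_mle                         # fixed offset <beta^MLE, x>
Z = np.column_stack([np.ones(n),              # intercept column, plus ...
                     np.tanh(X @ rng.normal(size=(3, q)))])  # ... fixed features z
Y = rng.poisson(np.exp(offset))               # observed claim counts

theta = np.zeros(q + 1)
for _ in range(50):                           # Newton steps for the Poisson MLE
    mu = np.exp(offset + Z @ theta)
    grad = Z.T @ (Y - mu)
    theta = theta + np.linalg.solve(Z.T @ (Z * mu[:, None]), grad)

mu_hat = np.exp(offset + Z @ theta)
print(abs(Y.sum() - mu_hat.sum()))  # ~0: balance property restored
```

The intercept column in the design makes Σ(Yi − µ̂i) = 0 one of the score equations, which is exactly the balance property the box plot above shows gradient descent alone does not achieve.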
Summary
• A GLM is a special case of a neural network.
• Neural networks do covariate pre-processing themselves.
• ‘Sufficiently good’ network regression models are not unique.
• Embedding layers for categorical covariates may help improve modeling.
• CANN builds the model around a (generalized) linear function.
• An additional GLM step allows us to comply with the balance property.
• CANN allows us to identify missing structure in GLMs (more) explicitly.
Thank you!