TRANSCRIPT
From Generalized Linear Models to Neural Networks, and Back
Mario V. Wuthrich
RiskLab, ETH Zurich
April 22, 2020
One World Actuarial Research Seminar
References
• From generalized linear models to neural networks, and back. SSRN Manuscript 3491790, March 2020.
https://papers.ssrn.com/sol3/papers.cfm?abstract_id=3491790
B This topic originates from the seminal paper:
• Generalized linear models. Nelder, J.A., Wedderburn, R.W.M. (1972). Journal of the Royal Statistical Society, Series A (General) 135/3, 370-384.
B For more (historical) references: see our SSRN Manuscript.
The modeling cycle
(1) data collection, data cleaning and data pre-processing (≥ 80% of total time)
(2) selection of model class (data or algorithmic modeling culture, Breiman 2001)
(3) choice of objective function
(4) ’solving’ a (non-convex) optimization problem
(5) model validation
(6) possibly go back to (1)
B ’solving’ involves:
• choice of algorithm
• choice of stopping criterion, step size, etc.
• choice of seed (starting value)
loss function (view 2)
Car insurance frequency example
> str(freMTPL2freq)   # source: R package CASdatasets
'data.frame': 678013 obs. of 12 variables:
 $ IDpol     : num  1 3 5 10 11 13 15 17 18 21 ...
 $ ClaimNb   : num  1 1 1 1 1 1 1 1 1 1 ...
 $ Exposure  : num  0.1 0.77 0.75 0.09 0.84 0.52 0.45 0.27 0.71 0.15 ...
 $ Area      : Factor w/ 6 levels "A","B","C","D",..: 4 4 2 2 2 5 5 3 3 2 ...
 $ VehPower  : int  5 5 6 7 7 6 6 7 7 7 ...
 $ VehAge    : int  0 0 2 0 0 2 2 0 0 0 ...
 $ DrivAge   : int  55 55 52 46 46 38 38 33 33 41 ...
 $ BonusMalus: int  50 50 50 50 50 50 50 68 68 50 ...
 $ VehBrand  : Factor w/ 11 levels "B1","B10","B11",..: 4 4 4 4 4 4 4 4 4 4 ...
 $ VehGas    : Factor w/ 2 levels "Diesel","Regular": 2 2 1 1 1 2 2 1 1 1 ...
 $ Density   : int  1217 1217 54 76 76 3003 3003 137 137 60 ...
 $ Region    : Factor w/ 22 levels "R11","R21","R22",..: 18 18 3 15 15 8 8 20 20 12 ...
[Figures: observed frequencies per regional group, per driver's age group (ages 18-88), and per car brand group (B1-B14); observed frequencies range between 0.00 and 0.35.]
Generalized linear models (GLMs)
• Determine from data D = {(Y1, x1), . . . , (Yn, xn)} an unknown regression function
  x ↦ µ(x) = E[Y].
• Selection of model class: Poisson GLM with canonical (log-)link:
  x ↦ µ_β^GLM(x) = exp〈β, x〉 = exp(β0 + Σ_j βj xj).
• Estimate the regression parameter β by maximum likelihood, i.e. find β^MLE by minimizing the corresponding deviance loss (objective function)
  β ↦ L_D(β).
Example: car insurance Poisson frequencies
After pre-processing the covariates x:
  Model                      # param.   in-sample loss (in 10⁻²)   out-of-sample loss (in 10⁻²)
  homogeneous (µ ≡ const.)       1             32.935                      33.861
  GLM (Poisson)                 48             31.257                      32.149

Note for low-frequency examples of, say, 5%: in the true model we have L_D ≈ 30.3 · 10⁻².
• This convex optimization problem has a unique optimal solution.
• The solution satisfies the balance property (under the canonical link choice)
  Σ_{i=1}^n Yi = Σ_{i=1}^n exp〈β^MLE, xi〉.
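The fit and the balance property can be sketched in a few lines. This is a minimal illustration on simulated toy data (not the freMTPL2freq example): the Poisson MLE under the canonical log-link is computed by Newton-Raphson, and at the optimum the fitted frequencies sum exactly to the observed claim counts.

```python
# Toy sketch: Poisson GLM with canonical log-link, fitted by Newton-Raphson.
# At the MLE the score X'(Y - mu) vanishes; with an intercept column this
# implies the balance property sum(Y) = sum(mu).
import numpy as np

rng = np.random.default_rng(0)
n = 5000
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])  # intercept + 2 covariates
beta_true = np.array([-2.0, 0.3, -0.2])
Y = rng.poisson(np.exp(X @ beta_true))                      # simulated claim counts

beta = np.zeros(3)
for _ in range(50):                        # Newton-Raphson for the Poisson MLE
    mu = np.exp(X @ beta)
    grad = X.T @ (Y - mu)                  # score of the Poisson log-likelihood
    hess = X.T @ (X * mu[:, None])         # Fisher information
    beta = beta + np.linalg.solve(hess, grad)

mu_hat = np.exp(X @ beta)
print(abs(Y.sum() - mu_hat.sum()))  # ~0: balance property holds at the MLE
```

Minimizing the Poisson deviance loss and maximizing the Poisson log-likelihood give the same β^MLE, so the same balance property holds for the deviance-loss formulation used on the slides.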
From GLMs to neural networks
• Example of a GLM (with log-link ⇒ exponential output activation):
  x ↦ µ_β^GLM(x) = exp〈β, x〉.
• Choose a network of depth d ∈ N with network parameter θ = (θ_{1:d}, θ_{d+1}):
  x ↦ µ_θ^NN(x) = exp〈θ_{d+1}, z〉,
  with neural network function (covariate pre-processing x ↦ z)
  x ↦ z = z_{θ_{1:d}}^{(d:1)}(x) = (z^{(d)} ◦ · · · ◦ z^{(1)})(x).
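A minimal numpy sketch of this composition (the tanh hidden activation is an assumption; the slides do not fix one): the network first maps the covariates x to a learned representation z and then applies the same exponential output as the GLM. With zero hidden layers the function collapses to the GLM itself.

```python
# Sketch of mu_theta^NN(x) = exp<theta_{d+1}, (z^(d) o ... o z^(1))(x)>,
# with assumed tanh hidden layers and an exponential output activation.
import numpy as np

rng = np.random.default_rng(1)

def nn_frequency(x, hidden_layers, theta_out):
    """Compose the layers z^(1), ..., z^(d), then apply the exponential output."""
    z = x
    for W, b in hidden_layers:
        z = np.tanh(z @ W + b)                        # one hidden layer z^(m)
    return np.exp(theta_out[0] + z @ theta_out[1:])   # exp<theta_{d+1}, z>

x = rng.normal(size=(4, 5))                 # 4 policies, 5 pre-processed covariates
layers = [(rng.normal(size=(5, 8)), np.zeros(8)),
          (rng.normal(size=(8, 8)), np.zeros(8))]     # depth d = 2
theta_out = rng.normal(size=9)              # intercept + 8 output weights
mu = nn_frequency(x, layers, theta_out)     # positive expected frequencies

# Depth 0 recovers the GLM: the covariates are passed through unchanged.
beta = rng.normal(size=6)
mu_glm = nn_frequency(x, [], beta)
```

This makes the slide's point concrete: the GLM is the special case of the network in which the covariate pre-processing x ↦ z is the identity.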
Neural network with embeddings
• Network of depth d ∈ N with network parameter θ:
  x ↦ µ_θ^NN(x) = exp〈θ_{d+1}, z〉 = exp〈θ_{d+1}, (z^{(d)} ◦ · · · ◦ z^{(1)})(x)〉.
[Figure: network architecture with input nodes Area, VehPower, VehAge, DrivAge, BonusMalus, VehBrEmb, VehGas, Density and RegionEmb (the VehBrand and Region factor levels entering via embedding layers), hidden layers, and output node ClaimNb.]
• Gradient descent method (GDM) provides θ w.r.t. deviance loss θ ↦ L_D(θ).
• Exercise early stopping of GDM because MLE over-fits (in-sample).
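The embedding layers mentioned above replace one-hot dummy coding of a categorical covariate by a learned low-dimensional vector per level. A minimal sketch (the integer codes and matrix values are illustrative, not fitted):

```python
# Illustrative embedding layer: each level of a categorical covariate
# (e.g. the 22 Region levels) maps to a trainable 2-dim vector.
import numpy as np

rng = np.random.default_rng(2)
n_levels, emb_dim = 22, 2                  # Region has 22 levels; 2-dim embedding
E = rng.normal(size=(n_levels, emb_dim))   # embedding matrix, learned during GDM

region_idx = np.array([0, 17, 17, 2, 14])  # integer-coded Region of 5 policies
z_region = E[region_idx]                   # table lookup replaces 22 dummy columns

print(z_region.shape)  # (5, 2)
```

Policies in the same region share the same embedding vector, and similar levels can end up close together in the embedding space, which is what the 2-dimensional embedding plots later in the talk display.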
Remarks on the neural network approach
+ Use embedding layers for categorical variables.
+ Typically, the neural network outperforms the GLM approach in terms of out-of-sample prediction accuracy.
− Resulting prices are not unique, but depend on seeds.
− The neural network does not build on improving the GLM.
− The neural network fails to have the balance property.
Combined Actuarial Neural Network: part I
• Choose regression function with parameter (β, θ)
x 7→ µCANN(β,θ) (x) = exp
{〈β,x〉+
⟨θd+1,
(z(d) ◦ · · · ◦ z(1)
)(x)⟩}.
[Figure: network architecture as before, augmented by a GLM skip connection from the input nodes directly to the output node ClaimNb.]
• GDM provides (β, θ) w.r.t. deviance loss (β, θ) ↦ L_D(β, θ).
Combined Actuarial Neural Network: part II
• Choose regression function with parameter (β, θ)
  µ_{(β,θ)}^CANN(x) = exp{〈β, x〉 + 〈θ_{d+1}, (z^{(d)} ◦ · · · ◦ z^{(1)})(x)〉}.
[Figure: network architecture as before, augmented by a GLM skip connection from the input nodes directly to the output node ClaimNb.]
• GDM provides (β, θ) w.r.t. deviance loss (β, θ) ↦ L_D(β, θ).
• Initialize the gradient descent algorithm with β^MLE and θ_{d+1} = 0!
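This initialization guarantees that at epoch 0 the CANN reproduces the GLM exactly, so gradient descent can only improve on the GLM (in-sample). A toy sketch, with illustrative (not fitted) weights:

```python
# Sketch of the CANN initialization: set the skip-connection weights to
# beta^MLE and the output weights theta_{d+1} = 0, so that before any
# gradient descent step the CANN equals the GLM.
import numpy as np

rng = np.random.default_rng(3)
n, p, q = 100, 4, 8
x = rng.normal(size=(n, p))                # pre-processed covariates
beta_mle = rng.normal(size=p)              # stands in for the fitted GLM

W, b = rng.normal(size=(p, q)), np.zeros(q)
z = np.tanh(x @ W + b)                     # network representation z(x)
theta_out = np.zeros(q)                    # initialization theta_{d+1} = 0

mu_cann = np.exp(x @ beta_mle + z @ theta_out)  # CANN at epoch 0
mu_glm = np.exp(x @ beta_mle)                   # GLM prediction
```

Since z @ theta_out vanishes identically at initialization, mu_cann and mu_glm coincide, and the network part only learns structure the GLM has missed.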
Combined Actuarial Neural Network
[Figure: out-of-sample loss (in 10⁻², range 31.80 to 32.10) over 1000 gradient descent epochs, two panels "DrivAge − VehBrand" and "VehAge − VehGas". Possible GDM results of the CANN approach.]
CANN example: car insurance frequencies
  Model                      # param.    in-sample loss (in 10⁻²)   out-of-sample loss (in 10⁻²)
  homogeneous (µ ≡ const.)       1              32.935                      33.861
  GLM (Poisson)                 48              31.257                      32.149
  CANN (2-dim. embeddings)     792 (+48)        30.476                      31.566

Note for low-frequency examples of, say, 5%: in the true model we have L_D ≈ 30.3 · 10⁻².
[Figures: the learned 2-dimensional embeddings of VehBrand (levels B1-B14) and Region (levels R11-R94), each factor level plotted as a point in the 2-dimensional embedding space.]
Failure of balance property
[Figure: "balance property over 50 SGD calibrations", box plot of average frequencies (range roughly 0.096 to 0.108).]
• Box plot of 50 gradient descent calibrations
• Cyan line: balance property
• Magenta line: average of 50 gradient descent calibrations
• Balance property fails to hold.
Regularization step for the balance property
• Apply an additional GLM step on the learned representation
  x ↦ z = z_{θ_{1:d}}^{(d:1)}(x) = (z^{(d)} ◦ · · · ◦ z^{(1)})(x),
  keeping the offset 〈β^MLE, x〉 and the learned representation z fixed, ...
• ... that is, calculate the MLE θ_{d+1}^MLE of θ_{d+1} from the regression function
  z = z(x) ↦ exp{〈β^MLE, x〉 + 〈θ_{d+1}, z〉}.
• Regularization step is important, in particular, when there is a class imbalance!
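This extra step can be sketched as follows: treat 〈β^MLE, x〉 as a fixed offset, treat the learned features z as a fixed design matrix, and re-estimate θ_{d+1} by Poisson MLE. A toy illustration (simulated data; the tanh features stand in for the learned representation, and an intercept column is included so that the canonical-link balance property applies):

```python
# Sketch of the additional GLM step: Poisson MLE of theta_{d+1} with the GLM
# part <beta^MLE, x> as a fixed offset and the representation z held fixed.
# With a canonical log-link (and an intercept), this restores the balance
# property that plain gradient descent training fails to deliver.
import numpy as np

rng = np.random.default_rng(4)
n, q = 2000, 4
X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
beta_mle = np.array([-2.0, 0.2, -0.1])        # stands in for the fitted GLM
offset = X @ beta_mle                         # fixed offset <beta^MLE, x>
Z = np.column_stack([np.ones(n),              # intercept column, plus ...
                     np.tanh(X @ rng.normal(size=(3, q)))])  # ... fixed features z
Y = rng.poisson(np.exp(offset))               # observed claim counts

theta = np.zeros(q + 1)
for _ in range(50):                           # Newton steps for the Poisson MLE
    mu = np.exp(offset + Z @ theta)
    grad = Z.T @ (Y - mu)
    theta = theta + np.linalg.solve(Z.T @ (Z * mu[:, None]), grad)

mu_hat = np.exp(offset + Z @ theta)
print(abs(Y.sum() - mu_hat.sum()))  # ~0: balance property restored
```

The intercept column in the design makes Σ(Yi − µ̂i) = 0 one of the score equations, which is exactly the balance property the box plot above shows gradient descent alone does not achieve.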
Summary
• A GLM is a special case of a neural network.
• Neural networks do covariate pre-processing themselves.
• ‘Sufficiently good’ network regression models are not unique.
• Embedding layers for categorical covariates may help improve modeling.
• CANN builds the model around a (generalized) linear function.
• An additional GLM step allows us to comply with the balance property.
• CANN allows us to identify missing structure in GLMs (more) explicitly.
Thank you!