Modelling with smooth functions
Simon Wood, University of Bath, EPSRC funded
Some data. . .
[Figure: daily respiratory deaths, temperature and ozone in Chicago (NMMAPS), each plotted against time.]
I Apparently daily respiratory deaths are ‘significantly’ negatively correlated with ozone and temperature.
I But obviously there is confounding here, so we need a model.
A model. . .
I deathi ∼ Poi(µi),
log(µi) = α + f1(ti) + f2(o3i, tmpi) + f3(pm10i)
I The fj are smooth functions to estimate.
1. α + f1 is the (log) background respiratory mortality rate.
2. f2 is the modification by ozone and temperature (interacting).
3. f3 is the modification from particulates.
I Actually the predictors are aggregated over the 3 days preceding death.
I Fit in R (mgcv package)
gam(death ~ s(time,k=200) + te(o3,tmp) + s(pm10), family=poisson)
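I A runnable version of this fit, as a sketch: it assumes the NMMAPS Chicago data shipped as chicago in the gamair package, whose column names (death, time, o3median, tmpd, pm10median) stand in for the slide’s variables, and it omits the 3-day aggregation.
library(mgcv)
library(gamair)
data(chicago)   ## note: 'death' here is all-cause, not respiratory-only
b <- gam(death ~ s(time, k = 200) + te(o3median, tmpd) + s(pm10median),
         data = chicago, family = poisson)
plot(b, pages = 1, scheme = 2)  ## the estimated smooth terms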
Air pollution model estimates. . .
[Figure: estimated smooth terms s(time,99.75), te(o3,tmp,31.6) and s(pm10,3.09); the number after each term is its effective degrees of freedom.]
I High ozone and temperature are associated with increased risk.
Some more data. . .
[Figure: near infra-red spectra, log(1/R) against wavelength (nm), labelled by octane rating (83.4 to 88.9).]
I Can we predict the octane rating of fuel from its near infra-red spectrum?
I 60 octane/spectrum pairs available (from the pls package).
I Try a ‘signal regression’ model. ki is the ith spectrum, f is a smooth function. . .
octanei = ∫ f(ν)ki(ν) dν + εi
Signal regression model fit. . .
I If each row of matrix NIR contains a spectral intensity, and each row of nm the corresponding wavelengths, then gam(octane ~ s(nm,by=NIR)) estimates the model (runnable sketch below).
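I As a sketch, using the gasoline data from the pls package (its wavelength grid runs 900–1700 nm in 2 nm steps); k = 40 is an arbitrary basis dimension choice.
library(mgcv)
library(pls)
data(gasoline)
octane <- gasoline$octane
NIR <- as.matrix(gasoline$NIR)  ## 60 spectra, one per row
## matrix of wavelengths, same shape as NIR, one row per spectrum
nm <- matrix(seq(900, 1700, by = 2), nrow(NIR), ncol(NIR), byrow = TRUE)
b <- gam(octane ~ s(nm, by = NIR, k = 40))  ## the signal regression fit
plot(b)                                ## estimated f(nu)
plot(fitted(b), octane); abline(0, 1)  ## fitted vs. observed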
I Plots of estimated f(ν) and fitted vs. data:
[Figure: estimated s(nm,9.16):NIR against wavelength, and octane against predicted octane.]
And one more dataset. . .
[Figure: pairs plot of retinopathy status (ret), body mass index (bmi), % glycosylated hemoglobin (gly) and disease duration (dur).]
I How does the probability of diabetic retinopathy relate to previous disease duration, body mass index and % glycosylated hemoglobin?
I Wisconsin study (see gss package).
I Model: reti ∼ Bernoulli,
logit{E(ret)} = f1(dur) + f2(bmi) + f3(gly) + f4(dur,bmi) + f5(dur,gly) + f6(gly,bmi)
Retinopathy model estimates
gam(ret ~ s(dur) + s(gly) + s(bmi) + ti(dur,gly) + ti(dur,bmi) + ti(gly,bmi), family=binomial)
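I The same fit as a runnable sketch, assuming the Wisconsin data are the wesdr data frame from the gss package (columns dur, gly, bmi and a 0/1 ret).
library(mgcv)
library(gss)
data(wesdr)
b <- gam(ret ~ s(dur) + s(gly) + s(bmi) + ti(dur, gly) +
           ti(dur, bmi) + ti(gly, bmi),
         family = binomial, data = wesdr)
summary(b)  ## edf and p-values for each term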
[Figure: estimated terms s(dur,3.26), s(gly,1), s(bmi,2.67), te(dur,gly,0), te(dur,bmi,0) and te(gly,bmi,2.5).]
Additive smooth models: structured flexibility
I The preceding are examples of generalized additive models (GAMs).
I A GAM is a GLM in which the linear predictor depends linearly on unknown smooth functions of covariates. . .
yi ∼ EF(µi, φ), g(µi) = f1(x1i) + f2(x2i) + · · ·
I Hastie and Tibshirani (1990) and Wahba (1990) laid the foundations for these models as discussed here.
I If we can work with GAMs at all, we can easily generalize:
g(µi) = Aiθ + ∑j Lij fj(xj) + Zib
— the Lij are linear functionals; A and Z are model matrices; θ and b are parameters and Gaussian random effects.
Additive smooth models & practical computation
I Making GAMs work in practice requires at least 3 things. . .
1. A way of representing the smooth functions fj.
2. A way of estimating the fj.
3. Some means of deciding how smooth the fj should be.
I 3 is the awkward part of the enterprise, and strongly influences 1 and 2.
I As well as point estimates we also need further inferential tools, such as interval estimates, model selection methods, AIC, p-values etc.
I We’d also like to go beyond the univariate exponential family.
I This talk will cover these things in order.
Representing smooth functions: splines
I To motivate how to represent several smooth terms in a model, first consider a simpler smoothing problem.
I Consider estimating the smooth function f in the model yi = f(xi) + εi from xi, yi data using smoothing splines. . .
[Figure: wear against size, with the fitted smoothing spline shown as a red curve.]
I The red curve is the function minimizing
∑i (yi − f(xi))2 + λ ∫ f′′(x)2 dx.
Splines and the smoothing parameter
I Smoothing parameter λ controls the stiffness of the spline.
[Figure: four smoothing spline fits to the wear/size data for different values of λ.]
I But the spline can be written f(x) = ∑i βibi(x), where the basis functions bi(x) do not depend on λ.
General spline theory background
I Consider a Hilbert space of real valued functions, f, on some domain τ (e.g. [0,1]).
I It is a reproducing kernel Hilbert space, H, if evaluation is bounded, i.e. ∃ M s.t. |f(t)| ≤ M‖f‖.
I Then the Riesz representation theorem says that there is a function Rt ∈ H s.t. f(t) = ⟨Rt, f⟩.
I Now consider Rt(u) as a function of t: R(t,u). Then
⟨Rt, Rs⟩ = R(t,s)
— so R(t,s) is known as the reproducing kernel of H.
I Actually, to every positive definite function R(t,s) there corresponds a unique RKHS.
Spline smoothing problem
I RKHSs are quite useful for constructing smooth models. To see why, consider finding f to minimize
∑i {yi − f(ti)}2 + λ ∫ f′′(t)2 dt.
I Let H have ⟨f,g⟩ = ∫ f′′(t)g′′(t) dt.
I Let H0 denote the RKHS of functions for which ∫ f′′(t)2 dt = 0, with basis φ1(t) = 1, φ2(t) = t, say.
I The spline problem then seeks f ∈ H0 ⊕ H to minimize
∑i {yi − f(ti)}2 + λ‖Pf‖2,
where P is the orthogonal projection onto H.
Spline smoothing solution
I f(t) = ∑i ciRti(t) + ∑j djφj(t), with i = 1, . . . , n and j = 1, . . . , M, is the basis representation of f. Why?
I Suppose the minimizer were f̃ = f + η, where η ∈ H and η ⊥ f:
1. η(ti) = ⟨Rti, η⟩ = 0, so the fit to the data is unchanged;
2. ‖Pf̃‖2 = ‖Pf‖2 + ‖η‖2, which is minimized when η = 0.
I . . . obviously this argument is rather general.
I So if Eij = ⟨Rti, Rtj⟩ and Tij = φj(ti), then we seek c and d to minimize
‖y − Td − Ec‖2 + λcTEc
. . . straightforward to compute (but at O(n3) cost).
Reduced rank smoothers
I We can obtain an efficient reduced rank basis (and penalty) by
1. using a spline basis for a ‘representative’ subset of the data, or
2. using Lanczos methods to find a low order ‘eigenbasis’.
I In either case we end up representing the smoother as
f(x) = ∑j βjbj(x)
— basis functions bj(x) known; coefficients β not.
I The corresponding smoothing penalty is βTSβ, where S is known.
I So yi = f(xi) + εi becomes y = Xβ + ε, where Xij = bj(xi) and
β̂ = argminβ ‖y − Xβ‖2 + λβTSβ
(a direct sketch of this appears below).
I Examples follow the asymptotic justification of rank reduction.
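I A minimal sketch of the basis-penalty representation ‘by hand’: mgcv’s smoothCon exposes X and S for a smooth term, and for a fixed λ (arbitrarily chosen here; its estimation comes later in the talk) β̂ is a single linear solve.
library(mgcv)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200) * 0.3
sm <- smoothCon(s(x, k = 20), data = data.frame(x = x))[[1]]
X <- sm$X        ## model matrix: X[i,j] = b_j(x_i)
S <- sm$S[[1]]   ## penalty matrix: penalty is beta' S beta
lambda <- 2
beta <- solve(t(X) %*% X + lambda * S, t(X) %*% y)
plot(x, y, col = "grey")
lines(sort(x), (X %*% beta)[order(x)])  ## the fitted smooth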
What rank?
I Consider f̂, a rank k cubic regression spline (i.e. λ = 0), parameterized in terms of function values at evenly spaced ‘knots’.
I From basic regression theory, the average sampling standard error of f̂ is O(√(k/n)).
I From approximation theory for cubic splines, the asymptotic bias of f̂ is bounded by O(k−4).
I So if we let k = O(n1/9) we minimize the asymptotic MSE, at O(n−8/9).
I Actually in practice we would want to penalize, and then k = O(n1/5) is appropriate.
I The point is that asymptotically k ≪ n is appropriate.
P-splines: B-spline basis & approx penalty
[Figure: B-spline basis functions, and three example fits with second order difference penalty ∑i (βi−1 − 2βi + βi+1)2 equal to 207, 16.4 and 0.04.]
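I The two P-spline ingredients are easy to construct directly; a sketch (the knot spacing and basis size here are arbitrary illustrative choices).
library(splines)
x <- seq(0, 1, length = 200)
knots <- seq(-0.15, 1.15, by = 0.05)       ## evenly spaced knots
B <- splineDesign(knots, x)                ## cubic B-spline basis
D <- diff(diag(ncol(B)), differences = 2)  ## each row gives a second difference of beta
S <- t(D) %*% D  ## so beta' S beta = sum_i (beta_{i-1} - 2 beta_i + beta_{i+1})^2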
Example: Thin plate splines
I One way of generalizing splines from 1D to several dimensions is to turn the flexible strip into a flexible sheet: f minimizing e.g.
∑i {yi − f(xi,zi)}2 + λ ∫ fxx2 + 2fxz2 + fzz2 dx dz.
I This results in a thin plate spline. It is an isotropic smooth.
I Isotropy may be appropriate when different covariates are naturally on the same scale.
[Figure: perspective plots of thin plate spline fits (linear predictor over x and z).]
Cheaper thin plate splines
I RKHS or similar theory gives an explicit basis (for any dimension), with coefficients (c and d) minimizing
‖y − Td − Ec‖2 + λcTEc s.t. TTc = 0.
I We can reduce the O(n3) cost to O(k3) by replacing E with its rank k truncated eigen-decomposition (computed by Lanczos methods).
I Cheap and somewhat optimal. . .
[Figure: the truth, a rank 16 thin plate regression spline fit (t.p.r.s. 16 d.f.) and a 16 knot basis fit.]
Non-isotropic tensor product smooths
I A different construction is needed if covariates have different scales.
I Start with a marginal spline fz and let its coefficients vary smoothly with x.
[Figure: the marginal smooth f(z), and its coefficients varying smoothly with x to build up f(x,z).]
The complete tensor product smooth
I Let each coefficient of fz be a spline of x.
I The construction is symmetric (see the figure).
[Figure: the resulting tensor product smooth f(x,z), built up from either margin.]
Tensor product penalties - one per dimension
I x-wiggliness: sum marginal x penalties over the red curves.
I z-wiggliness: sum marginal z penalties over the green curves.
[Figure: f(x,z) with the red and green curves over which the marginal penalties are summed.]
Tensor product expressions
I Suppose the basis expansions for smoothing w.r.t. x and z marginally are ∑j Bjbj(z) and ∑i αiai(x).
I . . . and the marginal penalties are BTSzB and αTSxα.
I The tensor product basis construction gives:
f(x,z) = ∑i ∑j βijbj(z)ai(x)
I With two penalties (requiring 2 smoothing parameters):
J∗z(f) = βT(II ⊗ Sz)β and J∗x(f) = βT(Sx ⊗ IJ)β.
I The construction generalizes to any number of marginals and multi-dimensional marginals (sketch below).
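I In mgcv the isotropic and tensor product constructions are s(x,z) and te(x,z) respectively; a small illustrative sketch on simulated data (names hypothetical).
library(mgcv)
set.seed(2)
dat <- data.frame(x = runif(300), z = runif(300))
dat$y <- sin(3 * dat$x) * exp(-dat$z) + rnorm(300) * 0.1
b_iso <- gam(y ~ s(x, z), data = dat)   ## isotropic thin plate spline
b_ten <- gam(y ~ te(x, z), data = dat)  ## tensor product: one smoothing parameter per margin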
Isotropic vs. tensor product comparison
[Figure: Isotropic Thin Plate Spline and Tensor Product Spline fits, top and bottom rows.]
. . . each figure smooths the same data. The only modification is that x has been divided by 5 in the bottom row.
Smooth model components
I A rich range of basis-penalty smoothers is possible. . .
[Figure: six example smooths (a–f), including s(times,10.9), a 2-D smooth f(x,z) with its contours, a random effect smooth s(region,63.04), and spatial smooths over lon and lat.]
I When combining several smooths in one model, we often need an identifiability constraint ∑i f(xi) = 0. Re-write this as 1TXβ = 0 and absorb it by reparameterization, as in the sketch below.
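I A minimal sketch of absorbing the constraint (toy X, illustrative names; mgcv does this internally): find a matrix Z whose columns span the null space of C = 1TX, then work with XZ, since any β = Zb then satisfies the constraint.
set.seed(3)
X <- matrix(rnorm(50), 10, 5)  ## toy basis model matrix
C <- matrix(colSums(X), 1)     ## the constraint C beta = 0, C = 1'X
Z <- qr.Q(qr(t(C)), complete = TRUE)[, -1, drop = FALSE]  ## null space basis
XZ <- X %*% Z                  ## constrained model matrix
max(abs(colSums(XZ)))          ## ~0: the constraint holds for any b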
Representing a GAM
I Consider the GAM yi ∼ EF(µi, φ), g(µi) = β0 + ∑j fj(xji).
I Each fj(x) has a basis-penalty representation, say Xjβj, βjTSjβj (with constraints absorbed).
I So the GAM becomes
g(µ) = Xβ,
where X = [1 : X1 : X2 : . . .] and βT = [β0, β1T, β2T, . . .].
I The penalty is now
∑j λjβTSjβ,
where Sj is such that βTSjβ ≡ βjTSjβj (i.e. the original penalty matrix padded with zeroes).
I We can easily include parametric components and linear functionals of smooths.
β̂ given λ
I Let l be the model log-likelihood and use penalized MLE:
L(β) = l(β) − (1/2)∑j λjβTSjβ, β̂ = argmaxβ L(β).
I Optimize by Newton’s method (penalized IRLS in the EF case; a sketch follows below).
I Bayes motivation. Prior: π(β) = N(0, (∑λjSj)−); likelihood, π(y|β): as is. Then the MAP estimate is β̂.
I Further, in the large sample limit, if I is the information matrix at β̂,
β|y ∼ N(β̂, (I + ∑λjSj)−1),
which can be used to construct well calibrated CIs.
I Note the link to Gaussian random effects!
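I A penalized IRLS sketch for a Poisson model with log link and a fixed λ (purely illustrative; mgcv’s internals differ):
library(mgcv)
set.seed(4)
x <- runif(300)
y <- rpois(300, exp(sin(2 * pi * x)))
sm <- smoothCon(s(x, k = 15), data = data.frame(x = x), absorb.cons = TRUE)[[1]]
X <- cbind(1, sm$X)                 ## intercept plus smooth basis
S <- rbind(0, cbind(0, sm$S[[1]]))  ## pad penalty with zeroes for the intercept
lambda <- 1
beta <- rep(0, ncol(X))
repeat {
  eta <- drop(X %*% beta); mu <- exp(eta)
  z <- eta + (y - mu) / mu          ## working response
  w <- mu                           ## working weights (Poisson, log link)
  beta_new <- solve(t(X) %*% (w * X) + lambda * S, t(X) %*% (w * z))
  if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
  beta <- beta_new
}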
AIC, effective degrees of freedom, p-values
I Following through the derivation of AIC in the penalized case yields
AIC = −2l(β̂) + 2tr(F),
where F = (I + ∑λjSj)−1I.
I Better still, take F = VβI, where Vβ is an approximate posterior covariance matrix for β, corrected for λ uncertainty (see Greven and Kneib, 2010, Biometrika, for why).
I tr(F) is the effective degrees of freedom of the model.
I p-values for testing smooth terms or variance components for equality to zero can also be obtained (see Wood, 2013a,b, Biometrika).
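I In mgcv these are available through the usual accessors; assuming a fitted gam object b, as in the earlier sketches:
AIC(b)      ## AIC based on the effective degrees of freedom
sum(b$edf)  ## total effective degrees of freedom
summary(b)  ## per-term edf and p-values for the smooth terms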
I But we still need smoothing parameter (λ) estimates.
Smoothing parameter estimates
I Let ρ = logλ and π(ρ) = constant; then the marginal likelihood is
π(ρ|y) ∝ ∫ π(y|β)π(β|ρ) dβ.
I Empirical Bayes: ρ̂ = argmaxρ π(ρ|y).
I . . . the same as REML in the Gaussian random effects context.
I In practice use a Laplace approximation for the integral:
log π(ρ|y) ≃ L(β̂) + (1/2) log|∑λjSj|+ − (1/2) log|H| + c,
where H is the Hessian of −L at β̂. The log determinants require numerical care.
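I In mgcv this is requested via method = "REML"; a small sketch on simulated data (illustrative names):
library(mgcv)
set.seed(5)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200) * 0.3
b <- gam(y ~ s(x), method = "REML")  ## or method = "ML"
b$sp  ## the estimated smoothing parameter(s)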
How marginal likelihood works
[Figure: draws from the prior implied by λ, plotted against the data, for three cases: ‘λ too low, prior variance too high’, ‘λ and prior variance about right’ and ‘λ too high, prior variance too low’.]
I Draw β from the prior implied by λ. Find the average value of the likelihood for these draws.
I Choose λ to maximize this average likelihood.
I i.e. formally, maximize ∫ π(y|β)π(β|λ) dβ w.r.t. λ.
I Cross validation is an alternative.
Numerical fitting strategy
I Optimize REML w.r.t. ρ = log(λ) by Newton’s method. For each trial ρ. . .
1. Re-parameterize β so that the log|S|+ computation is stable.
2. Find β̂ by an inner Newton optimization of L(β).
3. Use implicit differentiation to find the first two derivatives of β̂ w.r.t. ρ.
4. Compute the log determinant terms and their derivatives.
5. Hence compute the REML score and its first two derivatives, as required for the next Newton step.
I For large datasets an alternative is often possible: find β̂ by iterative fitting of working linear models, and estimate ρ for the working model at each iteration step (see the sketch below).
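I mgcv’s bam() function is built around related large-data strategies (its exact algorithm differs from the sketch above); for example:
library(mgcv)
set.seed(6)
n <- 1e5
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n)
b <- bam(y ~ s(x), data = dat, discrete = TRUE)  ## discretized covariates for speed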
Where’s the exponential family assumption?
I . . . the single parameter independent EF assumption of GAMs is barely used.
I It just simplifies some of the numerical computations (and adds to numerical robustness).
I Actually we can use most of the apparatus just described with almost any regular likelihood, provided it’s practical to compute with, and the first three or four derivatives w.r.t. the model coefficients can also be computed. . .
Example: predicting prostate status
[Figure: protein mass spectra (Intensity against Daltons) for Healthy, Enlarged and Cancer groups.]
I Model: ordered category (benign/enlarged/cancer) predicted by a logistic latent random variable with mean
µi = ∫ f(D)νi(D) dD, where νi(D) is the ith spectrum.
gam(ds ~ s(D,by=K), family=ocat(R=3))
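I A self-contained ocat sketch on simulated data (the prostate spectra themselves are not reproduced here): a smooth of x shifts a logistic latent variable, which is cut into 3 ordered categories.
library(mgcv)
set.seed(7)
x <- runif(500)
latent <- 3 * sin(2 * pi * x) + rlogis(500)
ds <- cut(latent, c(-Inf, -1, 1, Inf), labels = FALSE)  ## classes 1, 2, 3
b <- gam(ds ~ s(x), family = ocat(R = 3))
plot(b)  ## the estimated smooth effect on the latent scale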
[Figure: (a) estimated f(D) against Daltons; (b) results by ordered category; (c) QQ plot of deviance residuals against theoretical quantiles.]
Fuel efficiency of cars
[Figure: pairs plot of city.mpg, hw.mpg, width, height, weight and hp for the cars data.]
Multivariate additive model
I Correlated bivariate normal response (hw.mpg, city.mpg).
I Component means are given by smooth additive predictors. The best model is very simple (and somewhat unexpected); a sketch of such a fit follows.
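I A sketch of a bivariate additive fit, assuming the cars data are available as mpg in the gamair package (an assumption) with columns hw.mpg, city.mpg, weight and hp; mgcv’s mvn family takes one formula per response dimension.
library(mgcv)
library(gamair)
data(mpg)
b <- gam(list(hw.mpg ~ s(weight) + s(hp),
              city.mpg ~ s(weight) + s(hp)),
         family = mvn(d = 2), data = mpg)
plot(b, pages = 1)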
[Figure: estimated terms for the two linear predictors: (a) s(weight,6.71), (b) s(hp,4.1), (d) s.1(weight,5.71), (e) s.1(hp,5.69), plus (c) and (f) Gaussian quantile plots of effects.]
Scale location models
I We can additively model the mean and the variance (and the skew, and. . . )
I Simple example: yi ∼ N(µi, σi),
µi = ∑j fj(xji), logσi = ∑j gj(zji).
I Here is a simple 1-D smoothing example of this, with a sketch of the fitting call below. . .
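I A sketch using mgcv’s gaulss family and the mcycle data from MASS (assumed here to be the plotted example): the first formula models the mean, the second the scale.
library(mgcv)
library(MASS)
b <- gam(list(accel ~ s(times, k = 40), ~ s(times)),
         data = mcycle, family = gaulss())
plot(b, pages = 1)  ## estimated mean smooth and (log) scale smooth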
[Figure: accel against times with the fitted mean, and the estimated σ against times.]
R packages
There are many alternative R packages available:
1. gam for the original backfitting approach¹.
2. vgam for vector GAMs and more¹.
3. gamlss for GAMs for location, scale and shape².
4. mboost for GAMs via boosting.
5. gss for Smoothing Spline ANOVA.
6. scam for shape constrained additive models.
7. gamm4 for GAMMs using lme4.
8. bayesx for MCMC and likelihood based GAMs³.
9. The methods discussed here are in the R recommended package mgcv. . .

¹ No smoothing parameter selection.
² Limited smoothing parameter selection.
³ See also mgcv::jagam.
mgcv package in R