Modelling with smooth functions
Simon Wood, University of Bath, EPSRC funded
Some data. . .
[Figure: daily respiratory deaths, temperature and ozone in Chicago (NMMAPS), each plotted against time.]
I Apparently daily respiratory deaths are ‘significantly’ negatively correlated with ozone and temperature.
I But obviously there is confounding here, so we need a model.
A model. . .
I deathi ∼ Poi(µi),
log(µi) = α + f1(ti) + f2(o3i, tmpi) + f3(pm10i)
I The fj are smooth functions to estimate.
1. α + f1 is the (log) background respiratory mortality rate.
2. f2 is the modification by ozone and temperature (interacting).
3. f3 is the modification from particulates.
I Actually the predictors are aggregated over the 3 days preceding death.
I Fit in R (mgcv package)
gam(death ~ s(time,k=200) + te(o3,tmp) + s(pm10), family=poisson)
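I A runnable version of this fit, as a sketch: it assumes the NMMAPS Chicago data shipped as chicago in the gamair package, whose column names (death, time, o3median, tmpd, pm10median) stand in for the slide’s variables, and it omits the 3-day aggregation.
library(mgcv)
library(gamair)
data(chicago)   ## note: 'death' here is all-cause, not respiratory-only
b <- gam(death ~ s(time, k = 200) + te(o3median, tmpd) + s(pm10median),
         data = chicago, family = poisson)
plot(b, pages = 1, scheme = 2)  ## the estimated smooth terms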
Air pollution model estimates. . .
[Figure: estimated smooth terms s(time,99.75), te(o3,tmp,31.6) and s(pm10,3.09); the number after each term is its effective degrees of freedom.]
I High ozone and temperature are associated with increased risk.
Some more data. . .
[Figure: near infra-red spectra, log(1/R) against wavelength (nm), labelled by octane rating (83.4 to 88.9).]
I Can we predict the octane rating of fuel from its near infra-red spectrum?
I 60 octane/spectrum pairs available (from the pls package).
I Try a ‘signal regression’ model. ki is the ith spectrum, f is a smooth function. . .
octanei = ∫ f(ν)ki(ν) dν + εi
Signal regression model fit. . .
I If each row of matrix NIR contains a spectral intensity, and each row of nm the corresponding wavelengths, then gam(octane ~ s(nm,by=NIR)) estimates the model (runnable sketch below).
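I As a sketch, using the gasoline data from the pls package (its wavelength grid runs 900–1700 nm in 2 nm steps); k = 40 is an arbitrary basis dimension choice.
library(mgcv)
library(pls)
data(gasoline)
octane <- gasoline$octane
NIR <- as.matrix(gasoline$NIR)  ## 60 spectra, one per row
## matrix of wavelengths, same shape as NIR, one row per spectrum
nm <- matrix(seq(900, 1700, by = 2), nrow(NIR), ncol(NIR), byrow = TRUE)
b <- gam(octane ~ s(nm, by = NIR, k = 40))  ## the signal regression fit
plot(b)                                ## estimated f(nu)
plot(fitted(b), octane); abline(0, 1)  ## fitted vs. observed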
I Plots of estimated f(ν) and fitted vs. data:
[Figure: estimated s(nm,9.16):NIR against wavelength, and octane against predicted octane.]
And one more dataset. . .
[Figure: pairs plot of retinopathy status (ret), body mass index (bmi), % glycosylated hemoglobin (gly) and disease duration (dur).]
I How does the probability of diabetic retinopathy relate to previous disease duration, body mass index and % glycosylated hemoglobin?
I Wisconsin study (see gss package).
I Model: reti ∼ Bernoulli,
logit{E(ret)} = f1(dur) + f2(bmi) + f3(gly) + f4(dur,bmi) + f5(dur,gly) + f6(gly,bmi)
Retinopathy model estimates
gam(ret ~ s(dur) + s(gly) + s(bmi) + ti(dur,gly) + ti(dur,bmi) + ti(gly,bmi), family=binomial)
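I The same fit as a runnable sketch, assuming the Wisconsin data are the wesdr data frame from the gss package (columns dur, gly, bmi and a 0/1 ret).
library(mgcv)
library(gss)
data(wesdr)
b <- gam(ret ~ s(dur) + s(gly) + s(bmi) + ti(dur, gly) +
           ti(dur, bmi) + ti(gly, bmi),
         family = binomial, data = wesdr)
summary(b)  ## edf and p-values for each term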
[Figure: estimated terms s(dur,3.26), s(gly,1), s(bmi,2.67), te(dur,gly,0), te(dur,bmi,0) and te(gly,bmi,2.5).]
Additive smooth models: structured flexibility
I The preceding are examples of generalized additive models (GAMs).
I A GAM is a GLM in which the linear predictor depends linearly on unknown smooth functions of covariates. . .
yi ∼ EF(µi, φ), g(µi) = f1(x1i) + f2(x2i) + · · ·
I Hastie and Tibshirani (1990) and Wahba (1990) laid the foundations for these models as discussed here.
I If we can work with GAMs at all, we can easily generalize:
g(µi) = Aiθ + ∑j Lij fj(xj) + Zib
— the Lij are linear functionals; A and Z are model matrices; θ and b are parameters and Gaussian random effects.
Additive smooth models & practical computation
I Making GAMs work in practice requires at least 3 things. . .
1. A way of representing the smooth functions fj.
2. A way of estimating the fj.
3. Some means of deciding how smooth the fj should be.
I 3 is the awkward part of the enterprise, and strongly influences 1 and 2.
I As well as point estimates we also need further inferential tools, such as interval estimates, model selection methods, AIC, p-values etc.
I We’d also like to go beyond the univariate exponential family.
I This talk will cover these things in order.
Representing smooth functions: splines
I To motivate how to represent several smooth terms in a model, first consider a simpler smoothing problem.
I Consider estimating the smooth function f in the model yi = f(xi) + εi from xi, yi data using smoothing splines. . .
[Figure: wear against size, with the fitted smoothing spline shown as a red curve.]
I The red curve is the function minimizing
∑i (yi − f(xi))2 + λ ∫ f′′(x)2 dx.
Splines and the smoothing parameter
I Smoothing parameter λ controls the stiffness of the spline.
[Figure: four smoothing spline fits to the wear/size data for different values of λ.]
I But the spline can be written f(x) = ∑i βibi(x), where the basis functions bi(x) do not depend on λ.
General spline theory background
I Consider a Hilbert space of real valued functions, f, on some domain τ (e.g. [0,1]).
I It is a reproducing kernel Hilbert space, H, if evaluation is bounded, i.e. ∃ M s.t. |f(t)| ≤ M‖f‖.
I Then the Riesz representation theorem says that there is a function Rt ∈ H s.t. f(t) = ⟨Rt, f⟩.
I Now consider Rt(u) as a function of t: R(t,u). Then
⟨Rt, Rs⟩ = R(t,s)
— so R(t,s) is known as the reproducing kernel of H.
I Actually, to every positive definite function R(t,s) there corresponds a unique RKHS.
Spline smoothing problem
I RKHSs are quite useful for constructing smooth models. To see why, consider finding f to minimize
∑i {yi − f(ti)}2 + λ ∫ f′′(t)2 dt.
I Let H have ⟨f,g⟩ = ∫ f′′(t)g′′(t) dt.
I Let H0 denote the RKHS of functions for which ∫ f′′(t)2 dt = 0, with basis φ1(t) = 1, φ2(t) = t, say.
I The spline problem then seeks f ∈ H0 ⊕ H to minimize
∑i {yi − f(ti)}2 + λ‖Pf‖2,
where P is the orthogonal projection onto H.
Spline smoothing solution
I f(t) = ∑i ciRti(t) + ∑j djφj(t), with i = 1, . . . , n and j = 1, . . . , M, is the basis representation of f. Why?
I Suppose the minimizer were f̃ = f + η, where η ∈ H and η ⊥ f:
1. η(ti) = ⟨Rti, η⟩ = 0, so the fit to the data is unchanged;
2. ‖Pf̃‖2 = ‖Pf‖2 + ‖η‖2, which is minimized when η = 0.
I . . . obviously this argument is rather general.
I So if Eij = ⟨Rti, Rtj⟩ and Tij = φj(ti), then we seek c and d to minimize
‖y − Td − Ec‖2 + λcTEc
. . . straightforward to compute (but at O(n3) cost).
Reduced rank smoothers
I We can obtain an efficient reduced rank basis (and penalty) by
1. using a spline basis for a ‘representative’ subset of the data, or
2. using Lanczos methods to find a low order ‘eigenbasis’.
I In either case we end up representing the smoother as
f(x) = ∑j βjbj(x)
— basis functions bj(x) known; coefficients β not.
I The corresponding smoothing penalty is βTSβ, where S is known.
I So yi = f(xi) + εi becomes y = Xβ + ε, where Xij = bj(xi) and
β̂ = argminβ ‖y − Xβ‖2 + λβTSβ
(a direct sketch of this appears below).
I Examples follow the asymptotic justification of rank reduction.
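I A minimal sketch of the basis-penalty representation ‘by hand’: mgcv’s smoothCon exposes X and S for a smooth term, and for a fixed λ (arbitrarily chosen here; its estimation comes later in the talk) β̂ is a single linear solve.
library(mgcv)
set.seed(1)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200) * 0.3
sm <- smoothCon(s(x, k = 20), data = data.frame(x = x))[[1]]
X <- sm$X        ## model matrix: X[i,j] = b_j(x_i)
S <- sm$S[[1]]   ## penalty matrix: penalty is beta' S beta
lambda <- 2
beta <- solve(t(X) %*% X + lambda * S, t(X) %*% y)
plot(x, y, col = "grey")
lines(sort(x), (X %*% beta)[order(x)])  ## the fitted smooth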
What rank?
I Consider f̂, a rank k cubic regression spline (i.e. λ = 0), parameterized in terms of function values at evenly spaced ‘knots’.
I From basic regression theory, the average sampling standard error of f̂ is O(√(k/n)).
I From approximation theory for cubic splines, the asymptotic bias of f̂ is bounded by O(k−4).
I So if we let k = O(n1/9) we minimize the asymptotic MSE, at O(n−8/9).
I Actually in practice we would want to penalize, and then k = O(n1/5) is appropriate.
I The point is that asymptotically k ≪ n is appropriate.
P-splines: B-spline basis & approx penalty
[Figure: B-spline basis functions, and three example fits with second order difference penalty ∑i (βi−1 − 2βi + βi+1)2 equal to 207, 16.4 and 0.04.]
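I The two P-spline ingredients are easy to construct directly; a sketch (the knot spacing and basis size here are arbitrary illustrative choices).
library(splines)
x <- seq(0, 1, length = 200)
knots <- seq(-0.15, 1.15, by = 0.05)       ## evenly spaced knots
B <- splineDesign(knots, x)                ## cubic B-spline basis
D <- diff(diag(ncol(B)), differences = 2)  ## each row gives a second difference of beta
S <- t(D) %*% D  ## so beta' S beta = sum_i (beta_{i-1} - 2 beta_i + beta_{i+1})^2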
Example: Thin plate splines
I One way of generalizing splines from 1D to several dimensions is to turn the flexible strip into a flexible sheet: f minimizing e.g.
∑i {yi − f(xi,zi)}2 + λ ∫ fxx2 + 2fxz2 + fzz2 dx dz.
I This results in a thin plate spline. It is an isotropic smooth.
I Isotropy may be appropriate when different covariates are naturally on the same scale.
[Figure: perspective plots of thin plate spline fits (linear predictor over x and z).]
Cheaper thin plate splines
I RKHS or similar theory gives an explicit basis (for any dimension), with coefficients (c and d) minimizing
‖y − Td − Ec‖2 + λcTEc s.t. TTc = 0.
I We can reduce the O(n3) cost to O(k3) by replacing E with its rank k truncated eigen-decomposition (computed by Lanczos methods).
I Cheap and somewhat optimal. . .
[Figure: the truth, a rank 16 thin plate regression spline fit (t.p.r.s. 16 d.f.) and a 16 knot basis fit.]
Non-isotropic tensor product smooths
I A different construction is needed if covariates have different scales.
I Start with a marginal spline fz and let its coefficients vary smoothly with x.
[Figure: the marginal smooth f(z), and its coefficients varying smoothly with x to build up f(x,z).]
The complete tensor product smooth
I Let each coefficient of fz be a spline of x.
I The construction is symmetric (see the figure).
[Figure: the resulting tensor product smooth f(x,z), built up from either margin.]
Tensor product penalties - one per dimension
I x-wiggliness: sum marginal x penalties over the red curves.
I z-wiggliness: sum marginal z penalties over the green curves.
[Figure: f(x,z) with the red and green curves over which the marginal penalties are summed.]
Tensor product expressions
I Suppose the basis expansions for smoothing w.r.t. x and z marginally are ∑j Bjbj(z) and ∑i αiai(x).
I . . . and the marginal penalties are BTSzB and αTSxα.
I The tensor product basis construction gives:
f(x,z) = ∑i ∑j βijbj(z)ai(x)
I With two penalties (requiring 2 smoothing parameters):
J∗z(f) = βT(II ⊗ Sz)β and J∗x(f) = βT(Sx ⊗ IJ)β.
I The construction generalizes to any number of marginals and multi-dimensional marginals (sketch below).
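I In mgcv the isotropic and tensor product constructions are s(x,z) and te(x,z) respectively; a small illustrative sketch on simulated data (names hypothetical).
library(mgcv)
set.seed(2)
dat <- data.frame(x = runif(300), z = runif(300))
dat$y <- sin(3 * dat$x) * exp(-dat$z) + rnorm(300) * 0.1
b_iso <- gam(y ~ s(x, z), data = dat)   ## isotropic thin plate spline
b_ten <- gam(y ~ te(x, z), data = dat)  ## tensor product: one smoothing parameter per margin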
Isotropic vs. tensor product comparison
[Figure: Isotropic Thin Plate Spline and Tensor Product Spline fits, top and bottom rows.]
. . . each figure smooths the same data. The only modification is that x has been divided by 5 in the bottom row.
Smooth model components
I A rich range of basis-penalty smoothers is possible. . .
[Figure: six example smooths (a–f), including s(times,10.9), a 2-D smooth f(x,z) with its contours, a random effect smooth s(region,63.04), and spatial smooths over lon and lat.]
I When combining several smooths in one model, we often need an identifiability constraint ∑i f(xi) = 0. Re-write this as 1TXβ = 0 and absorb it by reparameterization, as in the sketch below.
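I A minimal sketch of absorbing the constraint (toy X, illustrative names; mgcv does this internally): find a matrix Z whose columns span the null space of C = 1TX, then work with XZ, since any β = Zb then satisfies the constraint.
set.seed(3)
X <- matrix(rnorm(50), 10, 5)  ## toy basis model matrix
C <- matrix(colSums(X), 1)     ## the constraint C beta = 0, C = 1'X
Z <- qr.Q(qr(t(C)), complete = TRUE)[, -1, drop = FALSE]  ## null space basis
XZ <- X %*% Z                  ## constrained model matrix
max(abs(colSums(XZ)))          ## ~0: the constraint holds for any b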
Representing a GAM
I Consider the GAM yi ∼ EF(µi, φ), g(µi) = β0 + ∑j fj(xji).
I Each fj(x) has a basis-penalty representation, say Xjβj, βjTSjβj (with constraints absorbed).
I So the GAM becomes
g(µ) = Xβ,
where X = [1 : X1 : X2 : . . .] and βT = [β0, β1T, β2T, . . .].
I The penalty is now
∑j λjβTSjβ,
where Sj is such that βTSjβ ≡ βjTSjβj (i.e. the original penalty matrix padded with zeroes).
I We can easily include parametric components and linear functionals of smooths.
β̂ given λ
I Let l be the model log-likelihood and use penalized MLE:
L(β) = l(β) − (1/2)∑j λjβTSjβ, β̂ = argmaxβ L(β).
I Optimize by Newton’s method (penalized IRLS in the EF case; a sketch follows below).
I Bayes motivation. Prior: π(β) = N(0, (∑λjSj)−); likelihood, π(y|β): as is. Then the MAP estimate is β̂.
I Further, in the large sample limit, if I is the information matrix at β̂,
β|y ∼ N(β̂, (I + ∑λjSj)−1),
which can be used to construct well calibrated CIs.
I Note the link to Gaussian random effects!
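I A penalized IRLS sketch for a Poisson model with log link and a fixed λ (purely illustrative; mgcv’s internals differ):
library(mgcv)
set.seed(4)
x <- runif(300)
y <- rpois(300, exp(sin(2 * pi * x)))
sm <- smoothCon(s(x, k = 15), data = data.frame(x = x), absorb.cons = TRUE)[[1]]
X <- cbind(1, sm$X)                 ## intercept plus smooth basis
S <- rbind(0, cbind(0, sm$S[[1]]))  ## pad penalty with zeroes for the intercept
lambda <- 1
beta <- rep(0, ncol(X))
repeat {
  eta <- drop(X %*% beta); mu <- exp(eta)
  z <- eta + (y - mu) / mu          ## working response
  w <- mu                           ## working weights (Poisson, log link)
  beta_new <- solve(t(X) %*% (w * X) + lambda * S, t(X) %*% (w * z))
  if (max(abs(beta_new - beta)) < 1e-8) { beta <- beta_new; break }
  beta <- beta_new
}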
AIC, effective degrees of freedom, p-values
I Following through the derivation of AIC in the penalized case yields
AIC = −2l(β̂) + 2tr(F),
where F = (I + ∑λjSj)−1I.
I Better still, take F = VβI, where Vβ is an approximate posterior covariance matrix for β, corrected for λ uncertainty (see Greven and Kneib, 2010, Biometrika, for why).
I tr(F) is the effective degrees of freedom of the model.
I p-values for testing smooth terms or variance components for equality to zero can also be obtained (see Wood, 2013a,b, Biometrika).
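I In mgcv these are available through the usual accessors; assuming a fitted gam object b, as in the earlier sketches:
AIC(b)      ## AIC based on the effective degrees of freedom
sum(b$edf)  ## total effective degrees of freedom
summary(b)  ## per-term edf and p-values for the smooth terms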
I But we still need smoothing parameter (λ) estimates.
Smoothing parameter estimates
I Let ρ = logλ and π(ρ) = constant; then the marginal likelihood is
π(ρ|y) ∝ ∫ π(y|β)π(β|ρ) dβ.
I Empirical Bayes: ρ̂ = argmaxρ π(ρ|y).
I . . . the same as REML in the Gaussian random effects context.
I In practice use a Laplace approximation for the integral:
log π(ρ|y) ≃ L(β̂) + (1/2) log|∑λjSj|+ − (1/2) log|H| + c,
where H is the Hessian of −L at β̂. The log determinants require numerical care.
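I In mgcv this is requested via method = "REML"; a small sketch on simulated data (illustrative names):
library(mgcv)
set.seed(5)
x <- runif(200)
y <- sin(2 * pi * x) + rnorm(200) * 0.3
b <- gam(y ~ s(x), method = "REML")  ## or method = "ML"
b$sp  ## the estimated smoothing parameter(s)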
How marginal likelihood works
[Figure: draws from the prior implied by λ, plotted against the data, for three cases: ‘λ too low, prior variance too high’, ‘λ and prior variance about right’ and ‘λ too high, prior variance too low’.]
I Draw β from the prior implied by λ. Find the average value of the likelihood for these draws.
I Choose λ to maximize this average likelihood.
I i.e. formally, maximize ∫ π(y|β)π(β|λ) dβ w.r.t. λ.
I Cross validation is an alternative.
Numerical fitting strategy
I Optimize REML w.r.t. ρ = log(λ) by Newton’s method. For each trial ρ. . .
1. Re-parameterize β so that the log|S|+ computation is stable.
2. Find β̂ by an inner Newton optimization of L(β).
3. Use implicit differentiation to find the first two derivatives of β̂ w.r.t. ρ.
4. Compute the log determinant terms and their derivatives.
5. Hence compute the REML score and its first two derivatives, as required for the next Newton step.
I For large datasets an alternative is often possible: find β̂ by iterative fitting of working linear models, and estimate ρ for the working model at each iteration step (see the sketch below).
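I mgcv’s bam() function is built around related large-data strategies (its exact algorithm differs from the sketch above); for example:
library(mgcv)
set.seed(6)
n <- 1e5
dat <- data.frame(x = runif(n))
dat$y <- sin(2 * pi * dat$x) + rnorm(n)
b <- bam(y ~ s(x), data = dat, discrete = TRUE)  ## discretized covariates for speed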
Where’s the exponential family assumption?
I . . . the single parameter independent EF assumption of GAMs is barely used.
I It just simplifies some of the numerical computations (and adds to numerical robustness).
I Actually we can use most of the apparatus just described with almost any regular likelihood, provided it’s practical to compute with, and the first three or four derivatives w.r.t. the model coefficients can also be computed. . .
Example: predicting prostate status
[Figure: protein mass spectra (Intensity against Daltons) for Healthy, Enlarged and Cancer groups.]
I Model: ordered category (benign/enlarged/cancer) predicted by a logistic latent random variable with mean
µi = ∫ f(D)νi(D) dD, where νi(D) is the ith spectrum.
gam(ds ~ s(D,by=K), family=ocat(R=3))
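I A self-contained ocat sketch on simulated data (the prostate spectra themselves are not reproduced here): a smooth of x shifts a logistic latent variable, which is cut into 3 ordered categories.
library(mgcv)
set.seed(7)
x <- runif(500)
latent <- 3 * sin(2 * pi * x) + rlogis(500)
ds <- cut(latent, c(-Inf, -1, 1, Inf), labels = FALSE)  ## classes 1, 2, 3
b <- gam(ds ~ s(x), family = ocat(R = 3))
plot(b)  ## the estimated smooth effect on the latent scale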
[Figure: (a) estimated f(D) against Daltons; (b) results by ordered category; (c) QQ plot of deviance residuals against theoretical quantiles.]
Fuel efficiency of cars
[Figure: pairs plot of city.mpg, hw.mpg, width, height, weight and hp for the cars data.]
Multivariate additive model
I Correlated bivariate normal response (hw.mpg, city.mpg).
I Component means are given by smooth additive predictors. The best model is very simple (and somewhat unexpected); a sketch of such a fit follows.
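I A sketch of a bivariate additive fit, assuming the cars data are available as mpg in the gamair package (an assumption) with columns hw.mpg, city.mpg, weight and hp; mgcv’s mvn family takes one formula per response dimension.
library(mgcv)
library(gamair)
data(mpg)
b <- gam(list(hw.mpg ~ s(weight) + s(hp),
              city.mpg ~ s(weight) + s(hp)),
         family = mvn(d = 2), data = mpg)
plot(b, pages = 1)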
[Figure: estimated terms for the two linear predictors: (a) s(weight,6.71), (b) s(hp,4.1), (d) s.1(weight,5.71), (e) s.1(hp,5.69), plus (c) and (f) Gaussian quantile plots of effects.]
Scale location models
I We can additively model the mean and the variance (and the skew, and. . . )
I Simple example: yi ∼ N(µi, σi),
µi = ∑j fj(xji), logσi = ∑j gj(zji).
I Here is a simple 1-D smoothing example of this, with a sketch of the fitting call below. . .
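I A sketch using mgcv’s gaulss family and the mcycle data from MASS (assumed here to be the plotted example): the first formula models the mean, the second the scale.
library(mgcv)
library(MASS)
b <- gam(list(accel ~ s(times, k = 40), ~ s(times)),
         data = mcycle, family = gaulss())
plot(b, pages = 1)  ## estimated mean smooth and (log) scale smooth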
[Figure: accel against times with the fitted mean, and the estimated σ against times.]
R packages
There are many alternative R packages available:
1. gam for the original backfitting approach¹.
2. vgam for vector GAMs and more¹.
3. gamlss for GAMs for location, scale and shape².
4. mboost for GAMs via boosting.
5. gss for Smoothing Spline ANOVA.
6. scam for shape constrained additive models.
7. gamm4 for GAMMs using lme4.
8. bayesx for MCMC and likelihood based GAMs³.
9. The methods discussed here are in the R recommended package mgcv. . .

¹ No smoothing parameter selection.
² Limited smoothing parameter selection.
³ See also mgcv::jagam.
mgcv package in R