semi possible model semiparametric modeling, penalized ... · semi 11 slide 21 selecting the number...

Semi 1

Slide 1

Semiparametric Modeling, PenalizedSplines, and Mixed Models

David Ruppert

Cornell University

http://www.orie.cornell.edu/~davidr

January 2004

Joint work with Babette Brumback, Ray Carroll, Brent

Coull, Ciprian Crainiceanu, Matt Wand, Yan Yu, and

others

Slide 2

Example (data from Hastie and James, this

analysis in RWC)

age (years)

spin

al b

one

min

eral

den

sity

10 15 20 25

0.6

0.8

1.0

1.2

1.4

Semi 2

Slide 3

Possible Model

SBMDi,j is spinal bone mineral density on ith subject at

age equal to agei,j.

SBMDi,j = Ui + m(agei,j) + εi,j,

i = 1, . . . , m = 230, j = i, . . . , ni.

Ui is the random intercept for subject i.

{Ui} are assumed i.i.d. N(0, σ2U).

Slide 4

Underlying philosophy

1. minimalist statistics

• keep it as simple as possible

2. build on classical parametric statistics

3. modular methodology

Semi 3

Slide 5

Reference

Semiparametric Regression by Ruppert, Wand, and

Carroll (2003)

• Lots of examples from biostatistics.

Slide 6

Recent Example — April 17, 2003

Canfield et al. (2003) — Intellectual impairment

and blood lead.

• longitudinal (mixed model)

• nine covariates (modelled linearly)

• effect of lead modelled as a spline (semiparametric

model)

– disturbing conclusion

Semi 4

Slide 7

0 5 10 15 20 25 30 3560

70

80

90

100

110

120

130

lead (microgram/deciliter)

IQ

Quadratic

Spline

Thanks to Rich Canfield for data and estimates.

Slide 8

Semiparametric regression

Partial linear or partial spline model:

Yi = WTi βββW + m(Xi) + εi.

m(x) = XTi βββX + BT(x)b.

BT(x) = ( B1(x) · · · BK(x) ) .

E.g.,

XTi = ( Xi · · · Xp

i )

BT(x) = { (x− κ1)p+ · · · (x− κK)p

+ }

Semi 5

Slide 9

Example

m(x) = β0 + β1x + b1(x− κ1)+ + · · ·+ bK(x− κK)+

• slope jumps by bk at κk

Slide 10

Linear “plus” function

0 0.5 1 1.5 2 2.5 3

0

0.2

0.4

0.6

0.8

1

1.2

1.4

1.6

1.8

2 plus fn.derivative

Semi 6

Slide 11

Fitting LIDAR data with plus functions

range

log

ratio

400 500 600 700

-1.0

-0.8

-0.6

-0.4

-0.2

0.0

Slide 12

Generalization

m(x) = β0+β1x+· · ·+βpxp+b1(x−κ1)

p++· · ·+bK(x−κK)p

+

• pth derivative jumps by p! bk at κk

• first p− 1 derivatives are continuous

Semi 7

Slide 13

Quadratic “plus” function

0 0.5 1 1.5 2 2.5 30

0.5

1

1.5

2

2.5

3

3.5

4 plus fn.derivative2nd derivative

Slide 14

Ordinary Least Squares

400 600−1

−0.8

−0.6

−0.4

−0.2

0

Raw Data

400 600−1

−0.8

−0.6

−0.4

−0.2

0

2 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

3 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

5 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

10 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

20 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

50 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

100 knots

Semi 8

Slide 15

Penalized least-squares

Minimizen∑

i=1

{Y − (WT

i βββW + XTi βββX + BT(Xi)b)

}2+ λbTDb.

E.g.,

D = I.

Slide 16

Penalized Least Squares

400 600−1

−0.8

−0.6

−0.4

−0.2

0

Raw Data

400 600−1

−0.8

−0.6

−0.4

−0.2

0

2 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

3 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

5 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

10 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

20 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

50 knots

400 600−1

−0.8

−0.6

−0.4

−0.2

0

100 knots

Semi 9

Slide 17

Ridge Regression

From previous slide:

n∑i=1

{Y − (WT

i βββW + XTi βββX + BT(Xi)b)

}2+ λbTDb.

Let X have row (WTi XT

i BT(Xi) ). Then

βββW

βββX

b

=

{X TX + λ blockdiag(0,0,D)}−1X TY.

• Also, a BLUP in a mixed model and an empirical

Bayes estimator.

Slide 18

Linear Mixed Models

Y = Xβββ + Zb + ε

where b is N(0, σ2bΣΣΣb).

Xβββ are the “fixed effects” and Zb are the “random

effects.”

Henderson’s equations.

(βββ

b

)=

(XTX XTZ

ZTX ZTZ + λΣΣΣ−1b

)−1 (XTY

ZTY

).

λ =σ2

ε

σ2b

.

Semi 10

Slide 19

From previous slides:

Let X have row (WTi XT

i BT(Xi) ). Then

βββW

βββX

b

=

{X TX + λ blockdiag(0,0,D)}−1X TY.

Linear mixed model:(

βββ

b

)=

(XTX XTZ

ZTX ZTZ + λΣΣΣ−1b

)−1 (XTY

ZTY

)

={

(X Z )T (X Z ) + λ blockdiag(0, ΣΣΣ−1b )

}−1

(X Z )T Y

Slide 20

Selecting λ

1. cross-validation (CV)

2. generalized cross-validation (GCV)

3. ML or REML in mixed model framework

Semi 11

Slide 21

Selecting the Number of Knots

0 0.2 0.4 0.6 0.8 1−1

−0.5

0

0.5

1

1.5(a) SpaHet, j = 3, typical data set

y

Truefull−search

5 20 40 80 12095

100

105

110

115

K

rela

tive

MA

SE

(b) MASE comparisons

fixed nknotsmyopicfull−search

1 2 3 4 5 60

50

100

150

number of knots (coded)

freq

uenc

y

0 0.0125 0.0250

0.0125

0.025

ASE − K=5

AS

E −

K=

40

n = 200

Slide 22

0 0.2 0.4 0.6 0.8 1−0.5

0

0.5(a) SpaHetLS, j = 3, n = 2,000

y

Truefull−search

5 20 40 80 12095

100

105

110

115

K

rela

tive

MA

SE

(b) MASE comparisons

fixed nknotsmyopicfull−search

1 2 3 4 5 60

50

100

150

200

250

number of knots (coded)

freq

uenc

y

0 0.5 1 1.5

x 10−3

0

0.5

1

1.5x 10

−3

ASE − K=5

AS

E −

K=

40

n = 2, 000

Semi 12

Slide 23

0 5 10 15 20 250

1

2

x 10−4

dffit

(λ)

MS

E

MSE

Bias

Variance

Optimal

n = 10, 000, 20 knots, quadratic spline

Slide 24

Return to spinal bone mineral density study

age (years)

spin

al b

one

min

eral

den

sity

10 15 20 25

0.6

0.8

1.0

1.2

1.4

SBMDi,j = Ui + m(agei,j) + εi,j,

i = 1, . . . , m = 230, j = i, . . . , ni.

Semi 13

Slide 25 X =

1 age11...

...

1 age1n1

......

1 agem1...

...

1 agemnm

Slide 26 Z =

1 · · · 0 (age11 − κ1)+ · · · (age11 − κK)+

.... . .

......

. . ....

1 · · · 0 (age1n1− κ1)+ · · · (age1n1

− κK)+

......

......

. . ....

0 · · · 1 (agem1 − κ1)+ · · · (agem1 − κK)+

.... . .

......

. . ....

0 · · · 1 (agemnm− κ1)+ · · · (agemnm

− κK)+

Semi 14

Slide 27 u =

U1

...

Um

b1

...

bK

Slide 28

age (years)

spin

al b

one

min

eral

den

sity

10 15 20 25

0.6

0.8

1.0

Variability bars on m and estimated density of Ui

Semi 15

Slide 29

Broken down by ethnicity

0.6

0.8

1.0

1.2

1.4Asian

10 15 20 25

Black

Hispanic

0.6

0.8

1.0

1.2

1.4White

10 15 20 25

age (years)

spin

al b

one

min

eral

den

sity

Slide 30

Model with ethnicity effects

SBMDij = Ui + m(ageij) + β1blacki + β2hispanici

+β3whitei + εij, 1 ≤ j ≤ ni, 1 ≤ i ≤ m.

Asian is the reference group.

Semi 16

Slide 31

Only requires an expansion of the fixed effects by adding

the columns

black1 hispanic1 white1

......

...

black1 hispanic1 white1

......

...

blackm hispanicm whitem

......

...

blackm hispanicm whitem

Slide 32co

ntra

st w

ith A

sian

sub

ject

s0.

00.

050.

100.

15

Black Hispanic White

Semi 17

Slide 33

• In this model, the age effects curve for the four ethnic

groups are parallel.

• Could we model them as non-parallel?

• Might be problematic in this example because of the

small values of the ni.

• But the methodology should be useful in other

contexts.

Slide 34

• Add interactions between age and black, hispanic,

and white.

– These are fixed effects.

• Then add interactions between black, hispanic,

white, and asian and the linear plus functions in

age.

– These are mean-zero random effects with their own

variance component

– This variance component control the amount of

shrinkage of the enthicity-specific curves to the

overall effect.

Semi 18

Slide 35

Penalized Splines and Additive Models

Additive model:

Yi = m1(X1,i) + . . . + mP (XP,i) + εi

Slide 36

Bivariate additive spline model

Yi = β0+βx,1Xi+ bx,1(Xi−κx,1)++ · · ·+ bx,K(Xi−κx,Kx)+

+ βz,1Zi + bz,1(Zi − κz,1)+ + · · ·+ bz,K(Zi − κz,Kz)+ + εi

• no need for backfitting

• computation very rapid

• no identifiability issues

• inference is simple

Semi 19

Slide 37

Bayesian methods

The linear mixed model is half-Bayesian.

• The random effects have a prior.

• The parameters without a prior are:

– fixed effects

∗ give them diffuse normal priors

– variance components

∗ give them diffuse inverse gamma priors

Slide 38

Bayesian methods

Can be easily implemented in WinBUGS or programmed

in, say, MATLAB.

Allows Bayes rather than empirical Bayes inference.

• Uncertainty due to smoothing parameter selection is

taken into account.

Semi 20

Slide 39

The Bias-Variance Trade-off and Confidence Bands

.............................

...................

..........................................................................

............................

.

...

.

.......

..

.

.

.

.

.....

.

.

.

...

.

.

............

.

.

..

.

.

.

.

.

...

.

.

..

.

.

.

.

.

.

...

.

.

.

lambda=0

range

log

ratio

400 500 600 700

-0.8

-0.4

0.0

.............................

...................

..........................................................................

............................

.

...

.

.......

..

.

.

.

.

.....

.

.

.

...

.

.

............

.

.

..

.

.

.

.

.

...

.

.

..

.

.

.

.

.

.

...

.

.

.

lambda=10

range

log

ratio

400 500 600 700

-0.8

-0.4

0.0

.............................

...................

..........................................................................

............................

.

...

.

.......

..

.

.

.

.

.....

.

.

.

...

.

.

............

.

.

..

.

.

.

.

.

...

.

.

..

.

.

.

.

.

.

...

.

.

.

lambda=30

range

log

ratio

400 500 600 700

-0.8

-0.4

0.0

.............................

...................

..........................................................................

............................

.

...

.

.......

..

.

.

.

.

.....

.

.

.

...

.

.

............

.

.

..

.

.

.

.

.

...

.

.

..

.

.

.

.

.

.

...

.

.

.

lambda=1000

range

log

ratio

400 500 600 700

-0.8

-0.4

0.0

Slide 40

How does one adjust confidence intervals for bias?

• undersmooth — so variance dominates and bias can

be safetly ignored.

Semi 21

Slide 41

10−6

10−5

10−4

10−3

10−2

0

0.5

1

1.5

2

2.5

3

3.5

4

4.5

x 10−4

log(λ)

MS

E

MSE

Bias2 Variance

n=10,00020 knotsσ=.3

optimal

Slide 42

Adjustment for bias continued

• estimate bias by a higher order method and subtract

off bias (essentially the same as above)

• Wahba/Nychka Bayesian intervals

– bias is random so adds to posterior variance

– interval is widened but there is no “offset”.

Semi 22

Slide 43

Wahba/Nychka Bayesian Intervals

y = Xβββ + Zu + ε, Cov

[u

ε

]=

[σ2

uI 0

0 σ2εI

],

C = (X Z )

βββ and u are BLUPs.

Slide 44

Cov

([βββ

u

] ∣∣∣u)

= σ2ε(C

TC+ σ2ε

σ2uD)−1CTC(CTC+ σ2

ε

σ2uD)−1

(Frequentist variance. Ignores bias)

Cov

([βββ

u− u

])= σ2

ε(CTC + σ2

ε

σ2uD)−1.

(Bayesian posterior variance. Takes bias into account.)

Semi 23

Slide 45

age (million years)

stro

ntiu

m r

atio

95 100 105 110 115 120

0.70

720

0.70

725

0.70

730

0.70

735

0.70

740

0.70

745

0.70

750

Slide 46

Effect of measurement error

−4 −3 −2 −1 0 1 2 3 4 5−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x plus error

y

W = X + error and Var(X) = Var(error).

Semi 24

Slide 47

Correction for measurement error

Relatively little research in this area.

• Fan and Truong (1993): deconvolution kernels

– first work

– inefficient in finite-sample studies

– no inference

– strictly for 1-dimensional smoothing

• Carroll, Maca, Ruppert

– functional SIMEX methods and structural spline

methods

– more efficient than Fan and Truong

Slide 48

• Berry, Carroll, and Ruppert (JASA, 2002)

– fully Bayesian

– smoothing or penalized splines

– rather efficient in finite-sample studies

– inference available

– scales up — semiparametric inference is easy

– structural

Semi 25

Slide 49

Berry, Carroll, and Ruppert

• starts with mixed-model spline formulation

– but fully Bayesian

• conjugate priors

• true covariates are i.i.d. normal

– but surprisingly robust

• normal measurement error

• in Gibbs, only sampling of true (unknown) covariates

requires a Hastings-Metropolis step

Slide 50

Effect of measurement error

−4 −3 −2 −1 0 1 2 3 4 5−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

x plus error

y

W = X + error and Var(X) = Var(error).

Semi 26

Slide 51

Correction for measurement error

−4 −3 −2 −1 0 1 2 3 4

−1

−0.8

−0.6

−0.4

−0.2

0

0.2

0.4

0.6

0.8

1

Solid: true. Dotted: uncorrected. Dashed: corrected.

Slide 52

Measurement Error, continued

Ganguli, Staudenmayer, Wand:

• EM maximum likelihood estimation in BCR model.

• Works about as well as the fully Bayesian approach.

• Extension to additive models.

Semi 27

Slide 53

Generalized Regression

• Extension to non-Gaussian responses is conceptually

easy.

• Get a GLLM.

– However, GLIM’s are not trivial. Can use:

∗ Monte Carlo EM

∗ Or MCMC

Slide 54

Single-Index Models

Yi = g(XTi θθθ) + ZT

i βββ + εi.

Yu and Ruppert (2002, JASA).

Let

g(x) = γ0 + γ1x + · · ·+ γpxp

+c1(x− κ1)p+ + · · ·+ cK(x− κK)p

+.

Becomes a nonlinear regression model

Yi = m(Xi,Zi, θθθ, βββ, γγγ, c) + εi.

semi possible model semiparametric modeling, penalized ... · semi 11 slide 21 selecting the number...

Documents